More on interoperability and OpenStack

There have been a number of useful discussions on interoperability issues here at the OpenStack Summit, including a panel discussion on Tuesday afternoon. It is, of course, a complicated question, with various dimensions: what do we mean by interoperability; how do we assess (or even quantify) it; who does it apply to; what are reasonable expectations; what should we do about it…. This is worthy of a really lengthy essay, but for now I’m just going to jot down a few ideas that seem important.

  • Interoperability is fundamentally about switching or adoption costs: the greater the degree of interoperability between two systems, the smaller the costs of switching from one to the other, or making them work together in some way.
  • Costs are measured relative to expectations, not absolutes. If I switch from an x86-based OpenStack cloud to one built on ARM servers, I’m going to expect some substantial costs, but I can hope that the user experience will be largely the same.
  • Interoperability applies to both service consumers and service providers. Users may be interested in moving a workload from the Rackspace Public Cloud to a Piston private cloud; XYZ Telco may want to switch their code base from Nebula to Cloudscaling.
  • Issues of interoperability are tied to ideas about branding. What is expected (or required) of an IaaS service or distribution that claims to be “OpenStack”? The community has gone back and forth over the years about whether conformance should be based on code or APIs. Today, the requirement is still that you are “running Nova and Swift” (the code), which is naturally unacceptable to the growing number of users who have deployed API-compatible alternatives to Swift.
  • There is a plan to create an OpenStack conformance testing system called RefStack. The (simplified) idea is that a service provider could submit the publicly-accessible end-point for an OpenStack deployment, and the RefStack system would perform a series of tests and come up with a compliance scorecard (not a pass/fail). It would be purely voluntary, but as several people on the panel pointed out, consumer pressure would probably lead to general adoption.
  • For RefStack to provide meaningful results, it seems to me that there needs to be an actual reference: an actual OpenStack deployment which scores 100% on RefStack. Specifications are always somewhat ambiguous, and we need to be able to say, “If the spec is unclear, or we disagree about what this means, the correct behavior is what THIS system actually does.” (This is straight out of the JCP.) Those who believe that everyone should run exactly the same code will argue that this is unnecessary, but they’re wrong: the operational semantics of an OpenStack system will always depend on the behavior of elements – hardware, code and configuration – which are beyond the scope of the OpenStack community.
  • The RefStack approach clearly shifts the focus from conformance based on shared code to conformance based on correct API semantics. This is as it should be. OpenStack is moving from being a collective experiment in developing a complex, open-source distributed system to becoming a mainstream component of the IT world. This can only happen if we recognize that most of the stakeholders are going to be outside the OpenStack developer community. The governance required for the code, created by a few hundred individuals, is going to be different from that needed for APIs consumed by tens of thousands of users.
  • This shift in governance is tied up with the issue that I raised in my blog piece earlier this week. Today, it is too easy for a low-level implementation choice to introduce an incompatible change to a user-facing API. The challenge for the OpenStack leadership is to figure out how to provide stability and predictability for the users of OpenStack without stifling the work of the implementors. This is a good problem to have.
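To make the scorecard idea from the RefStack bullets a little more concrete, here is a minimal sketch in Python. It is not RefStack's actual design: the `Scorecard` type, the `assess` function, the capability names and the endpoint URL are all hypothetical, and the checks are stand-ins for real API requests against a deployment.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Scorecard:
    """Per-capability results for one endpoint (hypothetical structure)."""
    endpoint: str
    results: Dict[str, bool] = field(default_factory=dict)

    @property
    def score(self) -> float:
        """Fraction of checks that passed, between 0.0 and 1.0."""
        if not self.results:
            return 0.0
        return sum(self.results.values()) / len(self.results)


def assess(endpoint: str, checks: Dict[str, Callable[[str], bool]]) -> Scorecard:
    """Run every check against the endpoint and record each outcome.

    A check that raises is recorded as a failure instead of aborting the run,
    so the result is a graded scorecard rather than one pass/fail verdict.
    """
    card = Scorecard(endpoint)
    for name, check in checks.items():
        try:
            card.results[name] = bool(check(endpoint))
        except Exception:
            card.results[name] = False
    return card


def _broken(endpoint: str) -> bool:
    """Stand-in for a check whose API call blows up."""
    raise RuntimeError("simulated failure")


# Hypothetical capability names; real checks would call the endpoint's APIs.
example_checks = {
    "compute.boot_server": lambda endpoint: True,
    "object.put_get": lambda endpoint: False,
    "network.create_port": _broken,
    "identity.issue_token": lambda endpoint: True,
}

card = assess("https://cloud.example.test/v2", example_checks)
print(f"{card.endpoint}: {card.score:.0%}")
for name, passed in card.results.items():
    print(f"  {name}: {'ok' if passed else 'FAILED'}")
```

Keeping the per-capability results, and not just the aggregate number, is what lets a consumer see which parts of the API a given deployment actually honors.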

That’s enough for now – I have to get over to the last day of the Summit for the session on comparing OpenStack and EC2 network architectures.

OpenStack, service contracts, and interoperability

We’re going into the second day of the OpenStack Summit here in Portland, and for me the most important item on the agenda is the last session of the day: a panel on interoperability. Several leading members of the OpenStack community have expressed their concerns about the risk of divergence between different OpenStack-based services. Josh McKenty of Piston has been very vocal on the topic, and is proposing the idea of a RefStack which could be used for compliance.

To those of us who worked on Java in the early days, this feels very familiar. Welcome to the JCP.

While RefStack may be a necessary part of the solution to the interoperability problem, I am convinced that it isn’t sufficient. The biggest concern that I have is about the OpenStack design process. Everybody seems to be busily designing individual features and mechanisms, but nobody appears to be responsible for the specification of the OpenStack service itself. And that’s a problem.

There was a great example of this yesterday: a minor discussion in the networking part of the Design Summit. The details are relatively unimportant: a proposal for a mechanism to allow Quantum to provide some network hop cost information to the Nova scheduler. (It was a pretty bad design, IMHO.) But what was interesting was the use case which the authors provided to motivate the discussion: how to optimize the placement of elements of a scale-out three-tier application. In order to exploit the proposed mechanism, some process would have to compute the cost (measured in hops, latency, or whatever) of every configuration of available resources that matched the application graph. The problem is immediately obvious: who could carry out this process? The Nova Scheduler can’t, because it has no way of predicting the topology that will result from a sequence of VM instantiations. The IaaS user can’t, because she doesn’t have access to the full list of Nova resources. Of course a declarative model (like that provided in the vCloud API) would do the trick, but OpenStack doesn’t offer such a service.

So in many ways this was a pretty futile discussion, and for me the main reason was simple: there were no service-level requirements. There was a free-floating use case, unanchored to any external API or user-level abstraction.

Now I’m not saying that this is an unimportant problem. It isn’t. It is certainly useful to allow an OpenStack user to provide a hint that certain instances will be communicating intensively, and that the scheduler should try to place them in such a way as to minimize latency. (And we have a plausible example of a service-level abstraction that captures this: the EC2 “Placement Group” concept.) We also have the case where the user requires that instances are placed on different servers or racks, to provide increased availability. What should the right abstractions look like? How do we represent this in the APIs – both syntax and semantics? What kind of SLAs are appropriate? And are there any constraints or dependencies associated with this aspect of the service? All of this would seem to be necessary before diving into a discussion of intra-system mechanisms.
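As one possible shape for such a user-level abstraction, here is a small Python sketch. Everything in it is an assumption made for illustration: the `PlacementGroup` type, the `Policy` names and the `violations` helper are hypothetical and are not part of the OpenStack or EC2 APIs. The point is only that the user states intent (keep these together, keep those apart) and never sees topology or hop counts.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, FrozenSet, List


class Policy(Enum):
    COLOCATE = "colocate"  # keep members close together, e.g. for low latency
    SPREAD = "spread"      # keep members apart, e.g. for availability


@dataclass(frozen=True)
class PlacementGroup:
    """A user-supplied hint; says nothing about how the scheduler honors it."""
    name: str
    policy: Policy
    members: FrozenSet[str]


def violations(groups: List[PlacementGroup], placement: Dict[str, str]) -> List[str]:
    """Return the names of groups that a given placement fails to satisfy.

    `placement` maps instance name to rack name (a hypothetical failure domain).
    """
    failed = []
    for group in groups:
        racks = [placement[member] for member in sorted(group.members)]
        distinct = len(set(racks))
        if group.policy is Policy.COLOCATE and distinct > 1:
            failed.append(group.name)
        elif group.policy is Policy.SPREAD and distinct < len(racks):
            failed.append(group.name)
    return failed


groups = [
    PlacementGroup("web-tier", Policy.SPREAD, frozenset({"web1", "web2"})),
    PlacementGroup("db-cache", Policy.COLOCATE, frozenset({"db1", "cache1"})),
]
bad = {"web1": "rack-a", "web2": "rack-a", "db1": "rack-b", "cache1": "rack-b"}
good = {"web1": "rack-a", "web2": "rack-c", "db1": "rack-b", "cache1": "rack-b"}

print(violations(groups, bad))
print(violations(groups, good))
```

A contract like this leaves the SLA question open (is a hint best-effort or a hard constraint?), which is exactly the kind of thing that would need to be settled at the service level first.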

There is always a risk when building a complex system that the external interface will emerge as an almost accidental side-effect of the implementation process. For isolated implementations, this may be OK – the client can meet the service half-way. For durable platforms, it is almost always a very bad thing. If the API is based on the implementation, changes to the implementation usually lead to changes in the API. There is always the risk that the system winds up being usable only by people like the implementor, which simply doesn’t scale.

As I’ve written before, I’m a big fan of API-first design. Specify the requirements for a system. Develop an API that captures the service contract. Test that API: ask potential users of the system to work through how they would code to it – BEFORE YOU WRITE A SINGLE LINE OF IMPLEMENTATION CODE. Think about the variety of possible implementations – technology, scale, private vs. multi-tenant, billed vs. free. Then implement. The result is likely to be a more stable service contract, with fewer incompatible changes from release to release, and with greater interoperability between deployments.
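A rough Python sketch of what that ordering can look like in practice: the contract is written down first, then the code a prospective user would write against it, and only then a throwaway fake to exercise both. The `VolumeService` contract and all of its names are hypothetical, chosen purely for illustration.

```python
from typing import Dict, Protocol


class VolumeService(Protocol):
    """Hypothetical service contract, written before any implementation."""

    def create(self, size_gb: int) -> str:
        """Return the ID of a new volume; size_gb must be positive."""
        ...

    def size_of(self, volume_id: str) -> int:
        """Return the size in GB; raise KeyError for an unknown ID."""
        ...


def provision_pair(service: VolumeService, size_gb: int) -> int:
    """Client code a prospective user might write against the contract."""
    first = service.create(size_gb)
    second = service.create(size_gb)
    return service.size_of(first) + service.size_of(second)


class InMemoryVolumes:
    """Throwaway test double used only to exercise the contract."""

    def __init__(self) -> None:
        self._sizes: Dict[str, int] = {}

    def create(self, size_gb: int) -> str:
        if size_gb <= 0:
            raise ValueError("size_gb must be positive")
        volume_id = f"vol-{len(self._sizes) + 1}"
        self._sizes[volume_id] = size_gb
        return volume_id

    def size_of(self, volume_id: str) -> int:
        return self._sizes[volume_id]


total = provision_pair(InMemoryVolumes(), 10)
print(total)
```

Because `provision_pair` depends only on the contract, it should keep working whether the eventual implementation is a single-node prototype or a multi-tenant service.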

First blogroll, now feeds

After grinding through my OPML file to build an up-to-date Blogroll of cloud authorities, I’ve decided to deal with the forthcoming shutdown of Google Reader. I’m switching to Feedly: I’ve installed the Feedly extension for Safari on my Macs, and the Feedly iOS app on my iPhone and iPad. The UI seems fairly good, though I haven’t yet tried to crunch through the couple of hundred items I see on a typical weekday morning. At this stage, Feedly is still using the Google Reader back-end; I hope they get the new “Normandy” backend ready in plenty of time. And I really hope that the economics works out for them.

Although we all moaned about the decision to shutter it, Google Reader really made no business sense. Google was operating a huge pro bono database supporting millions of users via dozens of third-party clients (some free, some paid), and they had no way of monetizing it. I was surprised that it lasted as long as it did. I would have been willing to pay (quite a bit, actually), but it would have been really hard to retrofit the necessary changes into the API. If Feedly decides that they need to find some revenue, I hope that they provide an advertising-free, non-rate-limited “Pro” option for people like me.

Blogrolling

I’ve always been a huge user of Google Reader. I follow hundreds of feeds, in areas as diverse as cloud computing, Formula 1 racing, politics, philosophy, literature, atheism, genetics and behavioral economics. Obviously I’m going to be looking for a new back-end service to support the variety of clients that I use, but in the meantime I decided to transfer my RSS feeds related to cloud computing into the Blogroll for this blog.

It was a fairly tedious process, because I took the time to visit every blog. I wanted to make sure that it was really useful (subscribing is sometimes too easy), and that it was active – it had been updated at least once in 2013. I was disappointed to see how many blogs that I really enjoyed had fallen by the wayside (I’m looking at you, @ruv and @bradcasemore!), but then I’m hardly one to speak!

Anyway, you can see the result on the right. It’s a long list, and I decided to tag my top 10. If I were to lose all RSS feeds, I’d still be checking these ones by hand every day.

Exactly what “Concept” are you trying to “Prove”?

A bunch of us* had an interesting discussion on Twitter yesterday evening on the topic of customer readiness for cloud computing. It was kicked off by Aneel asking

Why does anyone think it’s ok to extrapolate from their one-old-laptop-lab to any kind of real deployment of anything?

to which I replied

Indeed. Why are cloud PoCs sold as “we’ll run on whatever you have” (inc. VMware) when no sane person would actually deploy on it?

One big problem seems to be that many companies don’t actually fund internal investigations of new technology, and so cloud computing experiments are brought in “under the radar” to avoid procurement hell. And then the team doing the work formalizes the work by calling it a Proof of Concept. Unfortunately they don’t actually think through what “concept” they need to “prove”. They can’t demonstrate operations at scale, since that would cost too much and/or provoke potential antibodies, so they’re reduced to showing that they can build a cloud – or as I put it

Like saying “I can take all the Legos out of the box and snap them together correctly.” Big whoopee….

Of course, such a “PoC” doesn’t prepare the team for the real issues of large-scale deployment and operations, and they often fail. (If they haven’t got the clout to take on corporate procurement, they’ll never be able to deal with the legacy ops teams.) This is bad for the customer, but can be disastrous for the startup company whose technology they were testing. Brian asked why startups don’t push back, and walk away from doomed PoCs; he cited Marten Mickos of Eucalyptus as being firm on that point. George and Randy both said that they push back constantly, but that startups rarely have the necessary leverage. However, George said that his organization had

… gotten to the point where we’ve developed a clear PoC checklist for readiness. [...] There are 2 parts to a cloud PoC: scalability of operations and internalization of consumption. [...] It starts w/ clearly defining PoC objectives b4 signing contract. [...] And then managing to those objectives.

We wrapped up by wondering if this would be a suitable topic for a panel session at a conference. We decided that it would have to include @botchagalupe and beer….


* @aneel, @cloudtoad, @georgereese, @brianmccallion, @randybias, @rfflores, @botchagalupe and me, @geoffarnold

The cloud is not a layer on top of what you already have

There are two widespread ideas about open source IaaS systems. The first is that they are like “Linux for the cloud”; the second that they correspond to one layer (API with orchestration) in a stack of technology that makes up the total system. Both of these have just enough truth to be deeply misleading. A system like OpenStack is a complex distributed system, with much greater architectural complexity than a POSIX operating system or a SQL database. This complexity is reflected in the number of system elements and the variety of alternatives that must be accommodated for each element. And while it is true that the public face of an IaaS system is represented by an API (or GUI) and the complex orchestration that each request triggers, it is more than a veneer of automation over an existing system.

In most cases, the existing infrastructure – virtualization system, compute, networking, storage – has been assembled over a number of years, and reflects both technology choices and operational policies and procedures. Traditional practices for resource allocation and configuration usually involve a change management procedure including trouble ticketing systems, management review, and implementation checklists to mitigate human error. They operate at a timescale of days or weeks, reflecting the economics and risk management associated with human activities. One cannot simply add a layer of automation requiring responses in seconds or minutes on top of such a system.

The fact is that an IaaS system is functionally complex and deep. It’s complex because of the variety of services that are being delivered to the users of the IaaS, and the way in which lower-level resources must be allocated, composed and configured to deliver these services. There are complex interdependencies between the various services and resources, which lead to the need for orchestration. All of these resources are managed through different subsystems, often supplied by different vendors, and most of these subsystems are running on different systems, communicating over a set of networking resources, and sharing a variety of common services for storage, security, and so forth. All of these elements are typically replicated for reasons of availability and scale. And everything is being driven by API calls with stringent latency requirements. Temporal constraints – and economics – mean that all of this must be 100% automated.

We cannot simply add a layer of IaaS on top of existing equipment, software, and operations. Every layer of the technology and operations of an IaaS, from API and GUI to physical infrastructure lifecycle management, needs to be re-thought. This doesn’t necessarily mean that IaaS is restricted to pure “greenfield” opportunities. However, rather than asking “what will work with what we already have?”, we need to start with a holistic architecture for the IaaS that reflects the technical requirements and business objectives. When we have this, we can consider the selective re-use of existing equipment, technologies and operational policies – but only when this makes good, long-term, economic and operational sense.

Challenges from the cloud for enterprise vendors

An important aspect of the shift from classical IT architecture to cloud computing is the way in which it changes requirements and priorities for solution components. Infrastructure-as-a-service clouds such as the ones Rackspace, Amazon, HP and Softlayer operate are based on a quid pro quo:

  • you can have on-demand, pay-as-you-go resources, and we’ll take care of all of the operational complexity and physical security, but
  • you only get to use a small set of standardized, generic compute, storage and networking resources, configured in a few simple patterns

The constraints on variety and complexity are inevitable consequences of the economic model of the cloud: resources are pooled, and provisioning is fully automated, based on API calls or GUI interactions. There’s no more “cut a ticket, we’ll get back to you in 24 hours”; response times are measured in seconds and minutes, which precludes any operational practices that require human intervention. And the functionality exposed to the users of the cloud is constrained by the cloud API, which inevitably tends to be a lowest-common-denominator compromise.

For the operator of a cloud, this operational model means that the requirements for infrastructure hardware and software are different from traditional IT.

  • Functional features are less important: as long as the basic features are supported, esoteric and specialized functions are irrelevant.
  • Much more important are operational characteristics, especially those that enable automated provisioning and configuration, at scale. For example, users who have worked with public clouds will expect a security group (integrated router/FW/NAT) to be provisioned in a matter of minutes, not hours or days.
  • Features that help keep the lights on are critically important. Excellent logging, alarming, failover, upgrade, roll-back are baseline requirements. Fail-in-place is essential.
  • Throughput is less important, because of the switch from scale-up to scale-out; from dedicated hardware/software appliances to VM-based software appliances.
  • Infrastructure change management is simplified, because shared appliances are replaced by per-tenant VMs, reducing the complexity and limiting the scope of any errors. (The need for disciplined change management doesn’t go away, of course: it just moves up to the virtual infrastructure level.)
  • Many traditional aspects of enterprise software, such as inflexible licensing, are hard to accommodate within this new model.
  • Cloud computing is still very immature, which means that pioneering service providers (whether public or private) are faced with the need to assemble a collection of ill-fitting components. They are naturally drawn to solution elements which make use of open source software frameworks, open APIs, and standard OS and hardware. This is a particular challenge for traditional enterprise vendors.

Thought for the day

[This post first appeared at GeoffArnold.com.]

The tl;dr version: Arguably all interesting advances in computer science and software engineering occur when a resource that was previously scarce or expensive becomes cheap and plentiful.

The longer version:

This particular thought was provoked by a series of exchanges on blogs and in Twitter yesterday. It started with a piece at Information Week in which Joe Emison bemoaned the fact that Netflix was holding back progress in cloud computing. The Clouderati jumped all over this, and Adrian put together a detailed response which he also posted to his blog. By the time I got around to responding, IW had closed comments on the original piece, and so I followed up on Adrian’s blog.

Joe’s criticism was based on two points:

Netflix’s cloud architecture[...] is fundamentally (a) so intertwined with AWS as to be essentially inseparable, and (b) significantly behind the best *general* open options for configuration management and orchestration.

Point (a) is pretty silly: Netflix is a business, not a charity. Of course they’re going to work with the best of breed. But it was Joe’s second point that really bugged me. I responded (and here’s where the “Thought for the day” comes in):

Amazon and Netflix are dramatically ahead of the curve, not behind it. The configuration management pattern you seem to prefer – just-in-time customization using Chef or Puppet – was pretty old school when Sun acquired CenterRun and built out N1 and Grid Engine. It’s incredibly inefficient compared with early-bound EBS-backed AMIs.

Arguably all interesting advances in computer science and software engineering occur when a resource that was previously scarce or expensive becomes cheap and plentiful. We’ve seen it with graphical user interfaces, interpreted languages, distributed storage, and SOA. Traditional late-bound configuration management treats machine images and VM instances as expensive; AWS and Netflix invite you to imagine the possibilities if they’re effectively free. Welcome to the real Cloud 2.0…

In a subsequent Twitter exchange, I said:

@adrianco We used to talk about “specific excess MIPS” driving change. Now it’s “specific excess VMs”

… to which Adrian replied:

@geoffarnold with SSD excess IOPS can be used in interesting ways

Live, learn, fix, repeat: a conversation with Derrick Harris

On Monday I had a long conversation with Derrick Harris of GigaOM, which turned into a piece entitled How to deal with cloud failure: Live, learn, fix, repeat. This was very well received (“one of the most intelligent articles I’ve read on this topic”), and I’d like to add a couple of related observations that we couldn’t work into the conversation.

Large scale cloud computing systems are characterized by two important properties. First, many of their functions are asynchronous, not transactional. The user issues an API call to start a VM in EC2 or OpenStack, the service acknowledges the request, and at some point later on the operation is completed. Usually. Perhaps not; maybe the request will fail. This is hard enough to deal with; when we start to compose multiple asynchronous mechanisms, the complexity can spiral out of control. For example, what happens when we bring together:

  • An autoscaling system which monitors traffic and adds or removes capacity (VMs) as needed.
  • A software deployment system which can roll out a new release to an existing fleet of machines at a controlled rate, and supports roll-back if the new software proves buggy.
  • A self-healing mechanism which detects failed or “sick” instances of a fleet of VMs, and restarts or replaces them.
  • A fault injection system (like the Netflix “Chaos Monkey”) which exercises the robustness of the system by periodically killing components.
  • And all of these using an asynchronous API for managing VM instances.

While it may be possible to predict the eventual state of such a system, the path to that state, and the time to reach it, are extraordinarily hard to wrap your head around.
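Even a single asynchronous call forces the caller to handle three outcomes: success, failure, and "still don't know". The Python sketch below simulates this with a toy in-memory API; `FakeCloud`, `wait_for` and the tick-based clock are all hypothetical stand-ins and do not correspond to the EC2 or OpenStack client libraries.

```python
from typing import Dict, FrozenSet


class FakeCloud:
    """Toy stand-in for an asynchronous IaaS API, using simulated time."""

    def __init__(self, ticks_to_boot: int, fail_ids: FrozenSet[str] = frozenset()) -> None:
        self._ticks_to_boot = ticks_to_boot
        self._fail_ids = fail_ids
        self._remaining: Dict[str, int] = {}

    def start_vm(self) -> str:
        """Acknowledge the request immediately; the VM is not running yet."""
        vm_id = f"vm-{len(self._remaining) + 1}"
        self._remaining[vm_id] = self._ticks_to_boot
        return vm_id

    def tick(self) -> None:
        """Advance simulated time by one step for every VM."""
        for vm_id in self._remaining:
            if self._remaining[vm_id] > 0:
                self._remaining[vm_id] -= 1

    def status(self, vm_id: str) -> str:
        if self._remaining[vm_id] > 0:
            return "pending"
        return "error" if vm_id in self._fail_ids else "running"


def wait_for(cloud: FakeCloud, vm_id: str, max_checks: int) -> str:
    """Check status up to max_checks times, advancing time between checks."""
    for _ in range(max_checks):
        state = cloud.status(vm_id)
        if state != "pending":
            return state
        cloud.tick()
    return "timeout"


cloud = FakeCloud(ticks_to_boot=3, fail_ids=frozenset({"vm-2"}))
first = cloud.start_vm()
second = cloud.start_vm()
print(first, wait_for(cloud, first, max_checks=5))
print(second, wait_for(cloud, second, max_checks=5))

slow_cloud = FakeCloud(ticks_to_boot=10)
third = slow_cloud.start_vm()
print(third, wait_for(slow_cloud, third, max_checks=5))
```

An autoscaler, a deployer, a self-healer and a fault injector would each run a loop like `wait_for` against the same fleet, and each "timeout" leaves the caller guessing whether to retry, roll back, or replace.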

The second characteristic of today’s cloud systems is the massive level of replication. S3 or Swift objects are stored in multiple data centers, service instances are replicated behind load balancers, load-balanced clusters are replicated in multiple data centers behind DNS anycasting or round-robin, and so forth. This is nothing new: we’ve been replicating, clustering, and doing fail-over since the 1960s. However, at scale we cannot always afford to wait until replication has been completed – we can’t wrap a neat transaction around it – and we have embraced designs based on “eventual consistency” as a way of boosting throughput.
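A toy Python model of that trade-off, for illustration only: a write is acknowledged as soon as one replica has it, and the other replicas catch up later. The `EventuallyConsistentStore` class and its methods are hypothetical and make no claim about how S3 or Swift are actually built.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Tuple


class EventuallyConsistentStore:
    """Toy replicated key-value store with deferred replication."""

    def __init__(self, replica_count: int) -> None:
        self._replicas: List[Dict[str, str]] = [{} for _ in range(replica_count)]
        # Queue of (replica index, key, value) updates not yet applied.
        self._pending: Deque[Tuple[int, str, str]] = deque()

    def put(self, key: str, value: str) -> None:
        """Write to replica 0 and return; the other replicas are updated later."""
        self._replicas[0][key] = value
        for index in range(1, len(self._replicas)):
            self._pending.append((index, key, value))

    def get(self, key: str, replica: int) -> Optional[str]:
        """Read from one replica, which may be stale."""
        return self._replicas[replica].get(key)

    def replicate(self, steps: int) -> None:
        """Apply up to `steps` pending updates, oldest first."""
        for _ in range(min(steps, len(self._pending))):
            index, key, value = self._pending.popleft()
            self._replicas[index][key] = value

    def converged(self) -> bool:
        """True once no updates are waiting to be applied."""
        return not self._pending


store = EventuallyConsistentStore(replica_count=3)
store.put("photo", "v1")
print(store.get("photo", replica=0), store.get("photo", replica=2))

store.replicate(steps=1)
print(store.get("photo", replica=1), store.get("photo", replica=2))

store.replicate(steps=5)
print(store.get("photo", replica=2), store.converged())
```

The write returns quickly because it never waits for the queue to drain; the price is that two readers can legitimately see different answers until it does.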

Put these two trends together, and we have systems that are radically asynchronous, massively replicated, and non-deterministic. These systems are hard to work with. Developers and operators like things to be transactional, linear, deterministic: “get used to disappointment”. We’re still learning how to work with systems like this, and we won’t always get it right. I was particularly peeved to read a simplistic assessment of the recent AWS outage in which the author claimed that…

… affected customers — Instagram, Pinterest, Pocket and Netflix, which all suffered from Amazon’s recent outage on the weekend — hadn’t used the ability of the cloud to create geographically redundant links.

“They could operate at a higher level of redundancy, so that these sort of outages would only have a minimal impact on them. It’s a matter of cost,” Bettin said.

In a word, bullshit. I know many of the guys who built these systems, and they have all incorporated high levels of redundancy in their systems. None of them are “backup cheapskates” – that’s an insult which simply betrays the author’s failure to appreciate the complexity of the domain. I don’t know all of the root causes of their outages, but I’m prepared to bet that most were due to bugs arising from complex interactions between multiple systems, of the kind that I discussed earlier.

As I said to Derrick, I don’t expect the failure rate to plateau any time soon, because we are continually expanding scale, pushing performance, and introducing new sources of complexity. I do think that our tools for modeling, diagnosing, and repairing these systems are improving rapidly, however. I’ve mentioned Chaos Monkey as an example of fault injection techniques; I also expect to see more systems adopting anti-entropy elements, and dedicating a portion of the capacity of every system to in-production testing of functionality and performance.