OpenStack, service contracts, and interoperability

We’re going into the second day of the OpenStack Summit here in Portland, and for me the most important item on the agenda is the last session of the day: a panel on interoperability. Several leading members of the OpenStack community have expressed concern about the risk of divergence between different OpenStack-based services. Josh McKenty of Piston has been very vocal on the topic, and is proposing a RefStack that could be used for compliance testing.

To those of us who worked on Java in the early days, this feels very familiar. Welcome to the JCP.

While RefStack may be a necessary part of the solution to the interoperability problem, I am convinced that it isn’t sufficient. The biggest concern that I have is about the OpenStack design process. Everybody seems to be busily designing individual features and mechanisms, but nobody appears to be responsible for the specification of the OpenStack service itself. And that’s a problem.

There was a great example of this yesterday: a minor discussion in the networking part of the Design Summit. The details are relatively unimportant: a proposal for a mechanism to allow Quantum to provide some network hop cost information to the Nova scheduler. (It was a pretty bad design, IMHO.) But what was interesting was the use case which the authors provided to motivate the discussion: how to optimize the placement of elements of a scale-out three-tier application. In order to exploit the proposed mechanism, some process would have to compute the cost (measured in hops, latency, or whatever) of every configuration of available resources that matched the application graph. The problem is immediately obvious: who could carry out this process? The Nova Scheduler can’t, because it has no way of predicting the topology that will result from a sequence of VM instantiations. The IaaS user can’t, because she doesn’t have access to the full list of Nova resources. Of course a declarative model (like that provided in the vCloud API) would do the trick, but OpenStack doesn’t offer such a service.
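To make the contrast concrete, here is a rough sketch of what a declarative request for that three-tier application might look like: the user describes the application graph and the constraints that matter to her, and the scheduler, which alone knows the physical topology, does the placement. This is purely illustrative; none of the names below correspond to an actual OpenStack or vCloud API.

```python
# Illustrative only: a hypothetical declarative placement request.
# The user describes the application graph; the service chooses hosts.

app_graph = {
    "tiers": {
        "web": {"flavor": "m1.small", "count": 4},
        "app": {"flavor": "m1.medium", "count": 4},
        "db":  {"flavor": "m1.large", "count": 2},
    },
    # Edges carry the communication relationships the scheduler should optimize.
    "edges": [
        {"from": "web", "to": "app", "traffic": "high"},
        {"from": "app", "to": "db",  "traffic": "high"},
    ],
    # Constraints are expressed against the service contract, not the topology.
    "constraints": [
        {"type": "low-latency", "between": ["web", "app", "db"]},
        {"type": "anti-affinity", "within": "db", "scope": "rack"},
    ],
}

# The user hands the whole graph to the service in one call; only the
# scheduler needs to reason about hop counts or latency.
# deployment = cloud.deploy(app_graph)   # hypothetical call
```

The point of the sketch is that the hop-cost computation stays inside the system, where the information to perform it actually lives.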

So in many ways this was a pretty futile discussion, and for me the main reason was simple: there were no service-level requirements. There was a free-floating use case, unanchored to any external API or user-level abstraction.

Now I’m not saying that this is an unimportant problem; it isn’t. It is certainly useful to allow an OpenStack user to provide a hint that certain instances will be communicating intensively, and that the scheduler should try to place them in such a way as to minimize latency. (And we have a plausible example of a service-level abstraction that captures this: the EC2 “Placement Group” concept.) We also have the case where the user requires that instances be placed on different servers or racks, to provide increased availability. What should the right abstractions look like? How do we represent this in the APIs – both syntax and semantics? What kind of SLAs are appropriate? And are there any constraints or dependencies associated with this aspect of the service? All of this would seem to be necessary before diving into a discussion of intra-system mechanisms.
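As a minimal sketch, loosely modeled on the EC2 Placement Group idea, the user-facing abstraction might look something like the following. The client class and its methods are hypothetical, invented here to show the shape of the contract, not an existing OpenStack or EC2 library.

```python
# A minimal sketch of a user-level placement abstraction, loosely modeled on
# the EC2 "Placement Group" concept. All names are hypothetical.

class PlacementClient:
    """Stub illustrating the shape of the contract, not an implementation."""

    def create_group(self, name, policy):
        # policy: "cluster" (minimize latency between members) or "spread"
        # (maximize failure independence, e.g. different servers or racks).
        return {"name": name, "policy": policy}

    def boot(self, image, flavor, group):
        # The group is the only placement information the user supplies; the
        # scheduler is responsible for honoring the policy, or failing with a
        # well-defined error if it cannot.
        return {"image": image, "flavor": flavor, "group": group["name"]}


client = PlacementClient()
low_latency = client.create_group("web-tier", policy="cluster")
vm = client.boot("ubuntu-12.04", "m1.small", group=low_latency)
print(vm)
```

Even a toy like this forces the service-level questions to the surface: what policies exist, what the scheduler promises when it accepts one, and what happens when it can’t.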

There is always a risk when building a complex system that the external interface will emerge as an almost accidental side-effect of the implementation process. For isolated implementations, this may be OK – the client can meet the service half-way. For durable platforms, it is almost always a very bad thing. If the API is based on the implementation, changes to the implementation usually lead to changes in the API. There is always the risk that the system winds up being usable only by people like the implementor, which simply doesn’t scale.

As I’ve written before, I’m a big fan of API-first design. Specify the requirements for a system. Develop an API that captures the service contract. Test that API: ask potential users of the system to work through how they would code to it – BEFORE YOU WRITE A SINGLE LINE OF IMPLEMENTATION CODE. Think about the variety of possible implementations – technology, scale, private vs. multi-tenant, billed vs. free. Then implement. The result is likely to be a more stable service contract, with fewer incompatible changes from release to release, and with greater interoperability between deployments.
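For illustration, here is one way that flow can look in practice: the contract is written down first, and a prospective user exercises it against a throwaway fake before any real implementation exists. Every name below is invented for the example.

```python
# An illustrative sketch of API-first design: write the contract, let a
# potential user code against it via a fake, and only then implement.
# All names here are invented for illustration.

import abc


class ComputeService(abc.ABC):
    """The service contract: what callers may rely on, release to release."""

    @abc.abstractmethod
    def boot(self, image: str, flavor: str) -> str:
        """Start an instance and return its ID."""

    @abc.abstractmethod
    def status(self, instance_id: str) -> str:
        """Return 'BUILDING', 'ACTIVE', or 'ERROR'."""


class FakeCompute(ComputeService):
    """A fake used only so prospective users can code to the contract."""

    def __init__(self):
        self._instances = {}

    def boot(self, image, flavor):
        instance_id = f"i-{len(self._instances)}"
        self._instances[instance_id] = "ACTIVE"
        return instance_id

    def status(self, instance_id):
        return self._instances[instance_id]


def user_workflow(compute: ComputeService):
    # Written by a potential user, against the contract alone.
    instance_id = compute.boot("ubuntu-12.04", "m1.small")
    assert compute.status(instance_id) in ("BUILDING", "ACTIVE")
    return instance_id


print(user_workflow(FakeCompute()))
```

If the workflow feels awkward to write against the fake, that is feedback on the contract, and it arrives while the contract is still cheap to change.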
