“Don’t give up on OpenStack”? I’m trying not to…

Over at Information Week, Andrew Froehlich pleads “Don’t Give Up On OpenStack”. And I’m not. But as I commented, the change that is needed conflicts with one of the deepest impulses in any community-based activity, from parent-teacher organizations to open source software projects:

The hardest thing for an open source project to establish is a way of saying “no” to contributors, whether they be individual True Believers in Open Source, or vast commercial enterprises seeking an architectural advantage over their competitors. The impulse is always to do things which increase the number of participants, and it is assumed that saying “no” will have the opposite effect. In a consensus-driven community, this is a hard issue to resolve.

(My emphasis.)

“Ubiquitous standard” vs. “open source”

Last week I wrote the following on Facebook:

I need to write a blog post about open source: specifically about those people who respond to any criticism of a project by saying, “The community is open to all – if you want to influence the project, start contributing, write code.” Because, seriously, if you want your project to be really successful, that attitude simply won’t work. It doesn’t scale. You want to have more users than implementors, and those users need a voice. (And at scale, implementors are lousy proxies for users.)

Maybe this weekend.

This is that blog post. It has been helped by the many friends who added their comments to that Facebook post, and I’ll be quoting from some of them.

Over the last few months, I’ve been trying to figure out why I’m uncomfortable with the present state of OpenStack, and whether the problem is with me or with OpenStack. Various issues have surfaced – the status of AWS APIs, the “OpenStack Core” project, the state of Neutron, the debate over Solum – and they all seem to come down to a pair of related questions: what is OpenStack trying to do, and who’s making the decisions? And my discomfort arises from my fear that, right now, the two answers are mutually incompatible.

First, the what. That’s pretty simple on the surface. We can look at the language on the OpenStack website, and listen to the Summit keynotes, and (since actions speak truer, if not louder, than words) see the kind of projects which are being accepted. OpenStack is all about building a ubiquitous cloud operating system. There is to be one code base, with an open source version of every single function. And although the system will be delivered through many channels to a wide variety of customers, the intent is that all deployed OpenStack clouds will be fundamentally interoperable; that to use the name OpenStack one must pass a black-box compatibility test. (That’s the easy bit; the more fuzzy requirement is that your implementation must be based on some minimum set of the open source code.) This interoperability is motivated by two goals: first, to avoid the emergence of (probably closed) forks, and second to enable the creation of a strong hybrid cloud marketplace, with brokers, load-bursting, and so forth.

This means that the OpenStack APIs (and, in some less well defined sense, the code) are intended to become a de facto standard. In the public cloud space, while Rackspace, HP, IBM and others will compete on price, support, and added-value services, they are all expected to offer services with binary API compatibility. This means that their customers, presumably numbering in the tens or hundreds of thousands, are all users of the OpenStack APIs. They have a huge stake in the governance and quality of those APIs, and will have their opinions about how they should evolve.

How are their voices heard? What role do they play in making the decisions?

Today, the answers are “they’re not” and “none”. Because that’s not how open source projects work. As Simon Phipps wrote, “…open source communities are not there to serve end users. They are there to serve the needs of the people who show up to collaborate.” And he went on to say,

What has never worked anywhere I have watched has been users dictating function to developers. They get the response Geoff gave in his original remarks, but what that response really means is “the people who are entitled to tell me what to do are the people who pay me or who are the target of my volunteerism — not you with your sense of entitlement who don’t even help with documentation, bug reports or FAQ editing let alone coding.”

As I see it, there are two basic problems with this thinking. First, it doesn’t scale. If you have hundreds of thousands of users, of whom a small percentage want to help, and a couple of dozen core committers, the logistics simply won’t work. (And developers are often very bad at understanding the requirements of real users.) Maybe you can use the big cloud operators and distribution vendors as proxies for the users, but open source communities can be just as intolerant of corporate muscle as they are of users who don’t get involved. Some of that can be managed — many of the resources that Simon terms “volunteerism” actually come from corporations — but not all.

But the second problem is that the expectation of collaborator primacy is incompatible with the goal of creating a standard. The users of OpenStack have a choice, and the standard will be what they choose. If the OpenStack community wants their technology to become that standard, they must find a way to respect the needs and expectations of the users. And that, quid pro quo, means giving up some control, which will affect what the community gets to do — or, more often, not do.

This is a much bigger issue for OpenStack than for other open source projects, for a couple of reasons. Several of the most successful projects chose to implement existing de facto or de jure standards — POSIX, x86, X Windows, SQL, SMB, TCP/IP. For these projects, the users knew what to expect, there were existing (non-FOSS) alternatives, and the open source communities accepted the restrictions involved in not compromising the standards. Other projects were structured around a relatively compact functional idea — Hadoop, MongoDB, Xen, memcached, etc. — and the target audience was relatively small: like-minded developers. OpenStack is neither compact, nor (by choice) standards-based. It has a large number of components, with a very large number of pluggable interfaces. The only comparable open source project is OpenOffice, which had a similarly ambitious mission statement and has also had its share of governance issues.

One obvious step towards addressing the problem would be to distinguish between the “what” and the “how” of OpenStack. Users are interested in the APIs that they use to launch and kill VMs, but they are unlikely to know or care about the way in which Nova uses Neutron services to set up and tear down the network interfaces for those VMs. Good software engineering principles call for a clear distinction between different kinds of interfaces in a distributed system, together with clear policies about evolution, backward compatibility, and deprecation for each type. This is not simply a matter of governance. For example, a user-facing interface may have specific requirements in areas such as security, load balancing, metering, and DNS visibility that are not applicable to intrasystem interfaces. There is also the matter of consistency. Large systems typically have multiple APIs — process management, storage, networking, security, and so forth — and it is important that the various APIs exhibit consistent semantics in matters that cut across the different domains.
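To make that concrete, here’s a minimal sketch of what an interface taxonomy might look like if it were written down as code. Everything in it (the names, the categories, the deprecation numbers) is invented for illustration; nothing like this exists in OpenStack today.

```python
from dataclasses import dataclass
from enum import Enum


class Audience(Enum):
    USER_FACING = "user-facing"      # the "what": tenants and their tooling
    INTRA_SYSTEM = "intra-system"    # the "how": service-to-service calls


class Stability(Enum):
    STABLE = "stable"        # changes require a full deprecation cycle
    EVOLVING = "evolving"    # backward-compatible additions only
    INTERNAL = "internal"    # free to change from release to release


@dataclass(frozen=True)
class InterfaceContract:
    name: str
    audience: Audience
    stability: Stability
    deprecation_releases: int    # how long an old version must be kept alive


# A user-facing API carries a much heavier contract than an internal one.
compute_api = InterfaceContract(
    "compute-servers", Audience.USER_FACING, Stability.STABLE, deprecation_releases=2)
vif_plugging = InterfaceContract(
    "nova-neutron-vif-plug", Audience.INTRA_SYSTEM, Stability.INTERNAL, deprecation_releases=0)
```

The point is not the code: it’s that every interface would carry an explicit audience and an explicit evolution policy, which is exactly what is missing today.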

The adoption of some kind of interface taxonomy and governance model seems necessary for OpenStack, so that even if community members have to relinquish some control over the user-facing interfaces (the “what”), they still have unfettered freedom with the internal implementation (the “how”). Today, however, we are a long way from that. OpenStack consists of a collection of services, each with its own API and complex interdependencies. There is no clear distinction between internal and external interfaces, and there are significant inconsistencies between the different service APIs. At the last OpenStack Summit I was appalled to read several Blueprints (project proposals) which described changes to user-facing APIs without providing any kind of external use-case justification.

The present situation seems untenable. If OpenStack wants to become an industry standard for cloud computing, it will have to accept that there are multiple stakeholders involved in such a process, and that the “people who show up to collaborate” can’t simply do whatever they want. At a minimum, this will affect the governance — consistency, stability — of the user-facing interfaces; in practice it will also drive functional requirements. Without this, traditional enterprise software vendors delivering OpenStack-based products and services will have a hard time reconciling their customers’ expectations of stability with the volatility of a governance-free project. The bottom line: either the community will evolve to meet these new realities, or OpenStack will fail to meet its ambitious goals.

UPDATE: Rob Hirschfeld seems to want it both ways in his latest piece. Resolving the tension between ubiquitous success and participatory primacy doesn’t necessarily require a “benevolent dictator”, but that’s one way of getting the community to agree on a durable governance model.

Whither OpenStack: A footnote on scale

My blog post yesterday, Whither OpenStack, was already too long, but I wish I’d included some of the points from Bernard Golden’s outstanding piece, “AWS vs. CSPs: Hardware Infrastructure”. You should read the whole thing, but the key message is this:

Amazon, however, appears to hold the view that it is not operating an extension to a well-established industry, in which effort is best expended in marketing and sales activities pursued in an attempt to build a customer base that can be defended through brand, selective financial incentives, and personal attention. Instead, it seems to view cloud computing as a new technology platform with unique characteristics — and, in turn, has decided that leveraging established, standardized designs is inappropriate. This decision, in turn, has begotten a decision that Amazon will create its own integrated cloud environment incorporating hardware components designed to its own specifications.

As you read Bernard’s piece, think about the architecture of the software that transforms Amazon’s custom hardware into the set of services which AWS users experience. It has about as much in common with, say, OpenStack’s DevStack (or, to be fair, Eucalyptus FastStart – sorry, Mårten!) as a supertanker does with a powerboat.

[Photo: container ship]

In this world, you can’t start small and then “add scale”; the characteristics needed to operate at extreme scale become fundamental technical (and business) requirements that drive the systems architecture.

This is the challenge that Rackspace, HP and others face. Their core software was designed – and is continually being evolved – by a community that doesn’t have those architectural requirements. This is not a criticism; it’s simply the reality of working in a world defined by things like:

  • Tempest should be able to run against any OpenStack cloud, be it a one node devstack install, a 20 node lxc cloud, or a 1000 node kvm cloud.

and this piece on OpenStack HA (read through to see all of the caveats). I know the guys at HP and Rackspace: they are smart, creative engineers, and I’m sure they can build software as good as AWS’s. The question is, can they do so while remaining coupled to an open source community that doesn’t have the same requirements?

Whither OpenStack?

tl;dr
OpenStack’s sweet spots seem to be SaaS providers and carriers. Public deployments will struggle; private clouds are difficult and may be ephemeral.

Context
It’s two weeks after the OpenStack Summit in Hong Kong and one week after the AWS re:Invent event in Las Vegas, and social media is full of passionate debate about the state of OpenStack, the future of private clouds, the juggernaut that is AWS, and more.

For those less Twitter-obsessed than I, here are a few of the key pieces:

to whom it may concern

What I saw at the OpenStack Summit

Why vendors can’t sell OpenStack to enterprises

Not Everyone Believes That OpenStack Has Succeeded

Inside OpenStack: Gifted, troubled project that wants to clobber Amazon

OpenStack Wins Developers’ Hearts, But Not IT’s Minds

The last twelve months

The End of Private Cloud – 5 Stages of Loss and Grief

Most of the discussion is focussed on the Holy Grail of “enterprise”, and that was certainly the focus of re:Invent. But that’s not the only market for OpenStack; as I wrote in “A funny thing happened on the way to the cloud”, we’ve had substantial “mission creep” since the days of the NIST taxonomy. Different members of the community are interested in addressing different kinds of use cases with OpenStack. How is this affecting the architecture and processes of OpenStack? Is it practical for OpenStack to serve all of these needs equally well, and what are the costs of doing so?

There are some pundits (@krishnan, for instance, and @cloudpundit) who argue that OpenStack’s role is to be a kit of parts from which different organizations – vendors and large users – will assemble a variety of solutions. On this view, it doesn’t particularly matter if the APIs for different OpenStack services are somewhat inconsistent, because the creator of the public cloud or distribution will do the necessary work on “fit and finish”; if necessary they may replace an unsuitable service with an alternative implementation. (At the extreme end of that camp we have people like @randybias who want to replace the entire API with an AWS workalike.) On the other hand, there is a movement afoot, led by @jmckenty, @zehicle and others, to develop a certification process to improve interoperability of OpenStack implementations in the service of hybrid deployments and to help to grow the developer ecosystem. Rather than asking which of these is the “right” position, it’s probably more instructive to see how the OpenStack community is actually behaving.

Markets
There seem to be five distinct areas where OpenStack is being used:

  • Public IaaS cloud – Rackspace, HP, etc.
  • SaaS provider – PayPal, Yahoo, Cisco WebEx
  • Carrier infrastructure – AT&T, Verizon
  • Private IaaS cloud (often hosted)
  • Enterprise datacenter automation

Most of these are fairly self-explanatory, but the distinction between the last two is important. Both are typically enterprise or government customers. The first is usually a greenfield deployment with a “clean sheet” operational philosophy; the second is an attempt to provide some automation and self-service to an existing enterprise data center, complete with heterogeneous infrastructure and traditional operational policies.

Let’s see how OpenStack is doing in each of these areas:

Public IaaS cloud
Public cloud service is all about the economics of operation at scale. Stable interfaces – both APIs and tools. Consistent abstractions, so that you can change the implementation without breaking the contract with your customers. Measuring everything. Automating the full lifecycle. Capacity planning is key.

OpenStack has been shortchanging this area. The API story is weak, with too many changes without adequate compatibility. The default networking framework doesn’t really scale, and alternatives like NSX, Nuage, OpenContrail and Midonet simply replace all of the Neutron mechanisms. (They don’t necessarily interoperate with all of the vendor-supplied Neutron plugins.) Mechanisms for large-scale application deployments, like availability zones and regions, are implemented inconsistently across the various services.

On the other hand, public clouds are typically (or ideally!) operated at a large enough scale that, as Werner Vogels put it, “software costs round to zero”. So they can afford to throw engineering resources at filling the gaps and fixing the issues.

The most difficult issue for public clouds based on OpenStack is around features. The main competitors are AWS, Google, and Microsoft, all of which can add new services, focussed on customer requirements, much more quickly than the OpenStack community. Rackspace, HP and others face a dilemma: do they wait for the OpenStack community to define and implement a new service, or do they create their own service offerings that are not part of the OpenStack code base? Waiting for the community cedes the market to the proprietary competition, and has other complications, such as the requirement that there has to be an open source reference implementation of every OpenStack service, and the potential for compromise to address the needs of different parts of the community. Proceeding independently may help to close the competitive feature gap, but it’s likely to lead to substantial “tech debt” and/or compatibility issues when the community finally gets round to delivering a comparable service.

SaaS provider
A SaaS provider combines the operational scale of a public cloud with the captive tenant base of a private cloud. Large-scale networking issues dominate the architectural discussion. The dominant KPI is likely to be “code pushes per day”. API issues are less critical, since there is usually a comprehensive home-grown applications management framework in use. As with the public cloud, the SaaS provider has the expertise and engineering resources to do large scale customization and augmentation.

OpenStack is serving this constituency relatively well, although scalability remains a concern.

Carrier infrastructure
Wireless and wire-line carriers are looking forward to NFV, which will allow them to replace dedicated networking infrastructure with virtualized software components that can be deployed flexibly and efficiently. It is therefore not surprising that they are interested in infrastructure automation technologies that will facilitate the deployment of VMs and the configuration of their networks. What distinguishes the carriers from other OpenStack users is that their applications often cut across the typical layers of abstraction, particularly with respect to networking. In a public IaaS, the tenant VMs interact with virtualized networking resources – ports, subnets, routers, and load balancers. They have no visibility into the underlying technologies used to construct these abstractions: virtual switches, encapsulation, tunnels, and physical and virtual network appliances. This opaque abstraction is important for portability and interoperability. For carriers, it is often irrelevant: their applications may perform direct packet encapsulation, and can manipulate the chain of NFV services.

There’s a lot of interest in these use cases within OpenStack today. One obvious concern relates to the status of the APIs involved. Public cloud providers probably won’t want their tenants diving in to manipulate service chaining, or getting access to the MPLS or VXLAN configuration of the overlay network. Today the only way of limiting access to specific OpenStack APIs is the Keystone RBAC mechanism, which doesn’t enforce any kind of semantic consistency. One solution might be to package up specific APIs into different OpenStack “Editions”.
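For a rough flavour of the kind of coarse-grained, role-based gating that is available today, here’s a toy sketch in Python. The action names and roles are invented for illustration (the real mechanism lives in per-service policy files), and, as noted above, nothing in it gives you semantic consistency.

```python
# Toy policy table in the spirit of the role-based rules that OpenStack services
# consult today: each API action maps to the set of roles allowed to call it.
# The action and role names here are illustrative, not real policy entries.
POLICY = {
    "create_router": {"member", "admin"},
    "update_service_chain": {"admin"},          # carrier/NFV-style operation
    "get_overlay_segmentation_id": {"admin"},   # hide VXLAN/MPLS details from tenants
}


def is_authorized(action: str, roles: set) -> bool:
    """Return True if any of the caller's roles may invoke the action."""
    return bool(POLICY.get(action, set()) & roles)


assert is_authorized("create_router", {"member"})
assert not is_authorized("update_service_chain", {"member"})
```

This can keep a tenant away from a particular call, but it says nothing about whether the set of APIs exposed together makes sense as a coherent product, which is why the “Editions” idea is attractive.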

It seems likely that the specific use cases for OpenStack in managing carrier infrastructure are sufficiently bounded that the lack of major application services will not be a problem.

Private IaaS cloud
There is a persistent belief that enterprise customers want – and need – private IaaS clouds. Not IaaS-like features bolted on to their existing infrastructure, but pure NIST-compliant IaaS clouds that just happen to be private, running on wholly-owned physical infrastructure. There are several arguments advanced for this. One – InfoSec – is probably unsustainable: public clouds invest far more in security and compliance than any enterprise could hope to, and the laws and regulations will soon reflect this. The second – cost – is occasionally valid, but widely abused: ROI analyses rarely take into account all costs over a reasonable period of time. In addition, the benefits of an IaaS cloud usually depend on the development of new, cloud-aware applications, and such applications can usually be designed to operate more cost-effectively in a public cloud.

So how’s OpenStack doing for private clouds? Not very well. The cost and complexity of deploying OpenStack is extremely high, even if you work with an OpenStack distribution vendor and take advantage of their consulting services. Yes, there are plenty of tools for doing an initial deployment (too many), but almost none for long-term maintenance. To achieve enterprise-grade operational readiness you’ll have to supplement OpenStack with at least a dozen additional open source or commercial tools*, and do the integration yourself; then you’ll be responsible for maintaining this (unique) system indefinitely.

Analyst surveys suggest that most enterprises are looking at private clouds as part of a hybrid cloud strategy. In this case, the lack of high-fidelity compatibility with most public clouds is going to be a problem. There are actually two issues: API interoperability (e.g. good support for the AWS APIs in OpenStack), and feature mismatch (AWS has more, richer features than OpenStack, and the gap is growing).

Enterprise datacenter
Once upon a time, the private cloud was seen as a radical alternative to the traditional enterprise datacenter: an opportunity to replace bespoke server and networking configurations with interchangeable pools of infrastructure, and to deliver automated self-service operations in place of bureaucratic human procedures. Great emphasis was placed on the need to design the cloud service from the top down, focussing on the requirements of the users, rather than viewing it as a layer on top of existing enterprise virtualization systems. It was (correctly) assumed that many traditional data center management practices would be incompatible with the kind of automation provided by cloud management platforms like OpenStack and CloudStack.

Unfortunately, many enterprises felt the need to try to cut corners: to deploy IaaS within their existing data center environment, leveraging existing infrastructure. Some literally treated the cloud as “just another large application cluster”. Many of these early experiments failed, because of the difficulty of making cloud operations conform to existing policies. The number of successful projects of this kind is a matter of debate.

The OpenStack project has been doing a lot to facilitate this kind of deployment. Brocade and its partners have integrated FC SAN support into the Cinder storage service, and we’ve proposed improvements to Neutron that will make it much easier to use heterogeneous network resources from different vendors. Mirantis has worked with VMware to allow OpenStack to be deployed on top of vSphere, and Nova now supports the use of several different hypervisors within a single cloud. (The latter is presumably to cater to applications which are sensitive to specific hypervisor features – something that no modern cloud-ready application should care about.)

This work to accommodate legacy infrastructure is obviously addressing a real need. It’s worth asking what the cost has been, particularly in complexity, stability, API governance, and opportunity cost. Could we have delivered a decent load-balancing solution earlier? Would we have a more scalable L3 capability? Hard to tell.

Summary
So where does this leave us? It seems to me that the sweet spot for OpenStack today (and for some time to come) is going to be with the SaaS provider, such as PayPal, Cisco WebEx, and Yahoo. (I wonder if the recent announcement by Salesforce.com and HP means that SFDC will be moving in that direction.) Carriers will happily do their own thing, with potentially awkward implications for networking APIs. Public clouds will face the challenge of back-porting their (many) changes to the trunk, and figuring out how to keep up with AWS. And enterprise use will continue to be challenged by the complexity and cost of setting up and then maintaining private clouds, whether green-field or add-in.


* E.g. API management, identity integration, guest OS images, DNS, SIEM, monitoring, log analysis, billing, capacity planning, load testing, asset management, ticket management, configuration management

Following up on the Seagate Kinetic piece…

Following up on my unexpectedly viral piece about Seagate’s Ethernet-attached disk architecture (40,000 hits so far), here are a couple of pictures from the OpenStack Design Summit session on Swift + Kinetic:


[Photo from the Swift + Kinetic session]

[Photo from the Swift + Kinetic session]


Updated to add one more picture. (When I was at Sun, I used to say that the truth was not in press releases or PowerPoint decks but in t-shirts.)

[Photo: the t-shirt]

And yes, I want one of those t-shirts!

Preparing for Hong Kong

It’s a beautiful Sunday afternoon here in Silicon Valley. It’s been a good weekend for sports: Manchester United won (finally!), Sebastian Vettel claimed his fourth F1 Drivers’ Championship, and the Patriots came from behind to thrash Miami. And then there’s the Red Sox; oh well, three out of four isn’t bad. But all of those events are sitting on my DVR, because for the next week I’m focussed on one thing: preparing for the upcoming OpenStack Summit.

This time next week I’ll be in Hong Kong as part of the Brocade team, joining thousands of cloud computing technologists, users, salespeople, and writers for a week of business and technical sessions. My main focus will be on the Design Summit sessions for Neutron, the OpenStack networking subsystem formerly known as Quantum. My colleagues will be involved in a variety of areas, including FC SAN features for the Cinder storage service, load balancers, and integration of our VCS fabric. Several of them are presenting in the main Summit. And we’ll all be talking to customers and partners.

OpenStack networking is complicated. This is mostly because data center networking is going through a period of massive disruption in several different areas, leading to a combinatorial explosion of complexity. Overlay architectures, different kinds of tunneled underlay, the replacement of dedicated network equipment by software running in VMs, the emergence of controller-based SDN such as the OpenDaylight project, and the spectacular performance improvements in merchant silicon and x86 processors: these have resulted in many innovative products from startups and established vendors, all of whom are keen to participate in OpenStack. The complexity is also due in part to the OpenStack mission expanding from a simple EC2-style IaaS to include legacy data center automation and carrier NFV. Public clouds emphasize abstraction and multi-tenant isolation, features which are less relevant for other users of the technology, and it’s challenging to develop abstractions and APIs which address all of the use cases. There is still a lively debate on which parts of OpenStack are “core” elements of every OpenStack system. (Indeed, the original Nova networking system is still the default; deprecation is planned for the upcoming Icehouse cycle.)

In this exciting and unpredictable environment, my team has been working on a project to manage some of the diversity. In our Dynamic Network Resource Manager (DNRM) Blueprint, we’re proposing a framework for managing the pool of physical and virtual network resources from multiple vendors. It borrows an idea from the OpenStack Nova scheduler: the use of a policy-based resource allocator that abstracts away the complexity of resource management, and allows each cloud operator to choose the resource allocation policy which fits their environment.

We’re demonstrating a proof-of-concept implementation of DNRM that uses the Brocade Vyatta vRouter, probably the most widely used virtual networking appliance. The DNRM resource manager uses Nova to provision a number of Vyatta virtual machines. Then a modified API handler in Neutron intercepts each client request to create an L3 Router, calls the policy-based DNRM allocator to find the best resource instance, examines the type of resource, and calls the appropriate driver (in this case the Vyatta driver) which talks to the VM to configure the vRouter. All of this can be viewed in the OpenStack Horizon dashboard; we’ve added a new panel which displays the state of the resource pool.
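Here’s a much-simplified sketch of that request path. All of the class and method names are invented for illustration; the real Blueprint and proof-of-concept code are considerably more involved.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Resource:
    """A unit in the DNRM-style pool: a physical device or a virtual appliance."""
    resource_id: str
    kind: str          # e.g. "vyatta-vrouter" or "physical-router"
    in_use: bool = False


class Allocator:
    """Policy-based allocator: here the 'policy' is just a ranking function."""

    def __init__(self, pool: List[Resource], policy: Callable[[Resource], float]):
        self.pool = pool
        self.policy = policy

    def allocate(self) -> Optional[Resource]:
        candidates = [r for r in self.pool if not r.in_use]
        if not candidates:
            return None
        best = max(candidates, key=self.policy)
        best.in_use = True
        return best


# One driver per resource type; the real drivers talk to the device or VM.
DRIVERS = {
    "vyatta-vrouter": lambda r: print(f"configuring vRouter VM {r.resource_id}"),
    "physical-router": lambda r: print(f"configuring hardware router {r.resource_id}"),
}


def create_l3_router(allocator: Allocator) -> Resource:
    """Roughly what a modified Neutron API handler would do on a create-router call."""
    resource = allocator.allocate()
    if resource is None:
        raise RuntimeError("resource pool exhausted")
    DRIVERS[resource.kind](resource)   # dispatch on the allocated resource's type
    return resource


# Example policy: prefer virtual appliances (e.g. for dev/test traffic).
pool = [Resource("r1", "physical-router"), Resource("v1", "vyatta-vrouter")]
prefer_virtual = lambda r: 1.0 if r.kind == "vyatta-vrouter" else 0.0
print(create_l3_router(Allocator(pool, prefer_virtual)))
```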

The Blueprint explores a range of use cases that are supported by the DNRM framework. Several of Brocade’s customers are particularly interested in the ability to allocate virtual appliances for dev/test networks and physical systems for production traffic, without changing any code. Others focus on the way it supports resources from multiple vendors, or the ability to choose specific resources to meet compliance requirements.

Inevitably such a comprehensive mechanism as DNRM overlaps several projects within Neutron, including the FWaaS, LBaaS, and VPNaaS work. In recent weeks we’ve been meeting with many of the other contributors to OpenStack to thrash out the details of what a final architecture should look like. I’m looking forward to the Design Summit sessions in Hong Kong, which should lead to agreement on a program of work for the next Icehouse release of OpenStack. It’s going to be complicated, for the reasons that I already mentioned, but I think this increasing complexity emphasizes the need to provide cloud operators with policy-based automation tools.

And when I get back from Hong Kong on the 10th, I’ll see which of those sporting events I still want to watch!

UPDATE: Also pushed (with minor edits) to the Brocade Data Center blog.

Reinventing storage – Ethernet über alles!

This is long. For the “tl;dr” crowd: Seagate just reinvented the disk interface using Ethernet, TCP/IP, Protocol Buffers, and key-value objects. And it’s really, really cool.

But if you have a few minutes to spare…

When I first got involved in data center networking in the early 1980s, there were several competing technologies. The two leaders were Ethernet and Token Ring, and although Bob Metcalfe had invented Ethernet, his first company, 3Com, actually sold both. Within a couple of years, economics, obstinacy by IBM, and a patent troll had taken Token Ring out of the picture, and Ethernet ruled. It quickly evolved from its shared-media topology: in 1987 SynOptics introduced the first Ethernet hub, and two years later Kalpana broke the mold with the first Ethernet switch. Many of us concluded that whatever the future LAN technologies might look like, they would be called Ethernet.

The history of protocol stacks roughly paralleled that of LAN technology. In the early 1980s there were many candidates – NetWare, XNS, ARCnet, NETBEUI, OSI, AppleTalk, and others, as well as TCP/IP. By the end of the decade, TCP/IP had won. Some companies rehosted their application protocols on top of TCP/IP (I’m ashamed to say that my name is on the RFCs for NetBIOS-over-TCP), but most disappeared or pivoted away, like Novell.

Over the last 20 years, we’ve seen a steady process of convergence around Ethernet and TCP/IP. (Metro Ethernet is a fascinating and unexpected example.) Fibre Channel was introduced in 1988 as a replacement for HIPPI in storage area networking. Twenty years later some companies tried to layer the FC protocols directly over Ethernet (FCoE). Most regard this as a failed experiment: although it slightly simplified cabling, the FC protocols were too inflexible to work well in a noisy LAN, and the lack of routability conflicted with data center networking practices. Instead, people started to experiment with storage protocols running over TCP/IP: iSCSI for block access, S3-like HTTP-based protocols for moving large objects around, and the perennial NFS and CIFS for file access.

One area that has so far remained untouched by this process of convergence is the connection between storage devices and computers. Even though the actual technologies have evolved – IDE, ATA, ATAPI, PATA, SCSI, ESDI, SATA, eSATA – the most common storage interconnection topologies are pretty much the same as those IBM introduced with the S/360 mainframe in 1964: a controller device integrated into the computer, communicating with a small number of storage devices over a private short-range interconnect. The “private” bit is important; although various techniques have been created for shared (multi-master) access to the interconnect, all were relatively expensive, and none are supported by the consumer-grade drives which are often used for scale-out storage systems.

Historically, storage servers have been constructed as “black box” turnkey systems, from the Auspex NFS servers in the 1980s to the storage arrays from vendors like EMC and NetApp. More recently, people have been constructing interesting scale-out storage services from commodity hardware, using an x86 with a tray of consumer-grade disks as a building block. However, these architectures are constrained by the single point of failure and performance bottleneck introduced by the private interconnect between CPU and disks. (One odd consequence is that it is often hard to put together an economical “proof of concept” system, because the scale-out algorithms perform poorly with a small number of nodes.)

Over the years there have been various attempts at re-inventing this pattern. Most of these are based on the idea of moving more of the processing to the disk itself, taking advantage of the fact that every disk already has a certain amount of processing capacity to do things like bad sector remapping. Up until now, these efforts have been unsuccessful because of cost or architectural mismatch. But that’s about to change.

Yesterday Seagate introduced its Kinetic Open Storage Platform, and I’m simply blown away by it. It’s a truly elegant design, “as simple as possible, but no simpler”. The physical interconnect to the disk drive is now Ethernet. The interface is a simple key-value, object-oriented access scheme, implemented using Google Protocol Buffers. It supports key-based CRUD (create, read, update and delete); it also implements third-party transfers (“transfer the objects with keys X, Y and Z to the drive with IP address 1.2.3.4”). Configuration is based on DHCP, and everything can be authenticated and encrypted. The system supports a variety of key schemas to make it easy for various storage services to shard the data across multiple drives.
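To give a flavour of what a disk that speaks key-value over the network means for the programmer, here’s a toy, in-memory sketch in Python. The class and method names are mine, not Seagate’s client library, and a real drive would carry these operations over Ethernet as Protocol Buffer messages.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable


@dataclass
class KineticStyleDrive:
    """A toy, in-memory stand-in for an Ethernet-attached key-value drive."""
    address: str                                        # the drive's own IP (assigned via DHCP)
    store: Dict[bytes, bytes] = field(default_factory=dict)

    # Key-based CRUD, roughly the shape of the operations described above.
    def put(self, key: bytes, value: bytes) -> None:
        self.store[key] = value

    def get(self, key: bytes) -> bytes:
        return self.store[key]

    def delete(self, key: bytes) -> None:
        self.store.pop(key, None)

    # Third-party transfer: ask this drive to push objects straight to another
    # drive, with no host CPU in the data path.
    def transfer(self, keys: Iterable[bytes], destination: "KineticStyleDrive") -> None:
        for key in keys:
            destination.put(key, self.store[key])


src = KineticStyleDrive("10.0.0.11")
dst = KineticStyleDrive("10.0.0.12")
src.put(b"swift/object/0001", b"hello")
src.transfer([b"swift/object/0001"], dst)
assert dst.get(b"swift/object/0001") == b"hello"
```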

I love this design.

Don’t fall into the trap of thinking that this means we’ll see thousands upon thousands of individual smart disks on the data center LANs. That’s not the goal. (Or I don’t think it is.) EMC or NetApp can still use these drives to build big honking storage arrays, if they want to. The difference is that they have much more freedom in designing the internals of those arrays, because they don’t have to use one kind of (severely constrained) technology for one kind of traffic (disk data) and a completely different kind of technology for their internal HA traffic. They’re free to develop new kinds of internal topologies based on Ethernet, and to implement their services more efficiently using the Kinetic API.

For those vendors who are building out commodity-based scale-out storage, things are even more exciting. It becomes possible to build extremely scalable, highly-available configurations using commodity Ethernet switches. And the servers used to implement the external storage service – Swift, Gluster, Ceph, NFS – are likely to change, too: CPU, RAM for caching, multiple NICs, little or no PCI, a little SSD – no moving parts. Perhaps someone will integrate one into a top-of-rack switch, to produce a very efficient dense array for cool or cold storage.
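To see why those front-end servers get so much simpler, consider how a storage service might map object keys to Ethernet-attached drives. This is just a hedged sketch of the core idea; a production system would use consistent hashing (as Swift’s ring does) so that adding a drive moves only a fraction of the keys.

```python
import hashlib

# Each "drive" is just an IP address on the storage network.
DRIVES = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]


def drive_for_key(key: bytes) -> str:
    """Pick the drive responsible for an object key by hashing the key."""
    digest = hashlib.md5(key).digest()
    return DRIVES[int.from_bytes(digest[:4], "big") % len(DRIVES)]


print(drive_for_key(b"account/container/object-42"))
```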

[Photo: Jim Hughes and a prototype Kinetic drive]

A bunch of very smart engineers at Seagate have developed this system (that’s Jim Hughes, allowing me to touch a prototype unit), but they know it won’t be accepted if it’s proprietary. So they’re opening up the protocol, the clients, and a simulator for design verification. If everything works out, this will become the new standard interface for disk drives. (And, well, any kind of mass storage.)

This is going to be fun. “Disruptive” seems inadequate.

Dynamic network resource management in OpenStack

In several recent blog pieces, here and here, I’ve noted that the central use case for OpenStack – to implement an AWS-like NIST-compliant infrastructure-as-a-service – has broadened over the last three years. Today, OpenStack is being used (or at least considered) for automating the management of traditional enterprise data centers, including infrastructure and applications which don’t fit the original model very well. We can see this in developments such as enabling multiple hypervisors in a single OpenStack cloud, and adding support for Fibre Channel SANs to the Cinder storage service. We’re also seeing interest in the use of specialized resources to allow performance-sensitive scale-up applications to run under OpenStack.

All of this means that the original vision of cloud infrastructure – homogeneous, pooled, highly abstracted – is being replaced with a more complex environment. We have specialized resources from different vendors, including physical devices and virtual appliances. And when you have a heterogeneous environment, you need some kind of policy based automation to allocate the right resource to the right task. Unfortunately, the OpenStack networking architecture, Neutron, does not accommodate heterogeneity very well, and there is no standardized framework for managing virtual appliances.

I work at Brocade, which has a particular interest in this problem. We sell the most popular virtual network appliance, the Vyatta vRouter. And while we have a broad range of IP and SAN products, most of which are supported in OpenStack, almost all of our customers are running multivendor networks.

So we’ve decided to do something about it. Yesterday we submitted an OpenStack Blueprint for a project called DNRM:

This blueprint proposes the addition to OpenStack of a framework for dynamic network resource management (DNRM). This framework includes a new OpenStack resource management and provisioning service, a refactored scheme for Neutron API extensions, a policy-based resource allocation system, and dynamic mapping of resources to plugins. It is intended to address a number of use cases, including multivendor environments, policy-based resource scheduling, and virtual appliance provisioning. We are proposing this as a single blueprint in order to create an efficiently integrated implementation.

This is being submitted now for discussion in Hong Kong. We also plan to demonstrate a proof-of-concept at the summit. The target for this work is Icehouse.

I’ve submitted a proposal for a DNRM session at the OpenStack Summit in Hong Kong in November at which I’ll present the architectural features and customer benefits. If you’re involved in OpenStack (and even if you’re not!), I hope that you’ll vote for it.

Prepared notes for the “AWS APIs in OpenStack” debate

This evening there’s a meetup in San Francisco at which people will be debating the role of AWS APIs in OpenStack. It should be a lot of fun; at least 150 people have RSVP’d. I’m planning to be there, but just in case I thought I’d put down my position in advance. I’ve blogged about much of this stuff previously, so for regular readers this may be tl;dr.

The Great AWS API Debate.

I want to begin by emphasizing that I’m a longtime supporter of OpenStack. I’ve been involved for the last couple of years; my employer, Brocade, has contributed to both Neutron and Cinder, including FC SAN support. My day job is to enable multivendor NFV for OpenStack.

Two years ago, I developed an IaaS architecture for Yahoo that was based on OpenStack. We decided from the start that we would only use the AWS (EC2, S3, EBS) APIs; we would not expose the OpenStack API to users. There were two motivations for this. First, we wanted to be able to leverage the work of the AWS ecosystem; today I would cite the importance of being able to exploit Netflix OSS. Second, I was concerned about the stability of the native OpenStack API, and the likelihood of inconsistent feature support and ad hoc extension.

Today, my concerns seem quite justified. What I had not anticipated was that OpenStack development would be so slow. I had hoped that after three years we might have replicated the major AWS features from 2009, but this has not happened. In large measure, I attribute this failure to the absence of a crisp, immutable set of requirements. AWS API compatibility would have provided these requirements.

A few open source projects have the luxury of being first movers. This was not the case for two of the most successful projects: Linux and MySQL. In each case, the project succeeded by embracing and extending an existing standard API. Linux would not have succeeded without POSIX. MySQL would not have succeeded without SQL. The reasons are straightforward. Implementing a broadly accepted standard reduces the costs of adoption; it also provides a shared goal, a clear set of requirements.

There are two common arguments against focussing on the AWS APIs. The first is that we give up control; that Amazon will set the agenda going forward. This is naive. Amazon’s ability to make radical changes to the API is severely constrained by the long-term needs of their customers. They are not stupid. (Disclaimer: I used to work there.) The second is that it would limit the ability of the OpenStack community to innovate; that it’s too soon to settle on a single definition of IaaS. This would carry more weight if there were actual evidence that OpenStack was out-innovating AWS.

Having said all of this, where do I stand on the current question? Unfortunately, I think that it may be too late. I’m concerned that we’ve accepted so many slightly divergent implementations and extensions that it may be impossible to define an OpenStack subset which does provide the right degree of compatibility. It’s worth re-reading Rob Hirschfeld’s series of blog posts on “OpenStack Core”; the relationship of a “core” to AWS compatibility is really unclear. It’s also important to recognize that OpenStack is being applied to a wider set of use cases than many of us expected. In addition to AWS-style public or private clouds, with uniform pooled resources, OpenStack is being used for general enterprise data center automation, including the management of legacy (and distinctly un-cloudy) resources. That’s not going away.

Tech blogs that I follow

A colleague of mine asked me for a list of the tech blogs that I follow. Little did she know what she was asking for.

Let me make a couple of points clear. First, I don’t read every item in every blog. I use Feedly, so I can quickly scan the subject lines in the unread messages and click on the ones of interest. The first group of feeds only generates about 20-40 items a day, so it’s quite manageable. A few of these are close to moribund, but I’ve left them in the list just in case the authors wake up. (I’ve had several lengthy breaks on my blogs.) My favorites are in bold. The last five feeds are the way I get my tech news of the day. This generates a lot of hits: 80-120 a day, sometimes more. There’s a lot of duplication, of course. You can improve the signal-to-noise ratio by skipping The Register, but I like the snarky British style that it brings.

I don’t claim that this is a balanced selection of tech sources. It’s biased towards cloud computing and networking, with a bit of security, but mostly it’s about what I’ve found useful over the last ten years or so. (And to my friends and colleagues whom I’ve overlooked: mea culpa – please set me right.) Note that a few of these are not pure tech content. I don’t think any are NSFW, but I can’t guarantee that.

In my Feedly configuration, my tech feeds are tagged as “B-tech”. The news feeds are “T-technews”. This is because Feedly sorts the tagged groups alphabetically. The group “A-mustread” is for my very favorite, mostly non-tech sources: James Fallows, xkcd (and What If?), Tim Bray, John Scalzi, Jerry Coyne, and Andrew Sullivan. I also have groups for atheism, science, philosophy, music, aviation, and political punditry.

Enjoy.

http://nighthacks.com/roller/jag
http://perfcap.blogspot.com
http://aws.typepad.com/aws
http://highscalability.com
http://perspectives.mvdirona.com
http://www.rationalsurvivability.com/blog
http://www.rightscale.com/blog
http://agiletesting.blogspot.com
http://alestic.com
http://www.allthingsdistributed.com
http://www.aristanetworks.com/en/blogs
http://blog.gardeviance.org
https://www.opennetworking.org/component/wordpress/?Itemid=283
http://www.chipchilders.com
http://keepingitclassless.net
http://aneelism.com/blog
http://blog.scottlowe.org
http://bradhedlund.com
http://blogs.bromium.com
http://blog.ipspace.net
http://cloud-computing-today.com
http://shlomoswidler.com
http://www.dzone.com/mz/cloud
http://www.cloudave.com
http://www.cloudbzz.com
http://cloudcomputing.info/en
http://www.cloudsofchange.com
http://buildacloud.org/blog/latest.html
http://www.shapeblue.com/blog
http://www.cloudscaling.com/blog
http://engineering.cloudscaling.com
http://cloudywords.com
http://blog.cohesiveft.com
http://www.cumulogic.com/resources/blog
http://www.25hoursaday.com/weblog
http://www.datacenterknowledge.com
http://davidchappellopinari.blogspot.com
http://etherealmind.com/category/blog
http://eucalyptus.ulitzer.com
http://www.fryguy.net
http://googlecloudplatform.blogspot.com
http://googleenterprise.blogspot.com
http://www.igvita.com
http://it20.info
http://blogs.forrester.com/james_staten
http://www.jedelman.com/index.html
http://www.mirantis.com/blog
http://redmonk.com/jgovernor
http://networkheresy.com
http://www.network-janitor.net
http://networkstatic.net
http://www.projectfloodlight.org/blog
http://openlife.cc/blog
http://www.ossline.com
http://blog.fosketts.net
http://packetpushers.net
http://www.packet-forwarding.net
http://www.pistoncloud.com/blog
http://pluribusnetworks.com/blog
http://www.jroller.com/MasterMark
http://eranki.tumblr.com
http://rajdavies.blogspot.com
http://robhirschfeld.com
http://www.sdncentral.com
http://sebgoa.blogspot.com
http://www.techrepublic.com/blog/the-enterprise-cloud
http://shitmycloudsays.tumblr.com
http://siwdt.com
http://blog.softlayer.com
http://www.speakingofclouds.com
http://srinathsview.blogspot.com
http://www.stackops.com/blog
http://storagemojo.com
http://telecomoccasionally.wordpress.com
http://www.definethecloud.net
http://www.convirture.com/blog
http://diversity.net.nz
http://lonesysadmin.net
http://blog.theloosecouple.com
http://techblog.netflix.com
http://www.openstack.org/blog
http://www.virtualizationpractice.com
http://gevaperry.typepad.com/main
http://nathanmarz.com
http://virtualization.info/en
http://blog.cimicorp.com
http://blog.dshr.org
http://webmink.com
http://dehora.net
http://blog.jcole.us
http://ostatic.com/blog
http://www.schneier.com/blog
http://kaminario.com/blog
http://www.itskeptic.org
http://networkingnerd.net

And for the news:

http://www.theregister.co.uk
http://gigaom.com
http://pandodaily.com
http://venturebeat.com
http://allthingsd.com