Reinventing storage – Ethernet über alles!

This is long. For the “tl;dr” crowd: Seagate just reinvented the disk interface using Ethernet, TCP/IP, Protocol Buffers, and key-value objects. And it’s really, really cool.

But if you have a few minutes to spare…

When I first got involved in data center networking in the early 1980s, there were several competing technologies. The two leaders were Ethernet and Token Ring, and although Bob Metcalfe had invented Ethernet, his first company 3Com actually sold both. Within a couple of years, economics, obstinacy by IBM, and a patent troll had taken Token Ring out of the picture, and Ethernet ruled. It quickly evolved from its shared-media topology: in 1987 SynOptics introduced the first Ethernet Hub, and two years later Kalpana broke the mold with the first Ethernet switch. Many of us concluded that whatever the future LAN technologies might look like, they would be called Ethernet.

The history of protocol stacks roughly paralleled that of LAN technology. In the early 1980s there were many candidates – NetWare, XNS, ARCnet, NETBEUI, OSI, AppleTalk, and others, as well as TCP/IP. By the end of the decade, TCP/IP had won. Some companies rehosted their application protocols on top of TCP/IP (I’m ashamed to say that my name is on the RFCs for NetBIOS-over-TCP), but most disappeared or pivoted away, like Novell.

Over the last 20 years, we’ve seen a steady process of convergence around Ethernet and TCP/IP. (Metro Ethernet is a fascinating and unexpected example.) Fibre Channel was introduced in 1988 as a replacement for HIPPI in storage area networking. Twenty years later some companies tried to layer the FC protocols directly over Ethernet (FCoE). Most regard this as a failed experiment: although it slightly simplified cabling, the FC protocols were too inflexible to work well in a noisy LAN, and the lack of routability conflicted with data center networking practices. Instead, people started to experiment with storage protocols running over TCP/IP: iSCSI for block access, S3-like HTTP-based protocols for moving large objects around, and the perennial NFS and CIFS for file access.

One area that has so far remained untouched by this process of convergence is the connection between storage devices and computers. Even though the actual technologies have evolved – IDE, ATA, ATAPI, PATA, SCSI, ESDI, SATA, eSATA – the most common storage interconnection topologies are pretty much the same as those IBM introduced with the S/360 mainframe in 1964: a controller device integrated into the computer, communicating with a small number of storage devices over a private short-range interconnect. The “private” bit is important; although various techniques have been created for shared (multi-master) access to the interconnect, all were relatively expensive, and none are supported by the consumer-grade drives which are often used for scale-out storage systems.

Historically, storage servers have been constructed as “black box” turnkey systems, from the Auspex NFS servers in the 1980s to the storage arrays from vendors like EMC and NetApp. More recently, people have been constructing interesting scale-out storage services from commodity hardware, using an x86 with a tray of consumer-grade disks as a building block. However, these architectures are constrained by the single point of failure and performance bottleneck introduced by the private interconnect between CPU and disks. (One odd consequence is that it is often hard to put together an economical “proof of concept” system, because the scale-out algorithms perform poorly with a small number of nodes.)

Over the years there have been various attempts at reinventing this pattern. Most of these are based on the idea of moving more of the processing to the disk itself, taking advantage of the fact that every disk already has a certain amount of processing capacity to do things like bad sector remapping. Up until now, these efforts have been unsuccessful because of cost or architectural mismatch. But that’s about to change.

Yesterday Seagate introduced its Kinetic Open Storage Platform, and I’m simply blown away by it. It’s a truly elegant design, “as simple as possible, but no simpler”. The physical interconnect to the disk drive is now Ethernet. The interface is a simple key-value, object-oriented access scheme, implemented using Google Protocol Buffers. It supports key-based CRUD (create, read, update and delete); it also implements third-party transfers (“transfer the objects with keys X, Y and Z to the drive with IP address 1.2.3.4”). Configuration is based on DHCP, and everything can be authenticated and encrypted. The system supports a variety of key schemas to make it easy for various storage services to shard the data across multiple drives.
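
To make the shape of that interface concrete, here is a minimal sketch of the access model in Python. To be clear, this is not Seagate’s client library: the class and method names are invented, and an in-memory dictionary stands in for a real drive and its Protocol Buffers exchange over the wire. It is only meant to show the flavor of key-based CRUD, third-party transfers, and sharding by key.

```python
# Hypothetical sketch only -- not the actual Kinetic client API.

class SketchDrive:
    """Stands in for one Ethernet-attached key-value drive."""

    def __init__(self, address: str):
        self.address = address
        self._store = {}          # key (bytes) -> value (bytes)

    # Key-based CRUD
    def put(self, key: bytes, value: bytes) -> None:
        self._store[key] = value

    def get(self, key: bytes) -> bytes:
        return self._store[key]

    def delete(self, key: bytes) -> None:
        self._store.pop(key, None)

    # Third-party transfer: "transfer the objects with these keys
    # to the drive at that address"
    def push(self, keys, target: "SketchDrive") -> None:
        for key in keys:
            target.put(key, self._store[key])


# A storage service can shard a flat key space across many drives,
# here by hashing the key. (The key schemas mentioned above are there
# to make exactly this kind of sharding easy.)
drives = [SketchDrive(f"10.0.0.{i}") for i in range(1, 5)]

def drive_for(key: bytes) -> SketchDrive:
    return drives[hash(key) % len(drives)]

key = b"user42/photos/0001"
drive_for(key).put(key, b"...jpeg bytes...")
print(drive_for(key).get(key), "stored on", drive_for(key).address)
```

The push() call is the interesting part: it hints at how replication or rebalancing traffic could flow drive-to-drive over the LAN without passing through an intermediate host.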

I love this design.

Don’t fall into the trap of thinking that this means we’ll see thousands upon thousands of individual smart disks on the data center LANs. That’s not the goal. (Or I don’t think it is.) EMC or NetApp can still use these drives to build big honking storage arrays, if they want to. The difference is that they have much more freedom in designing the internals of those arrays, because they don’t have to use one kind of (severely constrained) technology for one kind of traffic (disk data) and a completely different kind of technology for their internal HA traffic. They’re free to develop new kinds of internal topologies based on Ethernet, and to implement their services more efficiently using the Kinetic API.

For those vendors who are building out commodity-based scale-out storage, things are even more exciting. It becomes possible to build extremely scalable, highly available configurations using commodity Ethernet switches. And the servers used to implement the external storage service – Swift, Gluster, Ceph, NFS – are likely to change, too: CPU, RAM for caching, multiple NICs, little or no PCI, a little SSD, and no moving parts. Perhaps someone will integrate one into a top-of-rack switch, to produce a very efficient dense array for cool or cold storage.

[Photo: Jim Hughes showing the author a Kinetic prototype drive]

A bunch of very smart engineers at Seagate have developed this system (that’s Jim Hughes, allowing me to touch a prototype unit), but they know it won’t be accepted if it’s proprietary. So they’re opening up the protocol, the clients, and a simulator for design verification. If everything works out, this will become the new standard interface for disk drives. (And, well, any kind of mass storage.)

This is going to be fun. “Disruptive” seems inadequate.

Comments

  1. Ladd's Gravatar Ladd
    October 25, 2013 - 1:29 AM | Permalink

    Very cool. Like all brilliant ideas, it is obvious once you see it.

  2. tres fou's Gravatar tres fou
    October 25, 2013 - 5:25 AM | Permalink

    Hmm, interesting. I wonder what the imagined use-cases for this would be. And what the latencies of the system would be.

  3. October 25, 2013 - 6:12 AM

    “The “private” bit is important; although various techniques have been created for shared (multi-master) access to the interconnect, all were relatively expensive, and none are supported by the consumer-grade drives which are often used for scale-out storage systems.”

    I was working on multi-master storage systems using parallel SCSI in 1994. Nowadays you can get an FC or SAS disk array for barely more than a JBOD enclosure. Shared storage is neither new nor expensive. It’s not common at the single-disk layer, but it’s not clear why that should matter.

    The idea of network disks with an object interface isn’t all that new either. NASD (http://www.pdl.cmu.edu/PDL-FTP/NASD/Talks/Seagate-Dec-14-99.pdf) did it back in ’99, and IMO did it better (see http://pl.atyp.us/2013-10-comedic-open-storage.html for the longer explanation).

    “Don’t fall into the trap of thinking that this means we’ll see thousands upon thousands of individual smart disks on the data center LANs. That’s not the goal.”

    …and yet that’s exactly what some of the “use cases” in the Kinetics wiki show. Is it your statement that’s incorrect, or the marketing materials Seagate put up in lieu of technical information?

    “they don’t have to use one kind of (severely constrained) technology for one kind of traffic (disk data) and a completely different kind of technology for their internal HA traffic.”

    How does Kinetic do anything to help with HA? Array vendors are not particularly constrained by the interconnects they’re using now. In the “big honking” market, Ethernet is markedly inferior to the interconnects they’re already using internally, and doesn’t touch any of the other problems that constitute their value add – efficient RAID implementations, efficient bridging between internal and external interfaces (regardless of the protocol used), tiering, fault handling, etc. If they want to support a single-vendor object API instead of several open ones that already exist, then maybe they can do that more easily or efficiently with the same API on the inside. Otherwise it’s just a big “meh” to them.

    At the higher level, in *distributed* filesystems or object stores, having an object store at the disk level isn’t going to make much difference either. Because the Kinetics semantics are so weak, they’ll have to do for themselves most of what they do now, and performance isn’t constrained by the back-end interface even when it’s file based. Sure, they can connect multiple servers to a single Kinetics disk and fail over between them, but they can do the same with a cheap dual-controller SAS enclosure today. The reason they typically don’t is not because of cost but because that’s not how modern systems handle HA. The battle between shared-disk and shared-nothing is over. Shared-nothing won. Even with an object interface, going back to a shared-disk architecture is a mistake few would make.

  4. Ivanna Humpala's Gravatar Ivanna Humpala
    October 25, 2013 - 6:36 AM | Permalink

    Interesting, but what about performance? Unless you use 10GigE, you won’t see performance comparable to SATA.

    • jrjr's Gravatar jrjr
      October 25, 2013 - 6:14 PM | Permalink

      but at least the interface won’t be intermittent half the time, like SATA

    • Charlie Pearce's Gravatar Charlie Pearce
      October 26, 2013 - 4:29 AM | Permalink

      Perhaps this could be the application that finally drives the cost of 10GigE down.

  5. Philip de Louraille
    October 28, 2013 - 1:11 PM

    First the cloud, now this. The NSA thanks you. Makes everything available…

  6. October 30, 2013 - 7:31 PM

    I would like to see Kinetic drives cluster together using some sort of swarming and distribution algorithm, and completely remove the need for middleware like OpenStack Swift, Hadoop, etc. The applications would directly talk to the Seagate Kinetic drives using REST APIs. Is this possible?

    • October 30, 2013 - 9:17 PM

      It’s not really feasible, Saqib. At the very least, those clients need to have an accurate picture of where all the disks are. Otherwise, two clients might create the same key on two different disks, with different contents. That’s not even eventual consistency; it’s just a mess. Who’s going to maintain that information, or the information needed for security? Who’s going to recognize when data needs to be re-replicated, or rebalanced across an ever-changing set of drives? Sure, you can have the clients do all that, but as soon as you start putting authoritative information on clients and expecting them to look after it, they’re partly servers. That in turn gets you into a whole world of problems associated with a large and unstable set of servers, some of them not configured optimally for the role, and that’s an even harder problem than the one more explicitly server-oriented systems have.

      In some ways the system you describe is a lot like how GlusterFS (the project I work on) is structured. We’ve actually had to solve a lot of the coordination problems that Seagate’s “markeneers” don’t even seem to know about. We try to keep the lowest-level “bricks” as dumb as possible, with as much logic as possible on the clients, and even so we rely on richer semantics than Kinetic offers in order to maintain adequate behavior and performance.

      Before anything like this could work, the Kinetic folks would have to write about ten times more code than they have already, to run on drives that are then effectively micro-servers. They’d probably need faster processors and more memory too – increasing cost and power/heat, decreasing density. At that point why not just use something like the ARM micro-servers that are already here? I have a quad-core one upstairs right now, about the size of a credit card. Tweak that a little bit, amortize the cost of one over several 2.5″ drives as they can be bought for peanuts today, and you have something that can beat Kinetic in every dimension.
