Live, learn, fix, repeat: a conversation with Derrick Harris

On Monday I had a long conversation with Derrick Harris of GigaOM, which turned into a piece entitled How to deal with cloud failure: Live, learn, fix, repeat. This was very well received (“one of the most intelligent articles I’ve read on this topic”), and I’d like to add a couple of related observations that we couldn’t work into the conversation.

Large-scale cloud computing systems are characterized by two important properties. First, many of their functions are asynchronous, not transactional. The user issues an API call to start a VM in EC2 or OpenStack, the service acknowledges the request, and at some point later on the operation is completed. Usually. Perhaps not; maybe the request will fail. This is hard enough to deal with on its own; when we start to compose multiple asynchronous mechanisms, the complexity can spiral out of control. For example, what happens when we bring together:

  • An autoscaling system which monitors traffic and adds or removes capacity (VMs) as needed.
  • A software deployment system which can roll out a new release to an existing fleet of machines at a controlled rate, and supports roll-back if the new software proves buggy.
  • A self-healing mechanism which detects failed or “sick” instances in a fleet of VMs, and restarts or replaces them.
  • A fault injection system (like the Netflix “Chaos Monkey”) which exercises the robustness of the system by periodically killing components.
  • And all of these using an asynchronous API for managing VM instances.

While it may be possible to predict the eventual state of such a system, the path to that state, and the time to reach it, are extraordinarily hard to wrap your head around.
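To make the asynchronous pattern concrete, here is a minimal sketch in Python. The `start_vm`, `describe_vm`, and `terminate_vm` calls are hypothetical stand-ins for a provider API, not any real SDK: the request is only acknowledged, and the caller must poll, handle an eventual failure, and give up (and clean up) after a timeout.

```python
import time

# Hypothetical asynchronous VM API -- stand-ins for EC2/OpenStack-style calls.
# start_vm() only *acknowledges* the request; the VM is built later, or not at all.

def wait_for_vm(client, image_id, timeout=300, poll_interval=5):
    """Request a VM and poll until it is running, has failed, or we give up."""
    vm_id = client.start_vm(image_id)               # returns immediately with an ID
    deadline = time.time() + timeout
    while time.time() < deadline:
        state = client.describe_vm(vm_id)["state"]  # e.g. pending / running / error
        if state == "running":
            return vm_id
        if state == "error":
            raise RuntimeError(f"VM {vm_id} failed to start")
        time.sleep(poll_interval)                   # still pending: keep waiting
    # The request may still complete later -- terminate so we don't leak capacity.
    client.terminate_vm(vm_id)
    raise TimeoutError(f"VM {vm_id} not running after {timeout}s")
```

Now imagine each of the systems in the list above – autoscaler, deployer, self-healer, chaos monkey – running a loop like this concurrently against the same fleet. That concurrency is where the unpredictable interactions come from.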

The second characteristic of today’s cloud systems is the massive level of replication. S3 or Swift objects are stored in multiple data centers, service instances are replicated behind load balancers, load-balanced clusters are replicated in multiple data centers behind DNS anycasting or round-robin, and so forth. This is nothing new: we’ve been replicating, clustering, and doing fail-over since the 1960s. However, at scale we cannot always afford to wait until replication has been completed – we can’t wrap a neat transaction around it – and we have embraced designs based on “eventual consistency” as a way of boosting throughput.
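To illustrate what “eventual consistency” means in practice, here is a toy sketch – purely illustrative, not any real store’s API. A write is acknowledged as soon as one replica has it, propagation to the other replicas happens in the background, and a read served by a lagging replica returns stale (or missing) data until propagation catches up.

```python
import random

class ToyEventualStore:
    """Toy model of a replicated key-value store with asynchronous replication."""
    def __init__(self, n_replicas=3):
        self.replicas = [{} for _ in range(n_replicas)]
        self.pending = []                          # replication queue: (replica_idx, key, value)

    def put(self, key, value):
        self.replicas[0][key] = value              # acknowledge once one replica has the write
        for i in range(1, len(self.replicas)):
            self.pending.append((i, key, value))   # the others are updated later

    def get(self, key):
        replica = random.choice(self.replicas)     # any replica may serve the read
        return replica.get(key)                    # may be stale, or missing entirely

    def replicate_step(self):
        if self.pending:                           # background propagation, one step at a time
            i, key, value = self.pending.pop(0)
            self.replicas[i][key] = value

store = ToyEventualStore()
store.put("profile:42", "new-avatar.png")
print(store.get("profile:42"))   # may print None until replicate_step() has run enough times
```

Trading the transaction away buys throughput, but it also means every reader has to tolerate the window in which replicas disagree.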

Put these two trends together, and we have systems that are radically asynchronous, massively replicated, and non-deterministic. These systems are hard to work with. Developers and operators like things to be transactional, linear, deterministic: “get used to disappointment”. We’re still learning how to work with systems like this, and we won’t always get it right. I was particularly peeved to read a simplistic assessment of the recent AWS outage in which the author claimed that…

… affected customers — Instagram, Pinterest, Pocket and Netflix, which all suffered from Amazon’s recent outage on the weekend — hadn’t used the ability of the cloud to create geographically redundant links.

“They could operate at a higher level of redundancy, so that these sort of outages would only have a minimal impact on them. It’s a matter of cost,” Bettin said.

In a word, bullshit. I know many of the guys who built these systems, and they have all incorporated high levels of redundancy in their systems. None of them are “backup cheapskates” – that’s an insult which simply betrays the author’s failure to appreciate the complexity of the domain. I don’t know all of the root causes of their outages, but I’m prepared to bet that most were due to bugs arising from complex interactions between multiple systems, of the kind that I discussed earlier.

As I said to Derrick, I don’t expect the failure rate to plateau any time soon, because we are continually expanding scale, pushing performance, and introducing new sources of complexity. I do think that our tools for modeling, diagnosing, and repairing these systems are improving rapidly, however. I’ve mentioned Chaos Monkey as an example of fault injection techniques; I also expect to see more systems adopting anti-entropy elements, and dedicating a portion of the capacity of every system to in-production testing of functionality and performance.
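“Anti-entropy” here means a background process that continually compares replicas and repairs divergence, rather than assuming replication always succeeded. A minimal sketch, building on the hypothetical toy store above (real systems use Merkle trees, version vectors, and peer-to-peer reconciliation rather than a designated primary):

```python
def anti_entropy_pass(store):
    """Compare every replica against replica 0 and repair any divergence."""
    primary = store.replicas[0]
    repairs = 0
    for replica in store.replicas[1:]:
        for key, value in primary.items():
            if replica.get(key) != value:     # replica is stale or missing the key
                replica[key] = value          # repair it in place
                repairs += 1
    return repairs                            # report how much entropy we removed

# Run periodically, alongside normal traffic:
print(anti_entropy_pass(store), "keys repaired")
```

The same idea scales up to in-production testing: dedicate a slice of every system’s capacity to continuously exercising, comparing, and repairing, instead of trusting that the happy path held.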

1 Comment to "Live, learn, fix, repeat: a conversation with Derrick Harris"

  1. July 15, 2012 - 5:36 PM

    +1

    From an operational perspective, many think you need to "own" the infrastructure to prevent these outages. What is often overlooked is that self-owned systems have extended outages too, and with far fewer people working on the problem. At cloud scale you often get more intellectual scale, simply because of the number of people working on the problem.

    As you know, someone once said "most of the smart people do not work for your company". The key is finding how best to leverage all those smart people.

    Yes, live, learn, fix, repeat . . . and it will happen faster, with more intellectual power being added to the fixers all the time.
