Saturday, May 14, 2011

Reliability and the Cloud - Redundancy required

Hardly anyone noticed, but yet another Google cloud service failed this week. Amazon's recent service failure (and its 2008 outage) understandably drew more attention. These failures aren't a surprise; I've had my share of complaints about Google's cloud services.

Despite all of the problems with Cloud services, of which the most serious is Cloud provider bankruptcy, Amazon and Google are relatively reliable. In my corporate workplace, the average worker loses 2-5 days of work each year due to machine upgrades, backup failures and hardware failures. Cloud services aren't quite that bad, but corporate IT is a low standard. Cloud services aren't good enough.

The answer to Cloud reliability is redundancy. The designers of the late-20th-century American space shuttle knew this well ...

... The shuttle uses five identical redundant IBM 32-bit general purpose computers (GPCs), model AP-101, constituting a type of embedded system. Four computers run specialized software called the Primary Avionics Software System (PASS). A fifth backup computer runs separate software called the Backup Flight System (BFS). Collectively they are called the Data Processing System (DPS)....

The design goal of the shuttle's DPS is fail-operational/fail-safe reliability. After a single failure, the shuttle can still continue the mission. After two failures, it can still land safely.

The four general-purpose computers operate essentially in lockstep, checking each other. If one computer fails, the three functioning computers "vote" it out of the system...

The Backup Flight System (BFS) is separately developed software running on the fifth computer, used only if the entire four-computer primary system fails. The BFS was created because although the four primary computers are hardware redundant, they all run the same software, so a generic software problem could crash all of them ...
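The voting scheme described above can be sketched in a few lines. This is an illustrative model of majority voting among redundant computers, not the shuttle's actual implementation; the function name and failure handling are my own.

```python
from collections import Counter

def vote(outputs):
    """Return the majority output from redundant computers, along with
    the indices of any computers whose output disagrees (voted out)."""
    value, votes = Counter(outputs).most_common(1)[0]
    if votes <= len(outputs) // 2:
        # No majority: analogous to falling back to the Backup Flight System.
        raise RuntimeError("no majority among redundant computers")
    voted_out = [i for i, out in enumerate(outputs) if out != value]
    return value, voted_out

# Three of four computers agree; the third (index 2) is voted out.
result, voted_out = vote([42, 42, 17, 42])  # result == 42, voted_out == [2]
```

Note that voting protects against independent hardware faults, but not against a common software bug that makes all four computers agree on the same wrong answer — which is exactly why the BFS runs separately developed software.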

It's not hard to do the math. A series of 5 procedures, each 90% reliable, has about a 41% chance of failure (1 - 0.9^5 ≈ 0.41). A system of 5 similarly reliable components run in parallel, where all must fail for the system to fail, has only a 0.001% chance of failure (0.1^5).
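The arithmetic above, spelled out (assuming independent failures in both cases):

```python
# Each of 5 components is 90% reliable, i.e. 10% failure probability.
p_fail = 0.1

# In series: the chain fails if ANY component fails.
series_failure = 1 - (1 - p_fail) ** 5    # ~0.41, about a 41% chance

# In parallel: the system fails only if ALL components fail.
parallel_failure = p_fail ** 5            # 0.00001, i.e. 0.001%
```

The independence assumption is the catch: it's what footnote [1] is about, and it's why the parallel systems should be as different from each other as possible.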

In Cloud terms, similar redundancy can come from multiple service providers, with the ability to switch over. File requests, for example, could be alternately routed to both Amazon S3 and to a corporate-owned server. Reliability comes from two very different systems with uncorrelated failure probabilities [1][2].

This switchover requirement requires Cloud services to be dumb utilities - or to support some kind of local cache. To safely use Google Docs, for example, there has to be some way to fail over to a local device, perhaps by synchronizing files to a local store. Similarly, to use a Cloud blogging service one would want control of the domain name, and blog software that published to two services simultaneously. In the event of failure, the domain name could be redirected to the redundant server.
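The failover pattern is simple to sketch. Everything here is hypothetical - the fetch function, the in-memory stand-in for a synchronized local store, and the simulated outage are all illustrative, not a real S3 or Google API:

```python
# Hypothetical synchronized local cache (stand-in for a local file store).
LOCAL_STORE = {"report.txt": b"cached copy"}

def fetch_from_cloud(path):
    """Stand-in for a real cloud request; here it simulates an outage."""
    raise ConnectionError("provider unreachable")

def fetch(path):
    """Try the cloud provider first; fail over to the local store."""
    try:
        return fetch_from_cloud(path)
    except ConnectionError:
        return LOCAL_STORE[path]

fetch("report.txt")  # returns b"cached copy" despite the outage
```

The hard part in practice isn't the switch itself but keeping the local store synchronized, which is why dumb-utility services that expose plain files are so much easier to make redundant.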

None of this is new. Back in the days when Cloud services were called "Application Service Providers" (ASP) I went through the same reasoning process with our web-based Electronic Health Record. I'm sure there were very similar discussions in the 1970's era of 'dumb terminals'. These things take time.

We'll know the Cloud is maturing when failover strategies become ubiquitous. Of course, by then we'll call the Cloud something else ...

[1] Of course, then the switch itself can fail. There are always failure points; the trick is to apply redundancy where components are least reliable, or where redundancy is most cost-effective. The Shuttle, infamously, couldn't survive a launch failure of its solid rocket boosters.
[2] From a security perspective, two systems like this are two sources of security failure. Multiple systems increase reliability, but decrease security.

2 comments:

Anonymous said...

I think that the security perspective depends on both the system design and the threat model. The system you describe, for instance, would be more vulnerable to data theft but less vulnerable to denial of service attacks.

The principle of diversity is useful in both security and reliability contexts. There has even been work on creating artificial diversity for increased security.

JGF said...

I see what you mean; there's a fuzzy boundary between security and reliability. I was thinking more in terms of data theft - good observation.