Monday, February 25, 2008

Challenges of software as service: Spolsky's version

Amazon's S3 was down the other day. A lot of companies that use S3 were down too.

Such are the woes of 'software as service' -- what we used to call "application service provision".

Amazon didn't say much about the S3 outage, but when something similar happened to Joel Spolsky's product he had quite a few interesting comments:
Five whys - Joel on Software

..Most well-run online services will have two, maybe three outages a year. With so few data points, the length of the outage starts to become really significant, and that's one of those things that's wildly variable. Suddenly, you're talking about how long it takes a human to get to the equipment and swap out a broken part. To get really high uptime, you can't wait for a human to switch out failed parts. You can't even wait for a human to figure out what went wrong: you have to have previously thought of every possible thing that can possibly go wrong, which is vanishingly improbable. It's the unexpected unexpecteds, not the expected unexpecteds, that kill you.
...Think of it this way: If your six nines system goes down mysteriously just once and it takes you an hour to figure out the cause and fix it, well, you've just blown your downtime budget for the next century. Even the most notoriously reliable systems, like AT&T's long distance service, have had long outages (six hours in 1991) which put them at a rather embarrassing three nines ... and AT&T's long distance service is considered "carrier grade," the gold standard for uptime.

Keeping internet services online suffers from the problem of black swans. Nassim Taleb, who invented the term, defines it thus: "A black swan is an outlier, an event that lies beyond the realm of normal expectations." Almost all internet outages are unexpected unexpecteds: extremely low-probability outlying surprises. They're the kind of things that happen so rarely it doesn't even make sense to use normal statistical methods like "mean time between failure."...
Spolsky cares deeply about customer service; his company's response is impeccable.

Others don't do nearly as well. I have trouble imagining large corporations caring enough to deliver truly reliable service, though the phone companies (for all their many ills) managed it for many years.
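
For anyone who wants to check the "nines" arithmetic in the quote, here is a rough sketch. It assumes a 365-day year and treats "N nines" as an availability of 1 - 10^-N; the function names are mine, not Spolsky's.

```python
import math

# Assumption: a 365-day year, ignoring leap years.
SECONDS_PER_YEAR = 365 * 24 * 3600


def downtime_budget_seconds(nines: int) -> float:
    """Downtime allowed per year at an availability of `nines` nines."""
    unavailability = 10.0 ** (-nines)  # e.g. six nines -> 0.000001
    return SECONDS_PER_YEAR * unavailability


def years_of_budget_consumed(outage_seconds: float, nines: int) -> float:
    """How many years of downtime budget a single outage uses up."""
    return outage_seconds / downtime_budget_seconds(nines)


def achieved_nines(downtime_seconds: float) -> int:
    """Whole nines achieved over one year with this much total downtime."""
    unavailability = downtime_seconds / SECONDS_PER_YEAR
    return math.floor(-math.log10(unavailability))


if __name__ == "__main__":
    # One-hour outage against a six nines target.
    print(downtime_budget_seconds(6))
    print(years_of_budget_consumed(3600, 6))
    # Six hours of downtime in a single year.
    print(achieved_nines(6 * 3600))
```

Under those assumptions six nines allows roughly 31.5 seconds of downtime a year, so a single one-hour outage uses up a bit more than a century of budget, and six hours of downtime in one year works out to three nines. Both figures agree with the quote.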
