Thursday, March 03, 2005

Creating a reliable system out of unreliable components: Google and massive redundancy

Google's secret of success? Dealing with failure | CNET

For Google, reliability is an emergent property of the system. An individual cell (a computer) is not that robust. Cells die and are taken out of commission every day. The system, however, is very robust.

Sound familiar?

That's how your body works. Individual cells are not all that reliable. They are constantly mutating, breaking down, getting infected, becoming immortal (bad). The human body, however, is reasonably reliable -- we don't crash every day. Reliability is an emergent property of large numbers of unreliable components.
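A back-of-the-envelope sketch of why this works, under two big assumptions of mine: components fail independently, and the system stays up as long as at least one component is up. The function name and the 90% figure are made up for illustration; they are not Google's numbers.

```python
def system_availability(p_component: float, n: int) -> float:
    """Probability that at least one of n independent components is up.

    Assumes independent failures and that one survivor is enough,
    which real systems only approximate.
    """
    if not 0.0 <= p_component <= 1.0:
        raise ValueError("p_component must be a probability")
    if n < 1:
        raise ValueError("n must be at least 1")
    # The system is down only when every component is down at once.
    return 1.0 - (1.0 - p_component) ** n


if __name__ == "__main__":
    # Illustrative only: each component is up 90% of the time.
    for n in (1, 2, 3, 5):
        print(n, system_availability(0.9, n))
```

Each extra component multiplies the chance of total failure by another 0.1, so a handful of mediocre parts adds up to a system that almost never goes down. Correlated failures (shared power, shared bugs) break the independence assumption and make the real number worse.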

Google wasn't the first enterprise to produce reliability through redundancy. The space shuttle flies with five general-purpose computers. Four are identical and run the same software; the fifth runs independently written backup software. The four primary computers compare their outputs and outvote any one that disagrees.
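The idea of redundant computers checking each other's outputs can be sketched as a simple majority vote. This is a toy illustration, not how the shuttle's flight software works; `majority_vote` and the sample values are hypothetical.

```python
from collections import Counter


def majority_vote(outputs):
    """Return (value, dissenters) for a list of replica outputs.

    value is the output reported by a strict majority of replicas;
    dissenters lists the indexes of replicas that disagreed.
    Raises ValueError when no strict majority exists.
    Outputs must be hashable.
    """
    if not outputs:
        raise ValueError("no outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among replica outputs")
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters


if __name__ == "__main__":
    # Four replicas, one of which has gone wrong.
    value, dissenters = majority_vote([42, 42, 41, 42])
    print(value, dissenters)
```

The system trusts the agreed value and can flag the dissenting replica for removal, which is the same pattern as taking a dead cell out of commission.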

What's the lesson for home? I'm not sure. I need to think about that one. System reliability (phone, PDA, server, desktop, laptop, iPod ...) is a big headache in our household. I know I need more reliable and less troublesome tools.
