Gmail doesn’t use RAID

This guy arranged for thousands of email messages to be sent to his Gmail account at once, and surprise, surprise! It fell over. What a jackass. One way to know for sure he’s a jackass is the theory about compression and RAID.

First, Gmail doesn’t use RAID. It uses the Google File System. It is like RAID, but implemented in user space, which lets Google play games with filesystem semantics that you can’t play when you are restricted to one of the Unix filesystems on top of a hardware RAID box. It also leverages the CPU power you get for free when you buy thousands of PCI buses and IDE interfaces. The mainframe guys figured out long ago that the key to throughput is lots of I/O paths. The Google File System lets them use lots of CHEAP I/O paths, instead of the insanely expensive I/O paths that NetApp and Sun sell.
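To make "like RAID, but in user space" a bit more concrete, here is a toy sketch of the general idea: cut the data into chunks and keep several copies of each chunk on different cheap disk servers. This is not Google's actual placement policy. The function name `place_chunks`, the server names, the tiny chunk size, and the hash-based choice of servers are all assumptions made up for illustration.

```python
import hashlib


def place_chunks(data, servers, chunk_size=4, replicas=3):
    """Split data into fixed-size chunks and assign each chunk to
    `replicas` distinct servers. Toy policy, purely illustrative."""
    if replicas > len(servers):
        raise ValueError("need at least as many servers as replicas")

    placement = []
    for offset in range(0, len(data), chunk_size):
        index = offset // chunk_size
        chunk = data[offset:offset + chunk_size]
        # Hypothetical policy: hash the chunk index to pick a starting
        # server, then take the next servers in order (wrapping around).
        digest = hashlib.sha256(str(index).encode()).hexdigest()
        start = int(digest, 16) % len(servers)
        holders = [servers[(start + i) % len(servers)] for i in range(replicas)]
        placement.append((index, chunk, holders))
    return placement


if __name__ == "__main__":
    disk_servers = ["ds1", "ds2", "ds3", "ds4", "ds5"]  # made-up names
    for index, chunk, holders in place_chunks(b"hello world!", disk_servers):
        print(index, chunk, holders)
```

Because all of this is ordinary application code, losing one box costs you one copy of some chunks rather than the data, and every extra box adds another cheap I/O path.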

As an aside: It is hilarious to watch the other email providers use their giga-buck NetApp filers to try to compete with Google on email account size. What little profit they might have been making from ads is going to go right out the window when the ops team has to buy 20 new NetApp filers to meet the demand caused by marketing’s little stunt of trying to keep up with Google.

Second, the compression stuff is bogus. Disk is cheap. Buying, powering, and fixing the millions of CPUs it would take to do all that compression would dwarf the savings from the compression. Not to mention adding latency to the UI, which is something Google prides itself on avoiding.

To Google’s credit, this jackass only managed to make his own account unusable. Oh, and the accounts of the few thousand other people who have parts of their mail spool files stored on the same disk servers as him (and even those people were less affected, since it’s highly unlikely they all depended on precisely the same set of disk servers that Kevin’s account is on). At least he didn’t manage to destroy the inbound mail servers. This is a classic case where you’d expect cascading and amplifying failure. Typically, in a test like this, an overloaded disk server causes an inbound mail server to queue, then fail. When that inbound mail server is taken out of service, its load shifts to the others, another one falls, and so on. So instead of only having one set of disk servers overloaded, you have that and zero remaining inbound capacity. Cascading and amplifying. Two words sysadmins don’t like very much.
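The cascade is easy to see in a toy model. This is a minimal sketch, assuming the inbound load is spread evenly across whatever servers are still alive and that a server fails as soon as its share exceeds its capacity. The function name `simulate_cascade` and all the numbers are made up for illustration; they say nothing about how Gmail is actually provisioned.

```python
def simulate_cascade(num_servers, capacity_per_server, total_load):
    """Return the count of live inbound servers after each failure.

    Assumes one server fails first (say, its queue fills up behind an
    overloaded disk server) and the load is then shared evenly.
    """
    alive = num_servers - 1  # the initial failure
    history = [alive]
    while alive > 0:
        load_per_server = total_load / alive
        if load_per_server <= capacity_per_server:
            break  # the survivors can absorb the load; failure contained
        # Amplifying: each failure raises the load on everyone left,
        # so the next server tips over too.
        alive -= 1
        history.append(alive)
    return history


if __name__ == "__main__":
    # Plenty of headroom: the first failure is absorbed.
    print(simulate_cascade(10, 100, 800))
    # Running close to the limit: one failure takes out everything.
    print(simulate_cascade(10, 100, 950))
```

With headroom, the first call stops at nine live servers. Run near the limit, the second call counts all the way down to zero, which is the "zero remaining inbound capacity" case.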

Of course, as of today, two other words Gmail sysadmins probably don’t like very much are “Kevin” and “Rose”.
