Tuesday, January 1, 2013

Apparent Multi-Disk Failures...

Seven or eight years ago I worked for FutureTrade (a now-defunct electronic stock and options trading company).  I was their CTO, and our datacenter (our very own, not a co-lo) was one of my responsibilities.  The 300+ servers in that datacenter actually executed our electronic trades, so there was a lot of money riding on their proper functioning.

Most of the servers used their disks only for booting and logging, and those just had locally attached disks in a mirrored configuration.  Our database servers, however, had fairly large storage requirements for the day (several hundred GB), and their performance was disk-bound.  So those servers used attached RAID5 subsystems.

One fine day, fairly early in my tenure there, we had a disk failure in one of those arrays.  The failure was picked up by our monitoring software, and we scheduled a replacement and rebuild (a standard RAID5 activity) for that evening, after trading hours.  When we ran the rebuild, though, we got non-recoverable errors across a total of three drives (including the one that originally failed).  WTF?  Three drives failing simultaneously?  That seemed so unlikely that I dismissed it out of hand; there had to be another explanation.

One possibility was that the RAID controller itself (or its power supply) had failed, not the drives.  So we swapped the drives into another controller - same three drives reported bad.

Another possibility was that two of the three disks had actually failed some time ago and we simply never noticed – and that turned out to be the real situation.  If bad blocks develop on rarely read sections of a disk, the RAID controller won't learn about them until the next time the entire disk is read end to end – which is exactly what happens during a rebuild operation.  Bingo.
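
A quick back-of-envelope calculation shows why a rebuild is exactly the wrong time to discover these problems.  The sketch below is mine, not a figure from the incident: the unrecoverable-read-error rate, disk count, and disk size are illustrative assumptions (roughly consumer-class drives of that era), just to show the shape of the math.

```python
# Back-of-envelope sketch: odds that a RAID5 rebuild reads every surviving
# bit without hitting an unrecoverable read error (URE).  The URE rate and
# disk sizes below are illustrative assumptions, not figures from the post.
import math

URE_PER_BIT = 1e-14        # common consumer-drive spec of that era (assumed)
SURVIVING_DISKS = 4        # disks that must be read in full during a rebuild
DISK_BYTES = 250e9         # ~250 GB each (assumed)

bits_to_read = SURVIVING_DISKS * DISK_BYTES * 8
p_clean_rebuild = math.exp(-URE_PER_BIT * bits_to_read)

print(f"chance a rebuild completes cleanly: {p_clean_rebuild:.1%}")
# ~92% -- and that's with every disk meeting spec; latent bad blocks sitting
# undetected on rarely read regions make the real odds considerably worse.
```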

So we then set up a periodic disk scan on all our RAID5 systems (we had six of them).  To our surprise, on the first such scan all five of the systems that had not yet failed turned out to have bad blocks on their disks – bad blocks that would have caused a rebuild operation to fail.  We ended up replacing every disk in all six RAID5 systems, swapping the old consumer-grade disks for enterprise-quality ones (and paying a pretty penny for them!), and then running the full-disk scans on a weekly schedule.  That ended our multi-disk failures, but it was a fair amount of work to maintain.
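
For what it's worth, this kind of periodic "scrub" is easy to automate on today's Linux software RAID (md).  Our hardware controllers used their vendor's own verify/patrol-read tooling, so the sketch below is an analogy rather than what we actually ran; the device name md0 is just a placeholder.

```python
# Minimal sketch of a weekly scrub for a Linux software-RAID (md) array.
# The hardware RAID controllers in the story would use their vendor's own
# "patrol read" / "verify" tooling instead; device name "md0" is an example.
import pathlib
import time

MD = pathlib.Path("/sys/block/md0/md")

def start_scrub() -> None:
    # Ask the md driver to read-verify every block in the array, which
    # surfaces latent bad blocks before a rebuild ever depends on them.
    (MD / "sync_action").write_text("check\n")

def wait_for_scrub(poll_seconds: int = 60) -> None:
    # sync_action reads back "idle" once the check pass has finished.
    while (MD / "sync_action").read_text().strip() != "idle":
        time.sleep(poll_seconds)

def mismatches() -> int:
    # A non-zero mismatch_cnt after a check means the scrub found
    # inconsistencies that deserve investigation.
    return int((MD / "mismatch_cnt").read_text().strip())

if __name__ == "__main__":
    start_scrub()
    wait_for_scrub()
    print(f"scrub complete, mismatches: {mismatches()}")
```

Run from cron (or a systemd timer) once a week, this does the same job our manual full-disk scans did.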

This post triggered my memory of the incident; the author does a fine job of explaining how these apparent multi-disk failures can occur...
