Sunday, March 12, 2006

Blog Problems

The gods must be crazy. Or at least all pissed off at me. My evidence for this: my blog’s server had a meltdown around noon on Saturday, and the problem wasn’t the usual kind of thing, where a single failure occurred. No, we couldn’t have a normal kind of problem with our server. Of course not. We had to have a special sort of problem…

The first symptom was easy enough to observe: the blog stopped responding. Disappeared into the proverbial black hole. Some quick testing with the old standby “ping” utility confirmed that the server wasn’t answering on the network at all; whatever else was wrong, the operating system wasn’t up and answering. At this point I couldn’t tell whether the server hardware itself was operating or not.
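If you’re curious what that quick check amounts to when you script it instead of typing ping by hand, here’s a minimal sketch in Python. The host name and timeout are made up for illustration, and the flags shown are the Linux ping flags (other platforms differ slightly):

    import subprocess

    def host_is_reachable(host, timeout_sec=2):
        """Send one ICMP echo request via the system ping utility.

        Returns True if the host answered, False otherwise. "-c 1" means
        one packet; "-W" is the per-packet timeout in seconds (Linux ping;
        the flags differ on other platforms).
        """
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_sec), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0

    # "blog.example.com" is a placeholder, not the real server's address.
    if not host_is_reachable("blog.example.com"):
        print("No answer to pings -- time to worry.")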

If you followed my adventures in remote hosting in December and January, you know that I built a cabinet with remote reset and remote power-cycling capability, specifically to deal with situations like this. So with good cheer, and thanking myself for my foresight, I tried a remote reset of the server.

Nothing happened.

OK. Well, that’s why we also built the remote power cycling, because sometimes hardware gets “hung up” in funny modes that only a full power cycle can cure. But that’s rare, so I’m beginning to worry at this point. Tried the power cycle.

Nothing happened.

Oops.

Now this looks bad, and most likely it means an outright hardware failure. Or, just maybe, it means that the remote reset and/or power-cycling stuff I built didn’t work. And my server is a 40-minute drive from my house, where all my tools and spare parts are. Sigh. Off I go, “down the hill” as we say, out of Lawson Valley, through Jamul, and off to my remote hosting site (in my friend’s basement). Time to grab the server, take it back home, and hook the sucker up to a test environment.

Back home on the test bench, I powered the server up. Absolutely nothing happened. No video, no POST beeps. Something quite fundamental — power supply, CPU, etc. — must be broken. Took all the daughter boards out (except video), disconnected all the drives, etc., and tried booting again. This time I started getting really weird symptoms — I could see it starting to boot, then at random points failing and restarting. If I left it in this mode long enough, eventually the boot sequence would get to the point where it figured out that the drives were missing — but even then, after some random interval it would fail again. OK. This was looking like either (a) some noise or an intermittent failure in the power supply, or (b) a less likely motherboard or CPU problem. Betting on the power supply, I drove back down the hill to a computer store, picked up a new 350W power supply (for $39!), and drove back home to install it. That took two hours, almost entirely for the driving.

With the new power supply in place, the server booted very reliably. OK, I must have guessed correctly! Yippee! I figured I could just put it all back together and everything would be wonderful.

Oops. When I plugged everything back in and rebooted, I got an alarming message from my RAID board during power-up: one of my hard drives was either disconnected or not operable. And it wasn’t disconnected, I quickly figured out — it was really dead. Sheesh! Two problems at the same time!! This is very unlikely to be a random coincidence — much more likely is that either the power supply problem caused the drive to fail, or the drive problem caused the power supply to fail. If it’s the former, then I can most likely look forward to other early failures, as the most likely scenario there is that some voltage ran out-of-spec high, stressing everything connected to it. Not good. So I’m hoping it’s the latter case — that the hard drive failed in some bizarre manner that took out the power supply (though this was a high-end power supply that really should be immune to anything the load might do to it). But since I had no spare disk drives of the right type lying about, I had to go down the hill again, this time to Fry’s Electronics, to pick up a new drive and install it. After I did that, the server booted, rebuilt its RAID array, and started up normally. Hooray! And this morning I reinstalled it at the hosting site, which is why you can read this…
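If, like me, you’re worried about follow-on failures after a scare like this, one cheap bit of insurance is to check the new drive’s SMART health report every so often. Here’s a rough sketch of what that could look like; it assumes a Linux box with the smartmontools package installed and a drive at /dev/sda, neither of which is an actual detail of my server:

    import subprocess

    def drive_health_ok(device="/dev/sda"):
        """Ask smartctl (from smartmontools) for the drive's overall
        SMART health self-assessment and report whether it passed."""
        result = subprocess.run(
            ["smartctl", "-H", device],
            capture_output=True,
            text=True,
        )
        # smartctl prints a line like "SMART overall-health self-assessment
        # test result: PASSED" when the drive thinks it's healthy.
        return "PASSED" in result.stdout

    if not drive_health_ok("/dev/sda"):
        print("SMART thinks this drive is failing -- back it up and replace it.")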

Though this was a real pain in the neck sort of experience, I can’t help but think how amazing it is that in America, even someone like me living “out in the boonies” can get fast and inexpensive access to some of the most advanced technology in the world (hard disks, if you’re not already aware of it, are chock full of things that were in the realm of science fiction just a few years ago). This entire event, from failure to repair, took less than 8 hours (the server was down longer only because I didn’t want to wake my friend late Saturday night or early Sunday morning). In another country, it might take weeks to get such components — and I’d have paid a lot more for them. Here, even late on a Saturday evening, I just walked into a store (have you ever been in Fry’s Electronics? It’s quite an experience if you’re in the least geekly…), picked what I wanted from a large selection, and drove back home with my purchase. Think of all the things that had to happen for me to be able to do that: the scientists and engineers who invented the technology, the operations folks who took that technology from a lab to high-volume production, the distribution channels that got it to San Diego, etc. All of that so that my silly blog server could be repaired quickly and easily. Pretty close to miraculous, I’d say…
