How I Spent My Sunday (Ufies hardware problems)

I’m scared to see what happens if you plug “ufies.org hardware problems” into Google. Here’s how my last Sunday was supposed to be spent. Firefly was working, so basically my plan was something like this:

Sleep in long enough to be rested (done)
Go for run (done)
Have nice relaxing breakfast (done)
Relax some more (done)
Do laundry, dishes, and a bit of relaxing
Towards the evening, make dinner (done)
Eat dinner, relax watching some TV (done)
Get to bed at a decent hour (done)

As has been said before, the best laid plans of mice and system admins go oft awry, and such is what happened to me….

I did actually get to do up to the start of “do laundry” on the above list. I was just getting ready to lug clothes down to the laundry room when I got a message from Fred saying that there was a problem with one of the RAID arrays on Ufies.org. I had lost an element on one of the two arrays about a month ago and it was running on the spare (still had three out of three disks running fine). I figured that that’s pretty sucky, but I could move the spare partition from the first array over to take its place.

The array started reconstructing, but a bit into it something blew up and now I had 1 out of 3 disks in the array (which shouldn’t happen, as RAID5 needs at least 2/3 to run). I found that while the array was still up, and I could see files and directories, I couldn’t copy data around at all (i.e. trying desperately to copy vital information to safety). Surprisingly things like the webserver and database stayed up for a while (guess the files were in cache or memory or something), but eventually SQL errors started kicking up, and things started going down the shitter from there. Around this point I knew that I’d have to go in there and recover things by hand. Fred had a spare 40G disk that I could use as a replacement, so I ran through to Richmond and then into deep downtown Vancouver to the Data Fortress location.
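For anyone wondering why 2 out of 3 is the magic number for RAID5, here is a toy sketch of the parity idea. It is only an illustration with made-up block contents, not how the real RAID driver is written: each stripe stores data blocks plus one XOR parity block, so any single missing block can be recomputed from the others, but two missing blocks cannot.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, position by position."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must be the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a toy 3-disk array: two data blocks plus one parity block.
# The contents are invented for the example.
disk_a = b"mail"
disk_b = b"sql!"
parity = xor_blocks(disk_a, disk_b)

# Lose any ONE disk and the two survivors are enough to rebuild it.
rebuilt_b = xor_blocks(disk_a, parity)
rebuilt_a = xor_blocks(disk_b, parity)

# Lose TWO disks and a single surviving block says nothing about the
# other two, which is why 1 out of 3 is not a state the array can run in.
```

With only one block left there is nothing to XOR it against, so the data on the other two disks is simply gone as far as the array is concerned.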

The reason this all started is that Data Fortress was having their location on the power grid moved, so they had a scheduled power outage around 4am on Sunday morning. They had a login that would allow them to shut down the system properly, but the power off happened early, or late, or something like that. I think that due to this the RAID array had to reconstruct itself (as software RAID has to do after a non-clean power off) and in that reconstruction phase it hit a bad block or something and kicked the disk out of the array.

When I arrived an hour and a half later or so I found an interesting setup. They had a large generator in the parking lot with a long power cable running down the hall and to their offices. From there it branched into a rack of power bar like things which had what looked like 20 or 50 bright yellow extension cords running from it into the server room and plugging into various machines, UPSs and other needed equipment. All the lights were off and there were fans everywhere. Only a few desk lights around the floor provided any light. Because the A/C wasn’t powered the place was very hot, probably 75-80 or so. Oh, and the best part is because the generator was running right outside the door, there was the smell of gas and the taste of exhaust in the air (gotta love breathing carbon monoxide!). Needless to say everyone took breaks periodically to go out and breathe air that wasn’t poisonous 🙂

I started by trying to use Knoppix and then a Mandrake install disk to try to get a hold on what damage was done to the system. Sadly the former didn’t support the Promise PDC20269 controller (or didn’t show the extra hard drives if it did) and the latter (which did detect the drives) had bupkis for tools that were useful for doing any sort of recovery. I knew that the Gentoo install disks would have worked fine, but I didn’t have time to download them ahead of time, and couldn’t find my burns of them.

I ended up booting up the system off the hard drive in single user mode and taking stock of things. The array came back up with 2/3 devices running, but lots of other problems. When I tried adding the spare in again it kicked out to 1/3 again, but this time I could see what had happened. Hardware errors on a completely separate drive (/dev/hdf) came up. I guess the RAID wasn’t detecting that for some reason. So I ripped open the box (yay for no screw entry!) and figured out what drive was the one causing the problems and replaced it. Minor heart attack when the SCSI drive that /home is on didn’t come back up on boot, but after taking the box down again and wiggling power connectors a bit it came back up.

I repartitioned the new drive and then inserted the partitions one by one back into the arrays, watching the reconstruction carefully. Both arrays were soon back to 3/3. A reboot showed no more errors, so I was starting to think that I’d get out of this ok. Then I tried to mount /var from the second RAID array (which had the problems originally).

(paraphrased) “This does not appear to be a valid EXT3 filesystem”

“Oh darn” I said. Actually I think it was more like “shit fuck goddam mother fucker piece of shit kill kill kill kill kill” but I can’t be sure.

I tried to recover from the alternative superblock, but no go. So without much choice I re-created the filesystem on the array (losing everything on it of course). Now, in anticipation of this, after a similar crash some time ago I started doing backups of important files in /var:

/var/mail
/var/spool
/var/lib/dpkg
/var/lib/mailman
/var/lib/apt
/var/lib/jabber
/var/lib/mysql
/var/lib/postgres

Luckily I’ve been doing local backups of these files, so I just had to copy them back onto the newly created /var. Some minor fix-ups to perms and directories so that programs could write out their log files and lock files were next (much thanks to Wim for having his machine running Debian unstable as well, allowing me to copy the layout of files and directories easily). Some packages required me to run apt-get install --reinstall <packagename> to re-populate the information that was lost, but mostly things seem to be ok.
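As a rough sketch of what a local backup of those directories could look like, here is one way to do it in Python. This is not the script actually used on the box; the function name `backup_dirs`, the archive naming and the destination directory are all made up for illustration. The one thing that matters is that the destination lives on a different disk from the array holding /var.

```python
import tarfile
import time
from pathlib import Path

# Directory list taken from the post above.
VAR_DIRS = [
    "/var/mail",
    "/var/spool",
    "/var/lib/dpkg",
    "/var/lib/mailman",
    "/var/lib/apt",
    "/var/lib/jabber",
    "/var/lib/mysql",
    "/var/lib/postgres",
]


def backup_dirs(dirs, dest_dir, stamp=None):
    """Write one gzipped tarball per existing directory into dest_dir.

    Directories that do not exist are skipped. Returns the list of
    archive paths that were created. (Hypothetical helper.)
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    stamp = stamp or time.strftime("%Y%m%d")
    created = []
    for d in dirs:
        src = Path(d)
        if not src.is_dir():
            continue
        # "/var/lib/mysql" becomes "var-lib-mysql-<stamp>.tar.gz"
        name = str(src).strip("/").replace("/", "-")
        archive = dest / f"{name}-{stamp}.tar.gz"
        with tarfile.open(str(archive), "w:gz") as tar:
            tar.add(str(src), arcname=src.name)
        created.append(archive)
    return created
```

Calling something like `backup_dirs(VAR_DIRS, "/home/backups/var")` from cron would do the job, with that path being a placeholder. Tarring up live database directories can catch files mid-write, so a proper database dump is the safer route for /var/lib/mysql and /var/lib/postgres.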

Currently the RAID arrays are back up at 3/3 and after a bit of tweaking of various programs and files everything seems to be running ok. I did lose everything on the FTP server, but luckily the only person that really uses it anymore had a backup and has copied the files back onto it. Data in the database and new mail from between around 7am and 10am were lost, but all in all, not a horrible crash. Goes to show that even RAID isn’t the be-all and end-all of preventing hardware screw ups. I’ll be requesting advice on the best way to update the box sometime in the near future.
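Since the whole mess started with an array quietly running degraded, here is a small sketch of how the "3/3" status could be checked automatically. It assumes Linux software RAID (md), which reports array state in /proc/mdstat with a marker like `[3/2] [U_U]` (total devices, then working devices). The post does not say which tools were used, and the sample text and device names below are invented.

```python
import re

# Sample in the general shape of /proc/mdstat. Device names and block
# counts are made up for illustration.
SAMPLE = """\
Personalities : [raid5]
md0 : active raid5 hdg1[2] hde1[0] hda3[1]
      39061952 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
md1 : active raid5 hdg2[2] hde2[0]
      39061952 blocks level 5, 64k chunk, algorithm 2 [3/2] [U_U]
unused devices: <none>
"""


def array_health(mdstat_text):
    """Map each md array name to (working_disks, total_disks)."""
    health = {}
    current = None
    for line in mdstat_text.splitlines():
        name = re.match(r"(md\d+)\s*:", line)
        if name:
            current = name.group(1)
            continue
        counts = re.search(r"\[(\d+)/(\d+)\]", line)
        if counts and current:
            total, working = int(counts.group(1)), int(counts.group(2))
            health[current] = (working, total)
    return health


def degraded_arrays(mdstat_text):
    """Names of arrays running with fewer working disks than they should."""
    return [name for name, (working, total)
            in array_health(mdstat_text).items() if working < total]
```

On a real box the text would come from reading /proc/mdstat, and a non-empty result from `degraded_arrays` is the cue to swap a disk before a second one goes.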

After a couple of tests to make sure things were running OK I moved the box back, packed up and went home. The power came on around 5pm, so the server room was starting to get cooler again and the guys there were busily moving cables back to where they should be. Aside from the stress of having your box down, the guys there were very nice and it was good to chat with them and geek out about just about everything, including some good ideas about hardware vs software RAID and IDE vs SCSI.

Around 6:30 or 7 I packed up and headed home, glad the traffic wasn’t that bad. Firefly was home but still exhausted from a day of driving and working on the jobsite, so I had supper ready eventually, got laundry done and even managed to get to bed at a decent hour, all on my original list of things to do. I guess in the end I only really substituted “relax” with “work on computer in downtown Vancouver”.
