UFies.org Status Report

The UFies server sitting on my desk at home
You have no idea the depth of hatred I feel for computers right now. To make a long and painful story short: Box down for a couple of days. Hardware bad. To make it a bit longer than that and see a picture of the 10 drives stuffed in the huge case (3xSCSI, 4xIDE, 1xIDE (backup drive), 1 floppy and 1 CD), read on.



  • 8:17 – Acquire Tim Hortons Mocha. This is the end of the good things in this story.
  • 9:40 – Arrive at the Data Fortress hosting center
  • 9:40am-2:00pm – Back up all data to a spare 80G hard drive that Fred brought. Then we decided to boot up on a CD and make sure we could still access the drive and that the data was safe (after which the drive was going to be disconnected so that the backup was safe and I was free to do whatever I wanted partitioning and formatting wise). I/we ended up fighting with what appeared to be a bad cable or a bad IDE controller on the motherboard. The symptoms were that when the drive booted up it got about a 5MB/s transfer rate (something I didn’t notice when I was backing data up, other than that it was slower than it normally is), and it caused “drive not ready” and DMA errors when DMA was turned on. In the end I basically said “the data is there, I can deal with a slow drive, guess the IDE channel on the MB is going”, and headed off to lunch.
  • 2:00-3:00pm – Lunch at the food court. Mmm….. sushi…..

  • 3:00-6:00pm – Install Gentoo, originally with the idea that the SCSI drives can now be turned into system drives, and the IDE drives can make up the /home partition. However, the SCSI drives would intermittently freeze the system up solid when mounting a drive on the SCSI partition or when formatting the newly created RAID5 array on the SCSI drives. In the end I said (paraphrased) “oh darn” and decided to go back to IDE for the system, figuring it was something odd in the Gentoo kernel.



    I created the IDE RAID arrays and started the install. Gentoo is a slow install to begin with, but it was slowed even more (I realized this only after) due to the CDROM probably getting the same horribly slow 5MB/s transfer rate that the spare hard drive was getting on the IDE channel on the motherboard (where the CDROM was connected). Slow to copy files, then the slow process of setting up various files, slow to sync the portage tree, slow everything. I’m sure part of it was due to it getting towards the end of the day and me feeling things come down to the wire once again. Compile the kernel, run through its options, make sure things are all good, things are looking up.
  • 6:00-6:30pm – The install is almost done, all that is left is to configure GRUB, compile SSH so I can access the system remotely to continue configuring, and then I can go home! All the time the RAID5 has been resyncing the 70G partition I created. This is because it’s the first time that it’s been set up so that’s what it does. The syncing has slowed down my install due to the file writing in the background and the install has slowed down the resyncing.



    Right at the end (I assume) of the resync the same fucking error message is spit to the screen. DMA error, kicking /dev/hdf out of the RAID array, starting to resync with the hot spare.



    AAAAAAHHHHHHHHHHH!!!!!!



    At this point I considered committing hara-kiri but decided to give the hardware the benefit of the doubt. I had had problems with the drive before, even though it was a new drive, and maybe, just maybe, it was actually a bad drive. I could let it resync in the background and just continue on.



    Last step, configure the boot loader. I run grub and tell it to boot off of /dev/hde1 for me. Grub comes back and tells me “I’m sorry Dave, I can’t do that.” Why not? “Can’t mount the partition.” But the partition is right there? “No it’s not!” I can see it in your command completion! “No you can’t! You’ve had 8 hours of sleep over the last couple of days and just spent 8 hours with your neck craned up at a monitor that’s horribly placed for anyone sitting down, your eyes are starting to go and your hands are shaking, you don’t know what’s going on!” Ok, maybe you’re right.



    So I decide to run fdisk to see if maybe the partition was tagged as something it wasn’t supposed to be. Fdisk came up with an empty partition table and a message about some sort of error that will be corrected by a write.



    WTF? WTFFF???



    Check all the IDE drives (/dev/hd[e-h]): all had an empty partition table and the same error message. All the drives were on a secondary PCI IDE controller, so they shouldn’t have been affected by the (theoretical) bad IDE channel on the motherboard.



    Technically I could recreate the partitions (the scheme was pretty easy) and things would most likely be fine, but at that point we realized that something beyond a bad drive, or a bad cable, or some oddness with the kernel was going on and I packed up and threw the server in my trunk as it’d be easier to deal with at home not under the watchful eye of an employee who is wanting to go home only an hour late.
  • 6:30pm – After 8 hours of completely wasted time, I drove home in the pouring rain, stopping only for $10 in gas and to forget to buy windshield washer fluid.
  • 7:30pm – Arrive home to tea and turkey Caesar salad. Yummy. Guess there was another good thing in the story.

If you read this far I’m impressed at your perseverance. If you skipped to the end to see what the result was after all the bitching was through I’m impressed by your efficiency.


I hope to have the box back up by Monday January 26th in one form or another, be it with the old OS and on borrowed hardware, or on a new motherboard, or something. My sincere apologies to those who rely on the box for mail and websites, I’m working as quickly as I can.

8 Comments on “UFies.org Status Report”

  1. Damn Alan… I just did the exact same thing to llarian.net Wednesday night. Stage1 Gentoo build while trying to keep all the datafiles and such intact and make it as seamless as possible following the downtime. Fortunately, it went pretty damn well for me, I got all the basic stuff back up and running about 12 hours later.
    That really sucks. (For what it’s worth, I am completely in love with Gentoo)

  2. Sir, or some such insincere titular address!
    You have been indicted by the national branch of the S.P.C.C. for tower-stuffing, have been judged thereby as a lower form of Canuck, and should present your sorry arse to Ottawa for a good spanking.
    Signed… (illegible)… on behalf (or befull) of the Society for the Prevention of Cruelty to Computers!

  3. The thing that pisses me off is that I booted up on a Mandrake CD when I got home and none of the strangeness happened there. The things that were happening didn’t seem like a bad burn of the boot disk, but you never know I guess. Maybe something screwed up with the Gentoo boot kernel or something?
    sigh

  4. Back Up

    Well, after yet another ordeal we’re back up, temporarily at least. The intermittent problems that have been plaguing this box…

  5. I had almost the same prob on my win2k server computer. It was running slow, but it had been a while since I used that mobo… so I didn’t remember the speed well. Then it rebooted itself. I replaced the power supply (up to 300W from 250 as I now have 3 HDDs [2 are SCSI], a CD-ROM, and a SCSI CD-R), and it seemed even slower, then it bluescreened.
    I rebooted it and it bluescreened on boot. Ran the repair, it bluescreened on boot.
    Then I wiped out the OS partition and tried to reinstall it. The copying files part of the install took over 4 hours. I was suspicious now of the drive/mobo.
    I moved the drive to my other system and got SMART drive errors on boot. I copied the MP3s off it from the 2nd partition then got the drive replaced under warranty.
    When I got it back I installed 2k on it again and it has worked like I remembered..nice and quick..even for a PII-266 with 224MB RAM. Gotta love when you don’t install 3/4 of the crap that it wants to. Only the OS, Terminal Services, File Sharing, and antivirus.
    Only thing is that now I suspect the main drive in my good system (120GB) might be damaged… good thing it is under warranty still.

  6. Server back up.. kinda

    Alan had some difficulties with the rebuilt server, so we are limping along with some hardware from Neil until he rebuilds it. You can read more about the ordeal here. Thanks for all the hard work Alan… and thanks for the loan of hardware Neil! The n…