UFies.org Status (again)

Rebooted to fix the kernel exploit that Silverstr blogged about, but the box never came up. Something to do with the SCSI drive not being detected. sigh I really hate this hardware. A lot. Anyway, Fred is going to go in and see if he can convince things to go again. At least there’s a local mirror of /home now, or should be anyway. Guess the scheduled downtime for tomorrow has moved to today.


Update: Fixed, all back up and fine.


Update #2: The moral of the story is if it ain’t broke, don’t fuck with it. I got both raid arrays up and going just dandy with 3/3 drives operational. However, because I had moved things around at one point, they weren’t set right (and didn’t have spare drives showing up). So I figured, no problem, just remove the element that belongs to array 1 that’s in array 2 and re-add the element that’s supposed to be there.


Kernel oops when I tried that….



md: updating md1 RAID superblock on device
md: hde4 [events: 00000051]<6>(write) hde4's sb offset: 18580864
md: <1>Unable to handle kernel NULL pointer dereference at virtual address 00000f90
[snip]
Stack: f88cd8ee […]
Call Trace: [] [] [] [] []
  [] []
[snip]
md: recovery thread got woken up ...
md1: no spare disk to reconstruct array! -- continuing in degraded mode
md: recovery thread finished ...



So it looks like it removed it ok, but somewhere in there something threw a NULL pointer somewhere. So I’m now waiting for the twenty minutes or so it takes the data-fortress dude to go in and kick the box. I don’t anticipate any problems with it coming back up, but it’s just a pain in the ass. I wonder if /dev/hdb is the real culprit here?


Why can’t I have a system where the hardware is stable?


Update #3: Ok, rebooted, things ok. Added append="panic=30" to the kernel options as Wim suggested (anything other than adding it in lilo.conf and re-running lilo needed?) so that’ll hopefully eliminate this sort of thing in the future.
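
For reference, here is a rough sketch of what that lilo.conf change might look like. The labels and image paths are made-up placeholders, not the real config on this box, and root= and the other existing options are left out for brevity.

```
# /etc/lilo.conf (fragment). Labels and image paths are hypothetical
# placeholders; root= and other existing options are omitted.

# Global kernel option: reboot 30 seconds after a kernel panic.
append="panic=30"

# Keep the known-good kernel as the default entry.
default=stable

image=/boot/vmlinuz-stable
    label=stable
    read-only

image=/boot/vmlinuz-new
    label=new
    read-only
```

As far as I know, nothing beyond re-running lilo is needed, and the option takes effect on the next boot. To try a new kernel just once, lilo -R new (using the placeholder label above) sets the command line for the next boot only, so a panic followed by the automatic reboot should land back on the default entry. The timeout can also be set on a running kernel through /proc/sys/kernel/panic.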

12 Comments on “UFies.org Status (again)”

  1. Hasn’t the hardware basically been all replaced now, as a result of all the upgrades/etc?

  2. Yup, hence the frustration. The last bad thing was due to a bad hard drive (replaced) but then something went wrong with that one, which makes me think that it’s due to a cable or controller. This downtime was just a result of a kernel module not loading properly, or doing strange things (loading, but reiserfsck complaining that it couldn’t find module_major_8 or something like that). Very odd. I really am looking forward to rebuilding from scratch RSN. I think running Debian unstable isn’t helping things. I think Gentoo is my next step.

  3. Yeah… I’ve been having problems with unstable where /var/lib/dpkg/available is getting typos/corruption in it. Fixable by vi, but not good. Generally, unstable is ok for me, but periodically dependency problems happen because packages depend on packages that have been rebuilt but haven’t been uploaded to the servers yet.
    I’m running testing on most of my servers, with the odd package (like Postfix2) from unstable.
    For kernel updates, I’ve started to get into the habit of using the -R argument for lilo, in combination with append="panic=30" in lilo.conf… so that if something bails, it’ll revert. If success, then set the default to the new kernel and rerun lilo. If !success, fix until it works 🙂

  4. I got that tip from http://trilldev.sourceforge.net/files/remotedeb.html
    This page: http://www.faqs.org/docs/Linux-HOWTO/BootPrompt-HOWTO.html
    says
    “In the unlikely event of a kernel panic (i.e. an internal error that has been detected by the kernel, and which the kernel decides is serious enough to moan loudly and then halt everything), the default behaviour is to just sit there until someone comes along and notices the panic message on the screen and reboots the machine. However if a machine is running unattended in an isolated location it may be desirable for it to automatically reset itself so that the machine comes back on line. For example, using panic=30 at boot would cause the kernel to try and reboot itself 30 seconds after the kernel panic happened. A value of zero gives the default behaviour, which is to wait forever. ”
    So, if an OOPS is a panic, then yes it’ll reboot when an OOPS happens…

  5. One other thing… that -R/panic trick doesn’t really work when remotely upgrading a system from kernel 2.4 to 2.6, because 2.6 requires those new module tools. So if you install the new tools in /sbin, boot to 2.6, it dies, lilo goes back to 2.4, then 2.4 won’t load any modules because /sbin/insmod is for 2.6 and not 2.4. Bah…

  6. Server… I’ve updated 3 servers so far to 2.6.0.
    I don’t have any desktops at the moment 🙁
    I’ve been kind of surprised at how stable it is; typically .0 releases are to be avoided.
    The only problem I had with compiling 2.6 is that the IDE stuff wouldn’t compile as modules… had to compile it in 🙁