Still More Tech Stuff

Caution: Technobabble ahead.

Somewhere along the way I lost a disk in my RAID array on UFies. In theory, it’s running RAID-5 (striping with distributed parity) with 4 disks: three in the RAID array and one spare. Seems all fine and dandy, doesn’t it? And everything on UFies is running just fine, right?
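For anyone wondering why a RAID-5 array can keep running with a disk missing, here is a minimal sketch of the parity idea in Python. It is only an illustration, not how the kernel's md driver is written: the chunk contents and the helper name `xor_blocks` are made up for the example. The parity chunk in a stripe is the XOR of the data chunks, so any one missing chunk can be rebuilt by XORing the survivors.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe on a three-disk array: two data chunks plus one parity chunk.
# (Hypothetical contents, both exactly 8 bytes long.)
data_a = b"UFies da"
data_b = b"ta block"
parity = xor_blocks([data_a, data_b])

# Pretend the disk holding data_b died: rebuild it from the two survivors.
rebuilt_b = xor_blocks([data_a, parity])
print(rebuilt_b == data_b)
```

The "distributed" part just means the parity chunk lands on a different disk from one stripe to the next, so no single disk carries all the parity.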

Well I had noticed some strange messages a while back in the dmesg output:

md0: no spare disk to reconstruct array! -- continuing in degraded mode
[…]
raid5: device hdb4 operational as raid disk 1
raid5: device hda4 operational as raid disk 0
raid5: md1, not all disks are operational -- trying to recover array
raid5: allocated 3291kB for md1
raid5: raid level 5 set md1 active with 2 out of 3 devices, algorithm 0
RAID5 conf printout:
 --- rd:3 wd:2 fd:1
 disk 0, s:0, o:1, n:0 rd:0 us:1 dev:hda4
 disk 1, s:0, o:1, n:1 rd:1 us:1 dev:hdb4
 disk 2, s:0, o:0, n:2 rd:2 us:1 dev:[dev 00:00]

… and so on. Everything ran fine, so I sort of ignored it. I knew about raidhotadd, but I had never used it and wasn’t confident that it wouldn’t simply blow the entire disk away.

A bit of investigation revealed that for some reason disks three and four were not being used at all! I knew about disk four… I had installed it a while back but never added it to the array (sorry fred) due to the above-mentioned fear of screwing everything up. Why the third disk wasn’t used, however, I have no idea. Maybe I set it up wrong in the first place and it has never worked? Could be….

This morning I set up a vmware session, set up several virtual hard disks of 100 megs each, and played around. I was impressed at how well the different RAID tools prevented you from shooting yourself in the foot.

So I crossed my fingers, legs, and toes, said a short prayer to the computer deities (especially the one in charge of people who blow up their servers remotely), and ran:

raidhotadd /dev/md0 /dev/hdc3

And nothing happened. Nothing bad anyway. Nothing crashed, nothing halted, and no one reached through the monitor to smack me and say “you fool!”. I catted /proc/mdstat and saw that instead of:

md0 : active raid5 hdb3[1] hda3[0]
      37142144 blocks level 5, 32k chunk, algorithm 0 [3/2] [UU_]

It now said:

md0 : active raid5 hdc3[3] hdb3[1] hda3[0]
      39061888 blocks level 5, 32k chunk, algorithm 0 [3/2] [UU_]
      [==>………………] recovery = 12.6% (2470008/19530944) finish=91.4min speed=3107K/sec
unused devices:
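The bit worth watching in that output is "[3/2] [UU_]": three devices configured, two currently active, and the third slot marked "_" (missing) until the rebuild finishes. As a rough sketch, assuming the status line always has that shape, a few lines of Python can pull those fields out. The function name `parse_md_status` and the dictionary keys are my own invention, not part of any RAID tool.

```python
import re

def parse_md_status(line):
    """Pull the "[3/2] [UU_]"-style fields out of an mdstat status line."""
    match = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
    if match is None:
        return None
    configured = int(match.group(1))
    working = int(match.group(2))
    flags = match.group(3)
    return {
        "configured": configured,
        "working": working,
        # Slot numbers whose flag is "_", i.e. not (yet) in sync.
        "missing_slots": [i for i, flag in enumerate(flags) if flag == "_"],
        "degraded": working < configured,
    }

# Status line copied from the /proc/mdstat output above.
sample = "39061888 blocks level 5, 32k chunk, algorithm 0 [3/2] [UU_]"
status = parse_md_status(sample)
print(status)
```

Once the recovery completes, the same line should read "[3/3] [UUU]" and the check would report the array as no longer degraded.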

So hopefully in 91.4 minutes or so I’ll be able to look again and see that it’s all there and well. At that point I’ll do the same with the second raid array (/dev/md1) and add in the fourth disk (the spare).
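As a sanity check, the numbers in the recovery line can be reproduced by hand. This is a back-of-the-envelope sketch using only the figures quoted above, and it assumes the block counts are in 1K units so that they line up with the K/sec speed. The variable names are just for illustration.

```python
# Figures copied from the /proc/mdstat recovery line above.
done_blocks = 2470008        # blocks rebuilt so far
total_blocks = 19530944      # blocks per member device
speed_kb_per_sec = 3107      # reported rebuild speed
devices = 3                  # member devices in the RAID-5 array

percent_done = 100 * done_blocks / total_blocks
remaining_min = (total_blocks - done_blocks) / speed_kb_per_sec / 60

# RAID-5 gives up one device's worth of space to parity.
usable_blocks = (devices - 1) * total_blocks

print(f"{percent_done:.1f}% done, about {remaining_min:.1f} min to go")
print(f"{usable_blocks} usable blocks")
```

The percentage comes out at 12.6%, the remaining time lands within a fraction of a minute of the kernel's 91.4min estimate, and the usable size works out to exactly the 39061888 blocks that mdstat now reports.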

In the meantime I think I’ll go and sacrifice a goat or two to the SCSI deity.
