Tonight I went to do the simple task of adding some more memory to the one production Linux box for which I am fully responsible (currently running CentOS 3.9). It was to be a simple addition of 2x1GB sticks, bumping this poor machine up from 512MB of RAM. I figured it would be pretty quick and painless. I even had no problems with the normally very temperamental fingerprint scanner at the data center; it worked on the first try. I was off to a good start.
I did have to fight with some cables to get a keyboard and monitor hooked up once I got into the rack, but that was expected. It's made worse because this particular system isn't as deep as the ones above and below it, which makes plugging things in very difficult. What I didn't expect, though, was what I saw once the monitor was connected: it appeared that one of the drives in the software RAID 1 I have set up had failed.
I don't remember the exact errors but since the system was still running and had been for quite a while, I wasn't really worried about downtime. I was just annoyed that I would have to make another late night or weekend trip out to the data center. So I upgraded the memory and made sure everything else was back up before looking into the RAID failure.
The first thing I checked was /proc/mdstat, which contains "the current information for multiple-disk, RAID configurations".
[root@host raidinfo]# cat /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
Event: 3
md0 : active raid1 sdb1[1] sda1[0]
104320 blocks [2/2] [UU]
md2 : active raid1 sda2[0]
1052160 blocks [2/1] [U_]
md1 : active raid1 sdb3[1] sda3[0]
76967296 blocks [2/2] [UU]
unused devices: <none>
So there it is: md2, which is made up of the partitions sda2 and sdb2, is missing sdb2. Fortunately there is just an underscore there, meaning it is just not connected; an F would mean it had failed. I was also pretty sure that md2 was the swap partition, which is much less of a big deal and makes the most sense. After the upgrade from 512MB of RAM to 2.5GB, the box was idling at 754MB used, so before the upgrade it must have been using that swap partition A LOT!
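The [2/1] [U_] notation is pretty terse. If mdadm happens to be installed alongside the old raidtools on a box like this (I didn't check at the time), it gives a more readable per-disk view of the same thing:
mdadm --detail /dev/md2   # only if mdadm is actually installed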
To double-check, I looked at what was mounted on each md device:
[root@liquidcs raidinfo]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/md1 73G 21G 49G 30% /
/dev/md0 99M 85M 8.9M 91% /boot
none 1.3G 0 1.3G 0% /dev/shm
Knowing that I only have root, boot, and swap partitions, I knew for sure by process of elimination that I was dealing with the swap.
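Had I wanted to confirm it directly instead of by elimination, swap devices have a listing of their own, and /dev/md2 should be the only thing in it:
cat /proc/swaps
(swapon -s prints the same information.)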
So next I wanted to see what the system logs had to say about md2:
[root@host raidinfo]# cat /var/log/message* | grep md2
Jul 22 22:19:36 host kernel: md: created md2
Jul 22 22:19:36 host kernel: md2: removing former faulty sdb2!
Jul 22 22:19:36 host kernel: md2: max total readahead window set to 124k
Jul 22 22:19:36 host kernel: md2: 1 data-disks, max readahead per data-disk: 124k
Jul 22 22:19:36 host kernel: raid1: md2, not all disks are operational -- trying to recover array
Jul 22 22:19:36 host kernel: raid1: raid set md2 active with 1 out of 2 mirrors
Jul 22 22:19:36 host kernel: md2: no spare disk to reconstruct array! -- continuing in degraded mode
Jul 22 22:19:36 host kernel: md: md2 already running, cannot run sdb2
Jul 22 22:19:37 host kernel: md: md2 already running, cannot run sdb2
Jul 5 07:01:21 host kernel: md2: no spare disk to reconstruct array! -- continuing in degraded mode
Jun 4 06:46:44 host kernel: md: created md2
Jun 4 06:46:44 host kernel: md2: max total readahead window set to 124k
Jun 4 06:46:44 host kernel: md2: 1 data-disks, max readahead per data-disk: 124k
Jun 4 06:46:44 host kernel: raid1: raid set md2 active with 2 out of 2 mirrors
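None of that says why sdb2 dropped out in the first place, only that md2 has been coming up degraded. Grepping the logs for the disk itself rather than the array is the obvious next step to look for underlying I/O errors; something along these lines, though I haven't actually gone digging yet:
grep -h sdb /var/log/messages* | grep -iE 'error|fail'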
Judging by the dates, it had failed over a month ago, ugh. Again, at least it was only swap, and I am still not sure what happened. Restoring it turned out to be a simple task: I just ran raidhotadd to add the partition back to the array and waited about a minute. (This was just a small swap partition; another guy at the data center chatted with me for a few minutes while I was there, as he was waiting three more hours for a 500GB RAID array to finish rebuilding.)
[root@host raidinfo]# raidhotadd /dev/md2 /dev/sdb2
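For what it's worth, raidhotadd comes from the old raidtools package this box still uses; on a newer distro with mdadm, the equivalent step would presumably be:
mdadm /dev/md2 --add /dev/sdb2   # mdadm equivalent, not what I ran here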
[root@host raidinfo]# cat /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
Event: 5
md0 : active raid1 sdb1[1] sda1[0]
104320 blocks [2/2] [UU]
md2 : active raid1 sdb2[2] sda2[0]
1052160 blocks [2/1] [U_]
[>....................] recovery = 1.8% (20088/1052160) finish=0.8min speed=20088K/sec
md1 : active raid1 sdb3[1] sda3[0]
76967296 blocks [2/2] [UU]
unused devices: <none>
[root@host raidinfo]# cat /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
Event: 5
md0 : active raid1 sdb1[1] sda1[0]
104320 blocks [2/2] [UU]
md2 : active raid1 sdb2[2] sda2[0]
1052160 blocks [2/1] [U_]
[=================>...] recovery = 88.3% (930124/1052160) finish=0.1min speed=10303K/sec
md1 : active raid1 sdb3[1] sda3[0]
76967296 blocks [2/2] [UU]
unused devices: <none>
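Rather than re-running cat by hand, watch (standard on pretty much any distro) makes it easier to follow the rebuild:
watch -n 5 cat /proc/mdstat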
And then just over a minute later, we are back in business!!
[root@host raidinfo]# cat /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
Event: 6
md0 : active raid1 sdb1[1] sda1[0]
104320 blocks [2/2] [UU]
md2 : active raid1 sdb2[1] sda2[0]
1052160 blocks [2/2] [UU]
md1 : active raid1 sdb3[1] sda3[0]
76967296 blocks [2/2] [UU]
unused devices: <none>
Crisis avoided, thanks to the RAID HowTo guide and this page on recovering from the failure. I really should set up some monitoring and brush up on my RAID knowledge.
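As a first stab at the monitoring part, even a dumb cron job that greps /proc/mdstat for a missing member and mails the output would have caught this a month earlier. A rough sketch (untested here, and it assumes mail sent to root on this box actually gets read):
#!/bin/sh
# check-md.sh - mail /proc/mdstat to root if any array reports a missing
# member, i.e. an "_" inside the [UU] status field.
if egrep -q '\[[U_]*_[U_]*\]' /proc/mdstat; then
    mail -s "RAID degraded on `hostname`" root < /proc/mdstat
fi
Dropped into root's crontab with something like */15 * * * * /root/check-md.sh, it would nag every fifteen minutes until the array is healthy again. mdadm's --monitor mode is the more proper way to do this, but the grep version works even on a raidtools-only box like this one.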