Tonight I went to do the simple task of adding some more memory to the one production Linux box for which I am fully responsible (currently running CentOS 3.9). It was to be a simple addition of 2x1GB sticks, bumping this poor machine up from 512MB of RAM. I figured it would be pretty quick and painless. I even had no problems with the normally very temperamental fingerprint scanner at the data center; it worked on the first try. I was off to a good start.
I did have to fight with some cables to get a keyboard and monitor hooked up once I got into the rack, but that was expected. It's made worse because this particular system isn't as deep as the ones above and below it, which makes plugging things in very difficult. What I didn't expect, though, was what I saw once the monitor was connected: it appeared that one of the drives in the software RAID 1 I have set up had failed.
I don't remember the exact errors but since the system was still running and had been for quite a while, I wasn't really worried about downtime. I was just annoyed that I would have to make another late night or weekend trip out to the data center. So I upgraded the memory and made sure everything else was back up before looking into the RAID failure.
The first thing I checked was /proc/mdstat, which contains "the current information for multiple-disk, RAID configurations".
[root@host raidinfo]# cat /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
Event: 3
md0 : active raid1 sdb1[1] sda1[0]
104320 blocks [2/2] [UU]
md2 : active raid1 sda2[0]
1052160 blocks [2/1] [U_]
md1 : active raid1 sdb3[1] sda3[0]
76967296 blocks [2/2] [UU]
unused devices: <none>
So there it is: md2, which is made up of the partitions sda2 and sdb2, is missing sdb2. Fortunately there is just an underscore there, meaning it is just not connected; an F would mean it had failed. I was also pretty sure that md2 was the swap partition, which is much less of a big deal and makes the most sense. After the upgrade from 512MB of RAM to 2.5GB, the box was idling at 754MB used, so before the upgrade it must have been using that swap partition A LOT!
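The [2/1] [U_] notation is pretty terse. If mdadm happens to be installed alongside the old raidtools on a box like this (I didn't check at the time), it gives a more readable per-disk view of the same thing:
mdadm --detail /dev/md2   # only if mdadm is actually installed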
To double-check, I looked at what was mounted on each md device:
[root@liquidcs raidinfo]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/md1 73G 21G 49G 30% /
/dev/md0 99M 85M 8.9M 91% /boot
none 1.3G 0 1.3G 0% /dev/shm
Knowing that I only have root, boot, and swap partitions, I knew for sure by process of elimination that I was dealing with the swap.
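Had I wanted to confirm it directly instead of by elimination, swap devices have a listing of their own, and /dev/md2 should be the only thing in it:
cat /proc/swaps
(swapon -s prints the same information.)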
So next I wanted to see what the system logs had to say about md2:
[root@host raidinfo]# cat /var/log/message* | grep md2
Jul 22 22:19:36 host kernel: md: created md2
Jul 22 22:19:36 host kernel: md2: removing former faulty sdb2!
Jul 22 22:19:36 host kernel: md2: max total readahead window set to 124k
Jul 22 22:19:36 host kernel: md2: 1 data-disks, max readahead per data-disk: 124k
Jul 22 22:19:36 host kernel: raid1: md2, not all disks are operational -- trying to recover array
Jul 22 22:19:36 host kernel: raid1: raid set md2 active with 1 out of 2 mirrors
Jul 22 22:19:36 host kernel: md2: no spare disk to reconstruct array! -- continuing in degraded mode
Jul 22 22:19:36 host kernel: md: md2 already running, cannot run sdb2
Jul 22 22:19:37 host kernel: md: md2 already running, cannot run sdb2
Jul 5 07:01:21 host kernel: md2: no spare disk to reconstruct array! -- continuing in degraded mode
Jun 4 06:46:44 host kernel: md: created md2
Jun 4 06:46:44 host kernel: md2: max total readahead window set to 124k
Jun 4 06:46:44 host kernel: md2: 1 data-disks, max readahead per data-disk: 124k
Jun 4 06:46:44 host kernel: raid1: raid set md2 active with 2 out of 2 mirrors
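None of that says why sdb2 dropped out in the first place, only that md2 has been coming up degraded. Grepping the logs for the disk itself rather than the array is the obvious next step to look for underlying I/O errors; something along these lines, though I haven't actually gone digging yet:
grep -h sdb /var/log/messages* | grep -iE 'error|fail'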
Judging by the dates, it had failed over a month ago, ugh. Again, at least it was only swap, and I am still not sure what happened. Restoring it turned out to be a simple task: I just ran raidhotadd to add the partition back to the array and waited about a minute. (This was just a small swap partition; another guy at the data center chatted with me for a few minutes while I was there, as he was waiting three more hours for a 500GB RAID array to finish rebuilding.)
[root@host raidinfo]# raidhotadd /dev/md2 /dev/sdb2
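For what it's worth, raidhotadd comes from the old raidtools package this box still uses; on a newer distro with mdadm, the equivalent step would presumably be:
mdadm /dev/md2 --add /dev/sdb2   # mdadm equivalent, not what I ran here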
[root@host raidinfo]# cat /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
Event: 5
md0 : active raid1 sdb1[1] sda1[0]
104320 blocks [2/2] [UU]
md2 : active raid1 sdb2[2] sda2[0]
1052160 blocks [2/1] [U_]
[>....................] recovery = 1.8% (20088/1052160) finish=0.8min speed=20088K/sec
md1 : active raid1 sdb3[1] sda3[0]
76967296 blocks [2/2] [UU]
unused devices: <none>
[root@host raidinfo]# cat /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
Event: 5
md0 : active raid1 sdb1[1] sda1[0]
104320 blocks [2/2] [UU]
md2 : active raid1 sdb2[2] sda2[0]
1052160 blocks [2/1] [U_]
[=================>...] recovery = 88.3% (930124/1052160) finish=0.1min speed=10303K/sec
md1 : active raid1 sdb3[1] sda3[0]
76967296 blocks [2/2] [UU]
unused devices: <none>
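Rather than re-running cat by hand, watch (standard on pretty much any distro) makes it easier to follow the rebuild:
watch -n 5 cat /proc/mdstat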
And then just over a minute later, we are back in business!!
[root@host raidinfo]# cat /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
Event: 6
md0 : active raid1 sdb1[1] sda1[0]
104320 blocks [2/2] [UU]
md2 : active raid1 sdb2[1] sda2[0]
1052160 blocks [2/2] [UU]
md1 : active raid1 sdb3[1] sda3[0]
76967296 blocks [2/2] [UU]
unused devices: <none>
Crisis avoided, thanks to the RAID HowTo guide and this page on recovering from the failure. I really should set up some monitoring and brush up on my RAID knowledge.
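As a first stab at the monitoring part, even a dumb cron job that greps /proc/mdstat for a missing member and mails the output would have caught this a month earlier. A rough sketch (untested here, and it assumes mail sent to root on this box actually gets read):
#!/bin/sh
# check-md.sh - mail /proc/mdstat to root if any array reports a missing
# member, i.e. an "_" inside the [UU] status field.
if egrep -q '\[[U_]*_[U_]*\]' /proc/mdstat; then
    mail -s "RAID degraded on `hostname`" root < /proc/mdstat
fi
Dropped into root's crontab with something like */15 * * * * /root/check-md.sh, it would nag every fifteen minutes until the array is healthy again. mdadm's --monitor mode is the more proper way to do this, but the grep version works even on a raidtools-only box like this one.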