My software RAID 1 swap partition failed! - Adventures in Switching to Linux

Tuesday, July 22, 2008

My software RAID 1 swap partition failed!

Tonight I went to do the simple task of adding some more memory to the one production Linux box for which I am fully responsible (currently running CentOS 3.9). It was to be a simple addition of 2x1GB sticks, bumping this poor machine up from 512MB of RAM. I figured it would be quick and painless. I even had no problems with the normally very temperamental fingerprint scanner at the data center; it worked on the first try. I was off to a good start.

I did have to fight with some cables to get a keyboard and monitor hooked up once I got into the rack, but that was expected. It's made worse because this particular system isn't as deep as the ones above and below it, which makes plugging things in very difficult. What I didn't expect, though, was what I saw when I got the monitor connected: it appeared that one of the drives in the software RAID 1 I have set up had failed.

I don't remember the exact errors but since the system was still running and had been for quite a while, I wasn't really worried about downtime. I was just annoyed that I would have to make another late night or weekend trip out to the data center. So I upgraded the memory and made sure everything else was back up before looking into the RAID failure.

The first thing I checked was /proc/mdstat, which holds "the current information for multiple-disk, RAID configurations".

[root@host raidinfo]# cat /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
Event: 3
md0 : active raid1 sdb1[1] sda1[0]
104320 blocks [2/2] [UU]

md2 : active raid1 sda2[0]
1052160 blocks [2/1] [U_]

md1 : active raid1 sdb3[1] sda3[0]
76967296 blocks [2/2] [UU]

unused devices: <none>
So there it is: md2, which is made up of the partitions sda2 and sdb2, is missing sdb2. Fortunately there is just an underscore there, meaning the member is simply missing from the array; an F next to the device would mean the kernel had marked it as failed. I was pretty sure that md2 was also the swap partition, which is much less of a big deal and makes the most sense. After the upgrade from 512MB of RAM to 2.5GB, the box was idling at 754MB used, so before the upgrade it must have been using that swap partition A LOT!
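The underscore notation lends itself to a quick mechanical check. A minimal sketch (the grep pattern and pipeline here are my own, not from the original session): any status field like [U_] or [_U] contains an underscore, while a healthy mirror shows [UU].

```shell
# Feed the md2 status line from above through grep; the pattern matches
# a bracketed run of U's and underscores that contains at least one
# underscore (i.e. a missing member).
printf '1052160 blocks [2/1] [U_]\n' | grep -c '\[[U_]*_[U_]*\]'
# prints 1 (the line matches, so the array is degraded); a healthy
# "[2/2] [UU]" line would not match.
```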

To double check, I checked to see what was mounted on each virtual disk:
[root@host raidinfo]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/md1 73G 21G 49G 30% /
/dev/md0 99M 85M 8.9M 91% /boot
none 1.3G 0 1.3G 0% /dev/shm
Knowing that I have only root, boot, and swap partitions, process of elimination told me for sure that I was dealing with swap.
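Process of elimination works, but the kernel also lists active swap devices directly in /proc/swaps (or via swapon -s). A sketch against simulated output shaped like what this box would give (the sizes are illustrative, not from the original post):

```shell
# /proc/swaps lists every active swap device; on this box the only
# entry would be /dev/md2. Simulated output for illustration:
sample='Filename                        Type            Size    Used    Priority
/dev/md2                        partition       1052152 0       -1'
# Skip the header line and print the device column:
echo "$sample" | awk 'NR > 1 { print $1 }'
# prints /dev/md2
```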

So next I wanted to see what the system logs had to say about md2:
[root@host raidinfo]# cat /var/log/message* | grep md2
Jul 22 22:19:36 host kernel: md: created md2
Jul 22 22:19:36 host kernel: md2: removing former faulty sdb2!
Jul 22 22:19:36 host kernel: md2: max total readahead window set to 124k
Jul 22 22:19:36 host kernel: md2: 1 data-disks, max readahead per data-disk: 124k
Jul 22 22:19:36 host kernel: raid1: md2, not all disks are operational -- trying to recover array
Jul 22 22:19:36 host kernel: raid1: raid set md2 active with 1 out of 2 mirrors
Jul 22 22:19:36 host kernel: md2: no spare disk to reconstruct array! -- continuing in degraded mode
Jul 22 22:19:36 host kernel: md: md2 already running, cannot run sdb2
Jul 22 22:19:37 host kernel: md: md2 already running, cannot run sdb2
Jul 5 07:01:21 host kernel: md2: no spare disk to reconstruct array! -- continuing in degraded mode
Jun 4 06:46:44 host kernel: md: created md2
Jun 4 06:46:44 host kernel: md2: max total readahead window set to 124k
Jun 4 06:46:44 host kernel: md2: 1 data-disks, max readahead per data-disk: 124k
Jun 4 06:46:44 host kernel: raid1: raid set md2 active with 2 out of 2 mirrors

So it had failed over a month ago. Ugh. Again, at least it was swap. I am still not sure what happened, though. Restoring it turned out to be a simple task: I just ran raidhotadd to add the partition back to the array and waited about a minute. (This was just a small swap partition. Another guy at the data center chatted with me for a few minutes while I was there, as he was waiting three more hours for a 500GB RAID array to finish rebuilding.)

[root@host raidinfo]# raidhotadd /dev/md2 /dev/sdb2
[root@host raidinfo]# cat /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
Event: 5
md0 : active raid1 sdb1[1] sda1[0]
104320 blocks [2/2] [UU]

md2 : active raid1 sdb2[2] sda2[0]
1052160 blocks [2/1] [U_]
[>....................] recovery = 1.8% (20088/1052160) finish=0.8min speed=20088K/sec
md1 : active raid1 sdb3[1] sda3[0]
76967296 blocks [2/2] [UU]

unused devices: <none>

[root@host raidinfo]# cat /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
Event: 5
md0 : active raid1 sdb1[1] sda1[0]
104320 blocks [2/2] [UU]

md2 : active raid1 sdb2[2] sda2[0]
1052160 blocks [2/1] [U_]
[=================>...] recovery = 88.3% (930124/1052160) finish=0.1min speed=10303K/sec
md1 : active raid1 sdb3[1] sda3[0]
76967296 blocks [2/2] [UU]

unused devices: <none>

And then just over a minute later, we are back in business!!

[root@host raidinfo]# cat /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
Event: 6
md0 : active raid1 sdb1[1] sda1[0]
104320 blocks [2/2] [UU]

md2 : active raid1 sdb2[1] sda2[0]
1052160 blocks [2/2] [UU]

md1 : active raid1 sdb3[1] sda3[0]
76967296 blocks [2/2] [UU]

unused devices: <none>

Crisis averted. Thanks to the Linux Software RAID HOWTO and this page on recovering from the failure. I really should set up some monitoring and brush up on my RAID knowledge.
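On the monitoring note, a degraded array is easy to catch from cron. A minimal sketch of my own (not from the original post); on systems with mdadm installed, running "mdadm --monitor --scan" as a daemon is the more robust option:

```shell
# Cron-able degraded-array alert (a sketch; assumes mail(1) is set up).
# An underscore in an mdstat status field such as [U_] means a member
# is missing, so mail root if any array shows one.
if grep -q '\[[U_]*_[U_]*\]' /proc/mdstat 2>/dev/null; then
    echo "RAID array degraded on $(hostname)" | mail -s "mdstat alert" root
fi
```

Dropped into /etc/cron.daily/, something like this would have caught the failure a month earlier instead of by accident during a memory upgrade.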

4 comments:


Anonymous said...

Hi.
After a serious mdadm RAID failure with 2 x 9GB SCSI disks a few years ago, I now have e-mail monitoring of all my software arrays.
Sometimes the electronics die, sometimes there are bad sectors, sometimes the controller is unable to handle it. Just keep an eye on it.
