How to replace soft RAID1 hard drive (Hetzner)

2014-07-10


Out Of Date Warning

This article was published on 10/07/2014, this means the content may be out of date or no longer relevant.
You should verify that the technical information in this article is still up to date before relying upon it for your own purposes.

Running your own (unmanaged) hardware means that, to some degree, it is your responsibility to fix hardware failures when they happen. We have been using Hetzner as a host for Empfehlungsbund.de for almost 2 years now and have already experienced 2 separate hard drive failures. Neither was a real problem, because both drives ran in RAID1 and could easily be replaced. This time, I want to document the steps I took, in the hope of saving myself and other customers time in the future.

Disclaimer: In case of problems, I take no responsibility for any damage. If you don't know what to do, take a managed option or ask a real sysadmin.

Receiving DegradedArray Event e-Mails

Normally, you will receive an e-mail on your admin/root account:

This is an automatically generated mail message from mdadm running on server.example.com.

A DegradedArray event had been detected on md device /dev/md0.

The first thing to do is log in as root and check which hard drives and RAID arrays are affected:

$ cat /proc/mdstat

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md2 : active raid1 sdb3[1]
723658368 blocks [2/1] [_U]

md1 : active raid1 sdb2[1] sda2[0]
524224 blocks [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[1]
8388544 blocks [2/2] [UU]

Things we see:

There are 3 RAID arrays (md0, md1 and md2), all running as raid1.
md0 runs on sda1 and sdb1, md1 runs on sda2 and sdb2, and both are operational ([UU]).
Partition sda3 is no longer listed for md2, so one member is missing from that array ([_U], where the underscore marks the missing member).

So, for the rest of the guide, we assume: sda is the broken drive, md2 the broken RAID array.
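If you want to script the check above, here is a minimal sketch. It assumes the status strings look like the ones in the output shown ([UU], [_U]); the function name degraded_arrays is made up for this example and is not part of mdadm.

```shell
#!/bin/sh
# Print the name of every md array whose status string (e.g. [_U])
# contains an underscore, i.e. has at least one missing member.
# degraded_arrays is a hypothetical helper; it reads mdstat-formatted
# text from the file given as $1 (default: /proc/mdstat).
degraded_arrays() {
    awk '
        /^md[0-9]+ :/        { name = $1 }     # remember the current array name
        /\[[U_]*_[U_]*\]/    { print name }    # status string with a missing member
    ' "${1:-/proc/mdstat}"
}
```

Calling degraded_arrays without an argument reads the live /proc/mdstat; passing a file name lets you try it on a saved copy first.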

You can get more information about the RAID with:

mdadm --detail /dev/md2

Running a quick SMART check showed a lot of errors in our case:

smartctl -a /dev/sda

Preparing change

Make backups! Also prepare a failover if you have the resources. The server has to be shut down for at least a couple of minutes. In the worst case, the server might not boot right away and you will have to book a rescue console.

Remove broken hard drive completely from all Arrays

If only one RAID array is degraded, removing the hard drive will only work if you also mark its partitions as failed in the other arrays:

mdadm --manage /dev/md1 --fail /dev/sda2
mdadm --manage /dev/md0 --fail /dev/sda1
# not needed for md2, because sda3 had already dropped out of it
# mdadm --manage /dev/md2 --fail /dev/sda3

Now you can remove it:

mdadm /dev/md0 -r /dev/sda1
mdadm /dev/md1 -r /dev/sda2
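With more arrays, typing the fail and remove pairs by hand gets error-prone. The following is only a sketch: it prints the commands instead of running them, so you can review them first. The helper name print_removal_commands and the array:partition argument format are made up for this example; the pairs used are the ones from our layout.

```shell
#!/bin/sh
# Print (not run) the mdadm commands that fail and then remove each
# partition of one drive. print_removal_commands is a hypothetical
# helper; each argument is an "array:partition" pair such as md0:sda1.
print_removal_commands() {
    for pair in "$@"; do
        array=${pair%%:*}    # text before the colon, e.g. md0
        part=${pair#*:}      # text after the colon, e.g. sda1
        echo "mdadm --manage /dev/$array --fail /dev/$part"
        echo "mdadm --manage /dev/$array --remove /dev/$part"
    done
}

# Pairs from this article: sda3 had already dropped out of md2,
# so only md0 and md1 are listed here.
print_removal_commands md0:sda1 md1:sda2
```

Once the printed commands look right for your own /proc/mdstat, you can run them one by one as root.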

Install GRUB

For us, /dev/sda was broken, so we decided to install the GRUB boot loader onto sdb:

sudo grub-install /dev/sdb

This seemed to work, because after the drive change the server came back without problems.

Changing hard drive

Hetzner has a special support form for hard drive replacements. They ask for 2 things:

A full SMART log
The serial number of the broken drive, or the serial number of the functional one (if the broken drive's serial number can't be retrieved).

1. SMART log

smartctl -x /dev/sda > smart.log
# Or send yourself a mail if you have sendmail/nullmailer/..
smartctl -x /dev/sda | mail -s 'SMART Log' myself@server.com

2. Serial Number

/sbin/udevadm info --query=property --name=sda | grep ID_SERIAL

## or

hdparm -i /dev/sda | grep SerialNo
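The grep for ID_SERIAL above usually matches more than one line. As a small sketch, assuming your udev version exports the ID_SERIAL_SHORT property for the drive, you can filter out just that value. The function name serial_from_properties is made up for this example.

```shell
#!/bin/sh
# Read udevadm "KEY=value" property lines on stdin and print only the
# value of ID_SERIAL_SHORT. serial_from_properties is a hypothetical
# helper; it prints nothing if the property is absent.
serial_from_properties() {
    sed -n 's/^ID_SERIAL_SHORT=//p'
}

# Usage (commented out, because it needs the real device):
# /sbin/udevadm info --query=property --name=sda | serial_from_properties
```

If it prints nothing, fall back to the hdparm variant and copy the serial number by hand.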

3. Do the replacement

Fill out the form and make an appointment
Hope the server will come back

After server restart

Copy the partition table from the healthy drive (sdb) to the new hard drive (sda):

sfdisk -d /dev/sdb | sfdisk /dev/sda

Put the drive back in the RAID arrays:

mdadm /dev/md0 -a /dev/sda1
mdadm /dev/md1 -a /dev/sda2
mdadm /dev/md2 -a /dev/sda3
grub-mkdevicemap -n

Wait for the resync to finish (it took 6 hours for us). NERD-Cinema:

watch cat /proc/mdstat
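If you would rather not stare at the screen for hours, a small polling loop can tell you when the rebuild is done. This is a sketch, assuming the kernel reports progress in /proc/mdstat with lines containing "recovery =" or "resync =" (or "resync=DELAYED" for arrays waiting their turn). The function name resync_running is made up for this example.

```shell
#!/bin/sh
# Succeed (exit status 0) while any array in the given mdstat-formatted
# file is still rebuilding or waiting to rebuild.
# resync_running is a hypothetical helper; $1 defaults to /proc/mdstat.
resync_running() {
    grep -Eq '(resync|recovery) *=' "${1:-/proc/mdstat}"
}

# Usage (commented out, so that sourcing this file has no side effects):
# while resync_running; do sleep 60; done
# echo "RAID rebuild finished"
```

You could replace the final echo with a mail command if you want to be notified.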
