Archive for the ‘RAID’ Category

Linux software RAID: Devices renamed after reboot

Tuesday, September 9th, 2014

How to avoid it?

Use --metadata=0.90 when creating the array.

Keep in mind, though, that if you move the array to another machine which already has an array with the same name, one of the two will be renamed, and AFAIK you can not know in advance which one (unless you study the autodetection code of the Linux kernel).
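For illustration, the creation command could look something like this (the md number and device names are placeholders, adjust them to your setup):

# RAID 1 with the old 0.90 superblock, so in-kernel autodetection
# keeps the device name stable across reboots
mdadm --create /dev/md0 --level=1 --metadata=0.90 --raid-devices=2 /dev/sda1 /dev/sdb1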

Linux software RAID: why you should always use RAID 10 instead of RAID 1

Sunday, September 7th, 2014

Short answer – for performance reasons.
Less short answer – for single-threaded performance reasons.

FINAL UPDATE on 2014-09-16

The most elegant solution was finally discovered: do not create two partitions on each drive, just create RAID 10 instead of RAID 1, with only two devices. Instant single-threaded performance boost, and you never have to worry about device order again :)

For the best performance, the far2 layout needs to be used with both types of drives: SSD and SATA. The difference on SSD drives is small compared to the default RAID 10 layout (near2) (820MB/s vs 970MB/s; a single disk performs around 500MB/s), but for SATA disks it is crucial (160MB/s vs 320MB/s; a single disk performs around 160MB/s).
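If you want to check which layout an existing array actually uses, something along these lines should do (the md device name is a placeholder):

# prints e.g. "Layout : near=2" or "Layout : far=2"
mdadm --detail /dev/md16 | grep -i layout

# /proc/mdstat shows it too, e.g. "2 near-copies" or "2 far-copies"
cat /proc/mdstat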

Thanks to Neil and David Brown for hints and additional explanation. See this Linux RAID mailing list thread for details.

The content of this blog post was adjusted to reflect the final discoveries, in order not to mislead anyone stumbling upon it. The original content is preserved below the final conclusion.

END OF FINAL UPDATE

The good side of Linux software RAID

Linux software RAID 1 is a great thing. Personally I prefer it to shabby on-board RAID implementations, as it gives me the following options:

  • no reboot (and physical presence) required when doing changes
  • allows me to see what is going on with the disks (monitoring is easily scripted, see the sketch right after this list)
  • good performance with almost zero CPU overhead (unless RAID 5 is used)
  • hard to get your data locked into some proprietary format
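A minimal sketch of what I mean by easily scripted monitoring (the array name is a placeholder and the exact check is up to you):

#!/bin/sh
# quick-and-dirty RAID health check, e.g. for a cron job
cat /proc/mdstat                           # overview of all arrays and their sync state
mdadm --detail /dev/md0 | grep 'State :'   # e.g. "State : clean" or "State : clean, degraded"

Anything more elaborate (mail on a degraded array, etc.) is just a grep and an if away.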

Welcome to the dark side

As always, there are those little details that will get you. Linux software RAID has a few:

Bad thing #1: Single-threaded RAID 5 parity calculation

Contrary to the common-sense notion that Linux RAID performs great and is usually bound by hardware limits (disk speed, SATA bus speed, PCI(e/-x) bus speed), when using RAID 5 the host’s CPU needs to calculate parity on the fly. If you were using a good RAID controller, this would be done by the controller itself. If a shabby controller is used, this work is usually sneaked onto the host CPU by the driver, to reduce hardware costs.

The problem with Linux software RAID 5 parity calculation is that it is single-threaded. Therefore it does not matter how many CPU cores you have; only one will be used to calculate the parity, and it might well be your bottleneck (it certainly was mine, though that CPU was a bit older; I can not tell you which one, as the machine is already gone).

There was some work going on to make this parity calculation multi-threaded, but I do not know how far that went.

Bad thing #2: RAID 1 single-threaded client performance – our focus

Logic dictates that if we use RAID 1, data is mirrored across at least two devices. When writing, the speed will be limited by the slowest device, no way around it. But when reading, there are multiple devices to read from, so surely reads are optimized in such a manner that we get N*x MB/s (N being the number of devices and x the speed of the slowest device). Well, no.

If you use Linux software RAID 1, your single-threaded read performance will be limited by the throughput of a single underlying device. Only when multiple clients access the MD device simultaneously will the MD driver utilize more than one device.
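You can see this for yourself with a single-threaded sequential read straight off the MD device. A rough sketch (the device name is a placeholder; direct I/O is used so the page cache does not skew the result):

# single-threaded sequential read, 4GB worth of 1MB blocks
dd if=/dev/md0 of=/dev/null bs=1M count=4096 iflag=direct

On RAID 1 this should report roughly the speed of a single member device; with the RAID 10 far2 layout described below it gets much closer to the combined read speed.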

Solution for #2: Use RAID 10 with ‘far 2’ layout

There is not much to say besides providing sample commands showing how to create it, my benchmark results, and a nice “go and see for yourself”.
BTW: Does this work with ordinary SATA drives too? Yes.


# SSD
mdadm --create /dev/md16 --level=10 --metadata=0.90 --raid-devices=2 --layout=f2 /dev/sda6 /dev/sdb6

# SATA
mdadm --create /dev/md24 --level=10 --metadata=0.90 --raid-devices=2 --layout=f2 /dev/sdc4 /dev/sdd4

Benchmark results for SSD

Benchmark results for SATA

Hardware used to perform these benchmarks:

  • CPU: i7-3770
  • Memory: 32GB (limited to 1GB for benchmarks)
  • MB: Asus P8H77-M PRO
  • Disks SSD: 2x Intel 530 SSD 240GB, plugged into onboard controller via 6Gbps SATA link
  • Disks SATA: 2x Seagate ST3000DM001-1CH166 3TB, onboard 3Gbps SATA link

Conclusion with a final thought

Use RAID 10. Simple as that.


WHAT YOU SHOULD NOT BE DOING

This is what remains of the original content, from back when I failed to realize that RAID 10 can be created with only two drives. I worked around that by creating multiple partitions on the same drive, in order to provide 4+ partitions to the RAID 10 creation command.

UPDATE 2014-09-08 (ignore this, see the update from 2014-09-10 below):

I noticed these entries in the logs:

The thing is, I can not possibly be sure whether RAID 10 correctly places mirrors across the two physical devices and stripes across those two mirrors, or does it the other way around. Trying to fail the devices manually did not fully work: it worked for sdb2, but sdb3 was reported as “device busy”.

Workaround (ignore this, see the update from 2014-09-10 below):

It involves creating two RAID 1 devices, then creating partition tables with a single partition on each of those two devices, and using those partitions as the basis of a RAID 0 device. This is called nested RAID 1+0. If you do it the way you would normally do it, the benefits shown in this article will not be noticeable. What you have to do is create one RAID 1 device with sdaX and sdbX, and then create the other RAID 1 device with sdbY and sdaY. Note the order when creating the second device: it is reversed.
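A sketch of that workaround, with placeholder partition and md numbers (for brevity the RAID 0 here is built directly on top of the two md devices rather than on partitions created on them); note the reversed order in the second command:

# first mirror: sda first, sdb second
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
# second mirror: order reversed on purpose
mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdb3 /dev/sda3
# stripe across the two mirrors
mdadm --create /dev/md3 --level=0 --raid-devices=2 /dev/md1 /dev/md2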

This hackish workaround does cause a certain performance penalty. I managed to squeeze out around 830MB/s while reading (it was around 900MB/s before). Writing was around 410MB/s.

UPDATE 2014-09-10:

It turns out that having nested arrays is possible, but a major PITA for autodetection. I did not even try to have the root partition on it (but then, I do not use initramfs, and in-kernel autodetection is more or less deprecated and will probably be phased out in a while (speculation!)).

I went back and retried the solution using straight RAID 10. It turns out that it works, but you must be careful how you create it – THE ORDER OF THE PARTITIONS YOU GIVE TO mdadm IS IMPORTANT!
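To illustrate the point (partition names are placeholders): alternate the physical drives in the device list, so that the copies of each block end up on different disks.

# good: drives alternate in the list, so mirrored copies land on different physical disks
mdadm --create /dev/md10 --level=10 --layout=f2 --raid-devices=4 /dev/sda2 /dev/sdb2 /dev/sda3 /dev/sdb3

# bad: adjacent entries sit on the same physical disk, which can put both
# copies of some blocks on a single drive
mdadm --create /dev/md10 --level=10 --layout=f2 --raid-devices=4 /dev/sda2 /dev/sda3 /dev/sdb2 /dev/sdb3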

Autodetection

There are two modes of autodetection: in-kernel and initramfs-based. The former is slowly being deprecated in favour of the latter. With initramfs you can do whatever you want, and it tends to be very distribution-specific, therefore we will not go into detail here (shortly: make sure it has a proper mdadm.conf and you are set; you should probably also disable in-kernel autodetection, or you will have to stop all autodetected arrays to avoid problems when they are assembled again in the initramfs).
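As a sketch, generating that mdadm.conf can be as simple as this (the path and the follow-up initramfs rebuild are distribution-specific, which is exactly why I am not going into detail):

# append ARRAY lines for the currently assembled arrays
mdadm --detail --scan >> /etc/mdadm.conf

# a resulting line looks roughly like this (UUID is a placeholder):
# ARRAY /dev/md16 metadata=0.90 UUID=aaaaaaaa:bbbbbbbb:cccccccc:dddddddd

Then regenerate the initramfs with whatever tool your distribution uses, so it picks the file up.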

Aaaaaaanyway, here are the results. In-kernel autodetection of RAID 10 arrays (kernel 3.16.2):
– metadata 0.90: works for non-degraded and degraded devices
– metadata 1.2: works for non-degraded and degraded devices too; note that the devices are renamed at boot, naming starts with /dev/md127 and counts down from there

Does this work with ordinary SATA drives too?

True, the tests were done on SSD disks. However, I retested with SATA disks and the results are:
– single drive sequential read speed: 111 MB/s
– RAID 10 sequential read speed: 94 MB/s

Why? At the time I was stunned. “How can my theory fail?”, I said. Then I watched the behaviour, and there were reads going on over all partitions involved in said RAID 10 array. On SSD this is not an issue, but on SATA those head seeks prevail and kill the expected performance. As an additional observation, I also noticed that the initial sync took significantly longer than expected: at 100MB/s, a 20GB partition should sync in about 3m20s, times two equals around 7 minutes, but it went on for about 13 minutes.

To verify the theory from the previous paragraph, I created a nested RAID 1+0 array to try to avoid the disk seeks. The array rebuild finished in around 7 minutes, as expected for syncing two 20GB RAID 1 arrays.
But the initial results were disappointing – around 60MB/s! WHAT? WHY?

Then I remembered the content of this article and created the second RAID 1 device with reversed device order. (Gee, thanks Bostjan, your blog contains really useful information :).
The result? 216 MB/s sequential read!!!