Short answer – for performance reasons.
Less short answer – for single-threaded performance reasons.
FINAL UPDATE on 2014-09-16
The most elegant solution was finally discovered: do not create two partitions on each drive; just create RAID 10 instead of RAID 1, with only two devices. You get an instant single-threaded performance boost, and you never have to worry about device order again :)
For the best performance, the far2 layout needs to be used with both types of drives, SSD and SATA. On SSDs the difference compared to the default RAID 10 layout (near2) is small (820 MB/s vs 970 MB/s; a single disk does around 500 MB/s), but for SATA disks it is crucial (160 MB/s vs 320 MB/s; a single disk does around 160 MB/s).
Thanks to Neil and David Brown for hints and additional explanation. See this Linux RAID mailing list thread for details.
The content of this blog post was adjusted to reflect the final discoveries, so as not to mislead anyone stumbling upon it. The original content is preserved below the final conclusion.
END OF FINAL UPDATE
The good side of Linux software RAID
Linux software RAID 1 is a great thing. Personally I prefer it to shabby on-board RAID implementations, as it gives me the following options:
- no reboot (and physical presence) required when doing changes
- allows me to see what is going on with disks (easily scripted monitoring, e.g. via /proc/mdstat; see the example after this list)
- good performance with almost zero CPU overhead (unless RAID 5 is used)
- hard to get your data locked-in into some proprietary format
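For example, a minimal monitoring check can be scripted around /proc/mdstat or mdadm; the device name below is just a placeholder:
# A degraded array shows an underscore in the status brackets, e.g. [U_]
grep -A 2 '^md' /proc/mdstat
# Or query a specific array directly (replace /dev/md0 with your device)
mdadm --detail /dev/md0 | grep -E 'State :|Failed Devices'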
Welcome to the dark side
As always there are those little details that will get you. Linux software RAID has a few:
Bad thing #1: Single-threaded RAID 5 parity calculation
Contrary to the common-sense notion that Linux software RAID performs great and is usually bound by hardware limits (disk speed, SATA bus speed, PCI(e/-x) bus speed), when using RAID 5 the host CPU needs to calculate parity on the fly. With a good RAID controller, this would be done by the controller itself; with a shabby controller, this work is usually sneaked onto the host CPU by the driver to reduce hardware costs.
The problem with Linux software RAID 5 parity calculation is that it is single-threaded. It therefore does not matter how many CPU cores you have: only one will be used to calculate the parity, and it might well be your bottleneck (it certainly was mine, though the CPU was a bit older; I cannot tell you which one, as the machine is already gone).
There was some work going on to make this parity calculation multi-threaded, but I do not know how far that went.
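If you are on a recent kernel, there is a sysfs knob for RAID 5/6 worker threads that came out of that work; whether it exists and how much it helps depends on your kernel version, so treat this as something to experiment with rather than a guaranteed fix:
# Check whether the multi-threading knob is available (newer kernels only;
# /dev/md0 is a placeholder for your RAID 5/6 array)
cat /sys/block/md0/md/group_thread_cnt
# Allow up to 4 worker threads for parity work
echo 4 > /sys/block/md0/md/group_thread_cnt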
Bad thing #2: RAID 1 single-threaded client performance – our focus
Logic dictates that if we use RAID 1, data is mirrored across at least two devices. When writing, the speed will be limited by the slowest device; there is no way around that. But when reading, there are multiple devices to read from, so reads are surely optimized in such a manner that we get N*x MB/s (N being the number of devices and x the speed of the slowest device). Well, no.
If you use Linux software RAID 1, your single-threaded read performance will be limited by the throughput of a single underlying device. Only when multiple clients access the MD device simultaneously will the MD driver utilize more than one device.
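You can see this for yourself with a quick experiment; the device name is a placeholder, and iflag=direct is used so the page cache does not skew the numbers:
# A single sequential reader on a RAID 1 array tops out at single-disk speed
dd if=/dev/md0 of=/dev/null bs=1M count=5000 iflag=direct
# Two concurrent readers at different offsets give the MD driver a chance
# to serve each stream from a different mirror
dd if=/dev/md0 of=/dev/null bs=1M count=5000 iflag=direct &
dd if=/dev/md0 of=/dev/null bs=1M count=5000 skip=5000 iflag=direct &
wait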
Solution for #2: Use RAID 10 with ‘far 2’ layout
There is not much to say besides providing sample commands showing how to create it, my benchmark results, and a nice "go and see for yourself".
BTW: Does this work with ordinary SATA drives too? Yes.
# SSD
mdadm --create /dev/md16 --level=10 --metadata=0.90 --raid-devices=2 --layout=f2 /dev/sda6 /dev/sdb6
# SATA
mdadm --create /dev/md24 --level=10 --metadata=0.90 --raid-devices=2 --layout=f2 /dev/sdc4 /dev/sdd4
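To double-check that the resulting array really uses the far layout (device names as above), something like this should do; the exact wording of the output may differ between mdadm versions:
mdadm --detail /dev/md16 | grep -i layout
# Expected: Layout : far=2
cat /proc/mdstat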
Benchmark results for SSD
# SSD disk 1 - READ
dd if=/dev/sda6 of=/dev/null bs=1M count=10000
10485760000 bytes (10 GB) copied, 22.6481 s, 463 MB/s

# SSD disk 2 - READ
dd if=/dev/sdb6 of=/dev/null bs=1M count=10000
10485760000 bytes (10 GB) copied, 24.9486 s, 420 MB/s

# SSD RAID 10 - READ
dd if=/dev/md16 of=/dev/null bs=1M count=10000
10485760000 bytes (10 GB) copied, 10.755 s, 975 MB/s

# SSD RAID 10 - WRITE
dd if=/dev/zero of=/dev/md16 bs=1M count=10000
10485760000 bytes (10 GB) copied, 20.3645 s, 515 MB/s
Benchmark results for SATA
# SATA disk 1 - READ
dd if=/dev/sdc4 of=/dev/null bs=1M count=5000
5242880000 bytes (5.2 GB) copied, 31.9309 s, 164 MB/s

# SATA disk 2 - READ
dd if=/dev/sdd4 of=/dev/null bs=1M count=5000
5242880000 bytes (5.2 GB) copied, 31.6805 s, 165 MB/s

# SATA RAID 10 - READ
dd if=/dev/md24 of=/dev/null bs=1M count=5000
5242880000 bytes (5.2 GB) copied, 16.6258 s, 315 MB/s

# SATA RAID 10 - WRITE
dd if=/dev/zero of=/dev/md24 bs=1M count=5000
5242880000 bytes (5.2 GB) copied, 35.6997 s, 147 MB/s
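For reference, a sketch of the benchmark procedure: the cache drop before every run is taken from the original benchmarks further below, and conv=fsync on the write test (so dd waits for the data to actually reach the disk) is an addition worth making:
# Drop the page cache before every run, otherwise you benchmark RAM
echo 3 > /proc/sys/vm/drop_caches
# Read test (device name is an example)
dd if=/dev/md16 of=/dev/null bs=1M count=10000
# Write test - conv=fsync makes dd flush before reporting throughput
dd if=/dev/zero of=/dev/md16 bs=1M count=10000 conv=fsync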
Hardware used to perform these benchmarks:
- CPU: i7-3770
- Memory: 32GB (limited to 1GB for benchmarks)
- MB: Asus P8H77-M PRO
- Disks SSD: 2x Intel 530 SSD 240GB, plugged into onboard controller via 6Gbps SATA link
- Disks SATA: 2x Seagate ST3000DM001-1CH166 3TB, onboard 3Gbps SATA link
Conclusion with a final thought
Use RAID 10. Simple as that.
WHAT YOU SHOULD NOT BE DOING
This is what remains of the original content, from when I failed to realize that RAID 10 can be created with only two drives. I worked around this by creating multiple partitions on the same drive in order to provide 4+ partitions to the RAID 10 creation command.
# Creation
mdadm --create /dev/md31 --level=10 --name data --raid-disks 4 /dev/sda2 /dev/sdb2 /dev/sda3 /dev/sdb3
mdadm --create /dev/md32 --level=1 --name data --raid-disks 2 /dev/sda4 /dev/sdb4

# Before each benchmark this command was used:
echo 3 > /proc/sys/vm/drop_caches

# SSD Single-disk read performance:
dd if=/dev/sda6 of=/dev/null bs=1M count=5000
5000+0 records in
5000+0 records out
5242880000 bytes (5.2 GB) copied, 12.6332 s, 415 MB/s

# RAID 1 read performance:
dd if=/dev/md32 of=/dev/null bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 23.9353 s, 438 MB/s

# RAID 10 read performance:
dd if=/dev/md31 of=/dev/null bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 11.5981 s, 904 MB/s
UPDATE 2014-09-08 (ignore: superseded by the update from 2014-09-10 below)
I noticed these entries in the logs:
md11: WARNING: sdb2 appears to be on the same physical disk as sdb3.
md11: WARNING: sda2 appears to be on the same physical disk as sda3.
True protection against single-disk failure might be compromised.
The thing is, I cannot possibly be sure whether RAID 10 correctly places the mirrors across the two physical devices and stripes across those two mirrors, or does it the other way around. Trying to fail a device manually did not fully work: it worked for sdb2, but failing sdb3 was refused with “device busy”.
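For reference, the manual failure test looked roughly like this, using the array from the creation command above (the kernel log happens to show a different md number from another attempt):
mdadm /dev/md31 --fail /dev/sdb2    # accepted
mdadm /dev/md31 --fail /dev/sdb3    # refused with "device busy"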
Workaround (ignore: superseded by the update from 2014-09-10 below)
It involves creating two RAID 1 devices, then creating partition tables with a single partition on each of those two devices, and using those partitions as the basis of a RAID 0 device. This is called nested RAID 1+0. If you do it the way you would normally do it, the benefits shown in this article will not be noticeable. What you have to do is create one RAID 1 device with sdaX and sdbX, and then create the other RAID 1 device with sdbY and sdaY. Note the order when creating the second device: it is reversed.
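A sketch of that workaround with hypothetical partition numbers; the original setup puts a partition table with a single partition on each mirror and stripes those partitions, while for brevity the stripe below is created directly on the mirror devices. Note the reversed device order in the second mirror:
# First mirror: sda first, sdb second
mdadm --create /dev/md41 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
# Second mirror: sdb first, sda second (order reversed on purpose)
mdadm --create /dev/md42 --level=1 --raid-devices=2 /dev/sdb3 /dev/sda3
# RAID 0 stripe on top of the two mirrors
mdadm --create /dev/md40 --level=0 --raid-devices=2 /dev/md41 /dev/md42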
This hackish workaround does cause a certain performance penalty. I managed to squeeze out around 830MB/s (was around 900MB/s before) while reading. Writing was around 410 MB/s.
UPDATE 2014-09-10:
It turns out that having nested arrays is possible, but a major PITA for autodetection. I did not even try to have the root partition on it (but then, I do not use an initramfs, and in-kernel autodetection is more or less deprecated and will probably be phased out in a while (speculation!)).
I went back and retried the solution using straight RAID 10. It turns out that it works, but you must be careful how you create it – ORDER OF PARTITIONS YOU GIVE TO mdadm IS IMPORTANT!
### The good way
#
# This works ok, speed is as expected, I can fail and remove 2 devices
# and the array can be reassembled with only one physical device.
#
mdadm --create /dev/md64 --level=10 --metadata=1.2 --raid-devices=4 \
    /dev/sda4 /dev/sdb4 \
    /dev/sda5 /dev/sdb5

### The bad way
#
# This way you can only soft-fail (by mdadm --fail) one partition. If you try to fail
# the other one on the same device, it does not happen (device busy). If you stop the array,
# zero the superblock on the device you just tried and failed to fail (hehe:), and then try
# to reassemble, it fails with "assembled from 2 drives - not enough to start the array".
#
mdadm --create /dev/md66 --level=10 --metadata=1.2 --raid-devices=4 \
    /dev/sda4 /dev/sda5 \
    /dev/sdb4 /dev/sdb5

### Output of a properly-created RAID 10 array
#
# Make sure the order of devices is interchanged in the list (something on
# sda followed by something on sdb, then followed by something on sda, and so on)
#
user@host# mdadm --detail /dev/md126
...
    Number   Major   Minor   RaidDevice State
       0       8        4        0      active sync   /dev/sda4
       1       8       20        1      active sync   /dev/sdb4
       2       8        5        2      active sync   /dev/sda5
       3       8       21        3      active sync   /dev/sdb5

#
# This order is (and must be, if you want your performance back) maintained when
# devices are failed, removed and re-added. Only the numbers change; the numbers
# under "RaidDevice" stay identical.
#
    Number   Major   Minor   RaidDevice State
       0       8        4        0      active sync   /dev/sda4
       4       8       20        1      active sync   /dev/sdb4
       2       8        5        2      active sync   /dev/sda5
       5       8       21        3      active sync   /dev/sdb5
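If you want to verify that behaviour yourself, a fail/remove/re-add cycle on the properly-created array looks roughly like this (device names as above):
mdadm /dev/md64 --fail /dev/sdb4
mdadm /dev/md64 --remove /dev/sdb4
mdadm /dev/md64 --add /dev/sdb4
# The RaidDevice column should come back unchanged
mdadm --detail /dev/md64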
Autodetection
There are two modes of autodetection: in-kernel and initramfs-based. The former is slowly being deprecated in favour of the latter. With an initramfs you can do whatever you want, and this can be very distribution-specific, therefore we will not talk about it here (shortly: make sure it has a proper mdadm.conf and you are set; you should probably also disable in-kernel autodetection, or you will have to stop all autodetected arrays to avoid problems when assembling them again in the initramfs).
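A minimal sketch of what a "proper mdadm.conf" means in practice; the config path and the initramfs rebuild command are the Debian/Ubuntu variants and will differ on other distributions:
# Record the existing arrays in mdadm.conf
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
# Rebuild the initramfs so it picks up the updated configuration
update-initramfs -u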
Aaaaaaanyway, here are the results for in-kernel autodetection of RAID 10 arrays (kernel 3.16.2):
– metadata 0.90: works for non-degraded and degraded devices
– metadata 1.2: works for non-degraded and degraded devices too. Note that devices are renamed at boot, naming starts with /dev/md127 and counts down from there
Does this work with ordinary SATA drives too?
True, the tests were done on SSD disks. However I retested it with SATA disks and the results are:
– single drive sequential read speed: 111 MB/s
– RAID 10 sequential read speed: 94 MB/s
Why? At the time I was stunned. “How can my theory fail?”, I asked myself. Then I watched the behaviour, and there were reads going on over all partitions involved in said RAID 10 array. On SSDs this is not an issue, but on SATA drives those head seeks prevail and kill the expected performance. As an additional observation, the initial sync time was also significantly longer than expected: with 20 GB partitions at 100 MB/s I expected it to be over in about 3m20s times two, i.e. around 7 minutes, but it went on for about 13 minutes.
To verify the theory from the previous paragraph, I created a nested RAID 1+0 array to try to avoid the disk seeks. The array rebuild finished in around 7 minutes, as expected for syncing two 20 GB RAID 1 arrays.
But initial results were disappointing – around 60MB/s! WHAT? WHY?
Then I remembered the content of this article and created the second RAID 1 device with reversed device order. (Gee, thanks Bostjan, your blog contains some really useful information :)
Result? 216 MB/s sequential read!!!
The post was informative, nice workaround.
Two thumbs up.
———-
I learned a lot and I want to learn more before I test and create a mini lab here.
Listed below is my desktop. I want to make a mini lab, and this is my storage setup in a Linux RAID 10 array:
1x = OS: Ubuntu 14.04.3 server amd64
4x = RAID 10 array, f2 layout = storage for all files
Can this hardware serve as a server, or is there something else I need to add?
I need your comment on this, or an email.
CPU
Intel Core i3 2100
Cores 2
Threads 4
Name Intel Core i3 2100
Code Name Sandy Bridge
Package Socket 1155 LGA
Technology 32nm
Specification Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz
Family 6
RAM
Memory slots
Total memory slots 2
Used memory slots 2
Free memory slots 0
Memory
Type DDR3
Size 4096 MBytes
Channels # Dual
DRAM Frequency 665.0 MHz
Motherboard
Manufacturer ASUSTeK Computer INC.
Model P8H61-M LE (LGA1155)
Version System Version
Chipset Vendor Intel
Chipset Model Sandy Bridge
Chipset Revision 09
Southbridge Vendor Intel
Southbridge Model H61
Southbridge Revision B3
Hard Drives
ST3500413AS ATA Device
Manufacturer Seagate
Form Factor 3.5″
Interface
Heads 16
Cylinders 16383
SATA type SATA-II 3.0Gb/s
Device type Fixed
ATA Standard ATA8-ACS
LBA Size 48-bit LBA
Power On Count 197 times
Power On Time 1508.3 days
Speed, Expressed in Revolutions Per Minute (rpm) 7200
Features S.M.A.R.T., AAM, NCQ
Transfer Mode SATA III
Interface SATA
Capacity 488GB
Real size 500,107,862,016 bytes
RAID Type None
I am not sure, but shouldn’t you use dd with “conv=fsync” to make sure data is on disk before dd computes the throughput?