====== Creating a 4-Disk RAID Array ======

===== The Hardware =====
The hardware I'm using is pretty standard stuff.  It's not a gamer PC, but it's relatively new technology, and it's very I/O-friendly:
  * Gigabyte [[http://www.gigabyte.us/Products/Motherboard/Products_Overview.aspx?ProductID=2958|GA EP45T-UD3LR]] motherboard ([[http://www.newegg.com/Product/Product.aspx?Item=N82E16813128371|Newegg link]])
  * Pentium D dual-core 3.4 GHz CPU
  * 4GB RAM
  * 1 80GB EIDE disk
  * 4 1TB SATA disks
===== Hardware-based vs. Software-based =====

The big question when doing RAID is: hardware or software?  The hardware approach requires a fairly expensive controller card, while the software approach requires a more complex setup and a processor fast enough to do the checksum calculations in memory.  I did a lot of research and will try to summarize it in a table:
  
-I did a lot of research and found some sad realities.  I'll try to explain those via a table:+^Option                                    ^Hardware RAID     ^Software RAID  ^Fake RAID     ^ 
 +|Has CPU Overhead                          |  No              |  Yes          |   Yes        | 
 +|Requires controller hardware              |  Yes             |  No           |   Yes        | 
 +|Requires OS drivers                       |  No              |  Yes          |   Yes        | 
 +|Platform-independent                      |  Yes             |  No           |   No         | 
 +|Supports all RAID levels                  |  Yes             |  Yes          |   Maybe      | 
 +|Data usable with different h/w or s/w     |  No              |  Yes          |   Maybe      |
  
The "fake RAID" column exists for cost-saving reasons.  It was introduced as a "best of both worlds" solution...combining the low cost of Software RAID with the accelerated performance of a specialized disk controller.  But the market moved the other way: hardware advances in the late 1990s made it unnecessary.  Microsoft introduced Software RAID in Windows 2000 Server, and Linux added RAID support in early 2001.  The market for Fake RAID has grown smaller ever since.
  
As for my system--I'm sticking with a pure software RAID solution, even though I have access to a Fake RAID controller card.
  
===== The Software Setup =====
  
</code>
==== Re-adding a partition ====
Suppose you forget to update your ''/etc/mdadm/mdadm.conf'' file after changing an array with ''mdadm'', and then you reboot (like I did).  You'll end up with an array that looks like this...
<code>
</code>
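
The recovery boils down to re-adding the partition and then recording the array in ''/etc/mdadm/mdadm.conf'' so the change survives the next reboot.  A minimal sketch (assuming the array is ''/dev/md0'' and the dropped member is ''/dev/sdc1''; substitute your own devices):

<code>
# put the missing partition back into the array
mdadm --add /dev/md0 /dev/sdc1

# watch the resync progress
cat /proc/mdstat

# once the array is clean, record its definition so the
# next boot assembles it correctly
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
</code>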
  
==== Replacing a bad device ====
Well, it eventually happens to everyone.  One of my RAID drives went bad.  I don't understand why--it wasn't doing anything demanding or different.  But luckily I got this email, thanks to a properly configured ''/etc/mdadm/mdadm.conf'' file:

<code>
This is an automatically generated mail message from mdadm
running on werewolf

A FailSpare event had been detected on md device /dev/md0.

It could be related to component device /dev/sdc1.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdc1[4](F) sdb1[0] sdd1[1] sde1[3]
      2930279808 blocks level 5, 128k chunk, algorithm 2 [4/3] [UU_U]
      [=>...................]  recovery =  7.6% (75150080/976759936) finish=1514.9min speed=9919K/sec

unused devices: <none>
</code>

The important thing is the ''(F)'' next to the ''sdc1'' partition.  It means the device has failed.  I power-cycled the machine and the array came up in "degraded, recovering" status, but it failed again after several hours of rebuilding.  After two or three attempts, I decided the drive was bad (or at least bad enough to warrant replacing).  Here are the steps:

  - Run ''mdadm --remove /dev/md0 /dev/sdc1'' to remove the bad partition from the array
  - Replace the faulty drive with a new one
  - Use ''fdisk'' as described above to set up the drive like the others
  - Run ''mdadm --add /dev/md0 /dev/sdc1'' to add the new partition to the array

After that, ''cat /proc/mdstat'' reported the array was recovering.  It took nearly 6 hours to rebuild the data, but everything went back to normal.  No lost data.
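
One shortcut worth mentioning for step 3: instead of repeating the ''fdisk'' steps by hand, the partition table can be cloned from a healthy member.  A sketch, assuming ''/dev/sdb'' is a good drive and ''/dev/sdc'' is the replacement (triple-check the device names first):

<code>
# dump the partition layout of a known-good RAID member...
sfdisk -d /dev/sdb > table.txt

# ...and apply the same layout to the replacement drive
sfdisk /dev/sdc < table.txt
</code>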
  
===== Benchmarking for Performance =====
  
</code>
==== Test Results ====
  
The script produced some really interesting statistics, which I'll summarize here.
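
A note on where these numbers come from: the measures below are the ones GNU ''time'' reports in verbose mode.  If you want to reproduce a single measurement without the whole script, something like this works (filesystem creation shown; I'm assuming ''/dev/md0'' is your array):

<code>
# GNU time's -v flag reports elapsed time, major (I/O-requiring)
# page faults, filesystem inputs/outputs, and average CPU use
/usr/bin/time -v mkfs.ext3 /dev/md0
</code>
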
^  **FILESYSTEM CREATION**  ^^^^
^  Measure              ^  EXT3        ^  JFS         ^  XFS          ^
|Elapsed Time           |  16:01.71    |  0:03.93     |  0:08.42      |
|Faults needing I/O     |    7         |    1         |  0            |
|# filesystem inputs    |    944       |    128       |  280          |
|# filesystem outputs   |  92,232,352  |  781,872     |  264,760      |
|% CPU use (avg)        |   11%        |   29%        |  1%           |
|# CPU seconds used     |   109.01     |   1.10       |  0.08         |
  
I'm not overly concerned with the time it takes to create a filesystem.  It's an administrative task that I only do when setting up a new drive.  But I couldn't help noticing the huge difference between EXT3 and the other formats: EXT3 took 16 minutes to create the filesystem, while the others took just a few seconds.  The number of filesystem outputs was similarly imbalanced.  Not a good start for EXT3.
  
^  **IOZONE EXECUTION**  ^^^^
^  Measure              ^  EXT3        ^  JFS         ^  XFS          ^
|Elapsed Time           |  13:27.74    |  10:23.15    |  10:57.16     |
|Faults needing I/O     |              |    3         |  0            |
|# filesystem inputs    |    576       |   656        |  992          |
|# filesystem outputs   |  95,812,576  |  95,845,872  |  95,812,568   |
|% CPU use (avg)        |   29%        |  27%         |  29%          |
|# CPU seconds used     |   230.32     |  165.07      |  187.96       |

IOZONE provides some useful performance statistics for the disks.  The stats above were gathered while it ran the same tests against each filesystem.  EXT3 took longer to run the tests (3 minutes longer than JFS and 2.5 minutes longer than XFS) and used more CPU time (65 and 42 seconds more, i.e. 39% and 23% extra).  JFS has a slight advantage over XFS, but EXT3 is in a distant 3rd place.
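
The numbers came from running iozone against each filesystem.  A representative invocation looks something like this (illustrative parameters; my script's exact flags may differ):

<code>
# automatic mode over a range of record sizes, capped at a 2GB file,
# using a scratch file on the filesystem under test
iozone -a -g 2g -f /mnt/raid/iozone.tmp
</code>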

^  **5GB FILE CREATION**  ^^^^
^  Measure              ^  EXT3        ^  JFS         ^  XFS          ^
|Elapsed Time           |   1:01.54    |  1:07.51     |  00:56.08     |
|Faults needing I/O     |    0         |   0          |  5            |
|# filesystem inputs    |    312       |  1200        |  560          |
|# filesystem outputs   |  9,765,640   |  9,785,920   |  9,765,672    |
|% CPU use (avg)        |   38%        |  20%         |  24%          |
|# CPU seconds used     |   23.88      |  13.72       |  14.00        |

Creation of multi-gigabyte files will be a routine event on this machine (since it will be recording TV shows daily).  Each filesystem took just over 1 minute to create the file.  As I expected, EXT3 had significantly higher CPU utilization than JFS and XFS (90% and 58% higher, respectively).  The number of CPU seconds used was higher too (74% and 70%, respectively).  These small numbers don't look significant until you consider how much disk I/O a media server generates.
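
The 5GB test file itself can be produced with ''dd'' (a sketch; the block size and path here are illustrative, not necessarily what my script used):

<code>
# write 5120 one-megabyte blocks of zeroes = 5GB, timed by GNU time
/usr/bin/time -v dd if=/dev/zero of=/mnt/raid/bigfile bs=1M count=5120
</code>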

^  **5GB FILE DELETION**  ^^^^
^  Measure              ^  EXT3        ^  JFS         ^  XFS          ^
|Elapsed Time           |  00:00.96    |  00:00.05    |  00:00.06     |
|Faults needing I/O     |    0         |   0          |  2            |
|# filesystem inputs    |    0         |   0          |  320          |
|# filesystem outputs   |    0         |   0          |  0            |
|% CPU use (avg)        |    98%       |   8%         |  0%           |
|# CPU seconds used     |    0.95      |   0.00       |  0.00         |

File deletion is a big deal when running a media server.  People want to delete a large file (i.e. a recorded program) and immediately continue using their system.  I've experienced long delays with EXT3 before -- sometimes 10-15 seconds to delete a file.  The statistics here don't reflect delays that long, but they do indicate a problem: the elapsed time is 19x and 16x longer with EXT3 than with JFS and XFS, and the CPU use and CPU seconds follow the same pattern.

Obviously, EXT3 is out of the running here, so I'll stop talking about it.  The real decision is between JFS and XFS.  Both have similar statistics, so I decided to search the internet for relevant info.  Here are some sources that swayed my opinion:
  * [[http://www.debian-administration.org/articles/388|This article]] says "Conclusion:  For quick operations on large files, choose JFS or XFS. If you need to minimize CPU usage, prefer JFS."
  * [[wp>Comparison_of_file_systems]]
  * [[http://www.mythtv.org/wiki/RAID|MythTV RAID guide]]
  * [[http://unthought.net/Software-RAID.HOWTO/Software-RAID.HOWTO.html|Linux Software RAID Guide]]

===== And the winner is... =====
The winner is: XFS.  I've been using it for several years on my MythTV box with no issues.  My recorded programs are stored on an LVM volume formatted with XFS.  The volume itself spans 4 drives from different manufacturers((Seagate, Maxtor, and Western Digital)), with different capacities((200GB, 300GB, 500GB, and 250GB)) and interfaces((EIDE and SATA)).  My recording and playback performance is great, especially when you consider that my back-end machine serves 4 front-ends (one of which runs on the back-end machine itself).  And the filesystem's delete performance is perfect: about 1 second to delete a recording (normally a 2-6GB file).

JFS has maturity on its side--it has been used in IBM's AIX for more than 10 years.  It offers good performance, has good recovery tools, and has the stamp of approval from MythTV users.  But I'm going to run it on a RAID system, and I could find very little information on the internet about that combination.

In contrast, XFS has format-time options specifically for RAID setups.  There have been [[http://terapix.iap.fr/cplt/oldSite/hard/cluster/raid_config.html|reports]] of 10% CPU savings when you tell XFS about your RAID stripe size at format time.  That means more free CPU time for transcoding and other CPU-intensive tasks.

===== RAID Parameter Calculator =====
Calculating the parameters for a RAID array is a tedious process.  Fortunately, someone on the MythTV website had already written a [[http://www.mythtv.org/wiki/Optimizing_Performance#Further_Information|shell script]] to help calculate the proper values for an array.  I converted it to JavaScript, and I offer it here for your convenience.  If you find any errors or improvements, [[chris@thefreyers.net|please let me know]].

Note: ''blocksize'' refers to the size (in bytes) of a single chunk of disk space.  In Linux, it can't be larger than the size of a memory page (called ''pagesize'').  So how do you find your ''pagesize''?  In Ubuntu, run ''getconf PAGESIZE'' at the command line.  In my case the value is 4096.  It might be slightly different on other systems.
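
As a worked example, here's the arithmetic for my own array (4 spindles in RAID 5, 128KiB chunks, 4096-byte blocks): the stripe unit is 128 * 1024 / 512 = 256 sectors, and with 3 data-bearing disks the stripe width is 3 * 256 = 768 sectors.  The calculator would produce this format command:

<code>
mkfs.xfs -b size=4096 -d sunit=256,swidth=768 -L mylabel /dev/md0
</code>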

<HTML>
<script type="text/javascript">
function doUpdate()
{
  //-----data entry checks-----
  var block = document.getElementById('block');
  if (block.value == "" || !isNumeric(block.value)){
    alert("BLOCK value must be a number.");
    block.focus();
    return;
  }
  var chunk = document.getElementById('chunk');
  if (chunk.value == "" || !isNumeric(chunk.value)){
    alert("CHUNK value must be a number.");
    chunk.focus();
    return;
  }
  var spind = document.getElementById('spind');
  if (spind.value == "" || !isNumeric(spind.value)){
    alert("# OF SPINDLES must be a number.");
    spind.focus();
    return;
  }
  var raidtype = document.getElementById('raidtype');
  var r = raidtype.value;
  if (r == "" || !isNumeric(r) || !(r==0 || r==1 || r==10 || r==5 || r==6)){
    alert("RAID TYPE must be 0, 1, 10, 5, or 6.");
    raidtype.focus();
    return;
  }
  var devname = document.getElementById('devname');
  if (devname.value == ""){
    alert("RAID DEVICE NAME is required.");
    devname.focus();
    return;
  }
  var fslabel = document.getElementById('fslabel');
  if (fslabel.value == ""){
    alert("FILE SYSTEM LABEL is required.");
    fslabel.focus();
    return;
  }

  //assume we have good values from here on...
  var txtarea = document.getElementById('textout');
  var outtext = "";

  //number of data-bearing disks: RAID 5 loses one spindle to parity,
  //RAID 6 loses two; RAID 0, 1, and 10 keep the spindle count as-is
  var RAIDDISKS = spind.value;
  switch (parseInt(raidtype.value,10))
  {
    case 5: RAIDDISKS = spind.value - 1; break;
    case 6: RAIDDISKS = spind.value - 2; break;
  }

  var SUNIT = chunk.value*1024/512;  //stripe unit in 512-byte sectors
  var SWIDTH = RAIDDISKS*SUNIT;      //stripe width = data disks x stripe unit

  outtext+="Data you provided:\n";
  outtext+="Blocksize="+block.value+", ";
  outtext+="Chunk Size="+chunk.value+" KiB (software RAID chunk size), ";
  outtext+="# Spindles="+spind.value+", ";
  outtext+="RAID Type="+raidtype.value+", ";
  outtext+="RAID Disks for data="+RAIDDISKS+"\n\n";
  outtext+="Values calculated for you:\n";
  outtext+="    Stripe Unit="+SUNIT+"\n";
  outtext+="    Stripe Width="+SWIDTH+"\n";
  outtext+="Your mkfs line:\n";
  outtext+="    mkfs.xfs -b size="+block.value+" -d sunit="+SUNIT+",swidth="+SWIDTH+" -L "+fslabel.value+" "+devname.value+"\n";
  outtext+="Your mount line:\n";
  outtext+="    mount -o remount,sunit="+SUNIT+",swidth="+SWIDTH+"\n";
  outtext+="Your fstab options:\n";
  outtext+="    sunit="+SUNIT+",swidth="+SWIDTH+"\n";

  txtarea.value=outtext;
}
function isNumeric(sText)
{
  var ValidChars = "0123456789.";
  var IsNumber = true;
  var Char;
  for (var i = 0; i < sText.length && IsNumber == true; i++)
  {
    Char = sText.charAt(i);
    if (ValidChars.indexOf(Char) == -1) IsNumber = false;
  }
  return IsNumber;
}
</script>
<FORM METHOD=POST ACTION="" NAME="form1">
<TABLE>
<TR><TD>Blocksize</TD><TD><INPUT TYPE="text" NAME="block" id="block" value="4096"> bytes</TD></TR>
<TR><TD>Chunksize</TD><TD><INPUT TYPE="text" NAME="chunk" id="chunk" value="128"> KiB</TD></TR>
<TR><TD># of Spindles</TD><TD><INPUT TYPE="text" NAME="spind" id="spind" value="4"></TD></TR>
<TR><TD>RAID Type</TD><TD><INPUT TYPE="text" NAME="raidtype" id="raidtype" value="5"> (0, 1, 10, 5, 6)</TD></TR>
<TR><TD>RAID Device Name</TD><TD><INPUT TYPE="text" NAME="devname" id="devname" value="/dev/md0"></TD></TR>
<TR><TD>File System Label</TD><TD><INPUT TYPE="text" NAME="fslabel" id="fslabel" value="mylabel"></TD></TR>
<TR><TD><INPUT TYPE="button" VALUE="Calculate" onClick="doUpdate();"></TD></TR>
<TR><TD COLSPAN="2"><TEXTAREA NAME="textout" ROWS="15" COLS="80" id="textout"></TEXTAREA></TD></TR>
</TABLE>
</FORM>
</HTML>