Creating a 4-Disk RAID Array

The Hardware

The hardware I'm using is pretty standard stuff. It's not a gamer PC, but it's relatively new technology, and it's very I/O-friendly:

  • Gigabyte GA 945P-S3 motherboard - no RAID, no hotswap, 1 EIDE port, 4 SATA II ports
  • Pentium D dual-core 3.4 GHz CPU
  • 2GB RAM
  • 1 80GB EIDE disk
  • 4 1TB SATA disks
  • RAID drive cage - 4 hot-swap bays, pass-thru data & power connectors, built-in cooling fan
  • Thermaltake 750W power supply

[Hard disk connection diagram]

For the sake of brevity, I'll get right to the I/O setup. The drive configuration is shown in the diagram. An 80GB EIDE drive is connected to the PATA connector on the motherboard. The BIOS detects it as the first disk, which is perfect for this setup. The four SATA II drives are snapped into a drive cage, and their data connections are plugged into the four SATA ports on the motherboard. My board doesn't support hotswap, so I'll have to power off the system to replace a drive if one fails.

Hardware-based vs. Software-based

The big question when doing RAID is: hardware or software? The hardware approach requires a fairly expensive controller card, but it hides the complexity of RAID from the operating system. The software approach requires a more involved setup and a machine fast enough to do the parity calculations for the data being stored (something older machines struggled with).

I did a lot of research and found some sad realities. I'll try to explain those via a table:

Option                                  Software RAID   Fake RAID   Hardware RAID
Has CPU overhead                        Yes             Yes         No
Requires controller hardware            No              Yes         Yes
Requires OS drivers                     Yes             Yes         No
Platform-independent                    No              No          Yes
Supports all RAID levels                Yes             Maybe       Yes
Data usable with different h/w or s/w   Yes             Maybe       No

The entire “fake RAID” column exists for Windows machines because Windows didn't have software RAID support until Windows Server 2003 and Windows Vista. Hopefully this approach will disappear now…time will tell. I'm not a fan of that approach (long story), so I'm sticking with a 100% software approach for now, even though I have access to a fake RAID controller card.

The Software Setup

The software setup involves partitioning all the drives, building a RAID array, formatting it, and adding it to the system configuration.

Partitioning Drives

Drive partitioning is the process of splitting a disk into different logical sections (kind of like different songs on a CD). It has to be done before an operating system can use the drives. Normally this happens at install time, but I wanted to wait until after the install and configure the disks myself.

I completed my partitioning with cfdisk. It's a little more robust than fdisk, and the fdisk man page recommends cfdisk for what I was doing (creating partitions for use on Linux). I wanted to delete any partitions that already existed, allocate 100% of each disk to a primary partition, and set the partition type to FD, the Linux RAID autodetect partition type.

This is what my console looked like before configuring the first data drive (/dev/sdb).

root@werewolf:~# cfdisk /dev/sdb
                        cfdisk (util-linux-ng 2.14.2)                        
                                                                             
                            Disk Drive: /dev/sdb                             
                    Size: 1000204886016 bytes, 1000.2 GB                     
           Heads: 255   Sectors per Track: 63   Cylinders: 121601            
                                                                             
   Name        Flags     Part Type  FS Type         [Label]        Size (MB) 
 --------------------------------------------------------------------------- 
                          Pri/Log   Free Space                    1000202.28 
                                                                             
                                                                             
    [   Help   ]  [   New    ]  [  Print   ]  [   Quit   ]  [  Units   ]     
    [  Write   ]                                                             
                                                                           

I used the New option and followed the prompts. When finished, my screen looked like the one below. The last step is to select the Write command to save the settings to disk.

                            Disk Drive: /dev/sdb                             
                    Size: 1000204886016 bytes, 1000.2 GB         
           Heads: 255   Sectors per Track: 63   Cylinders: 121601       
                                                                        
   Name        Flags     Part Type  FS Type         [Label]        Size (MB) 
 --------------------------------------------------------------------------- 
   sdb1                   Primary   Linux raid autodetect         1000202.28 
                                                                        

    [ Bootable ]  [  Delete  ]  [   Help   ]  [ Maximize ]  [  Print   ]
    [   Quit   ]  [   Type   ]  [  Units   ]  [  Write   ]

The partition tool seemed concerned that I wasn't marking the partition as bootable. It warned me when I hit Write, and left a message on my screen after exiting cfdisk. In my case, this is OK. I have a different disk (/dev/sda) that is bootable.

root@werewolf:~# cfdisk /dev/sdb
Disk has been changed.
WARNING: If you have created or modified any
DOS 6.x partitions, please see the cfdisk manual
page for additional information.
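
Repeating those keystrokes for every drive gets tedious. A scripted alternative is sketched below; it assumes the classic one-line sfdisk input format (worth checking against your util-linux version), and it will overwrite whatever partition table each disk already has, so verify the device names first.

#!/bin/bash
# Non-interactive alternative to the cfdisk session above: give each
# data disk one primary partition spanning the whole drive, with
# type fd (Linux raid autodetect).
for disk in /dev/sdb /dev/sdc /dev/sdd /dev/sde; do
    echo "Partitioning $disk"
    echo ',,fd' | sfdisk "$disk"
done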

After repeating the above process for all 4 of my data drives (/dev/sdb, /dev/sdc, /dev/sdd, and /dev/sde), I ran the fdisk -l command to see what my system's partitioning looked like:

root@werewolf:~# fdisk -l

Disk /dev/sda: 80.0 GB, 80026361856 bytes
255 heads, 63 sectors/track, 9729 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00078cba

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1        8733    70147791   83  Linux
/dev/sda2            8734        9729     8000370    5  Extended
/dev/sda5            8734        9729     8000338+  82  Linux swap / Solaris

Disk /dev/sdb: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00038d83

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1      121601   976760001   fd  Linux raid autodetect

Disk /dev/sdc: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1               1      121601   976760001   fd  Linux raid autodetect

Disk /dev/sdd: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sdd1               1      121601   976760001   fd  Linux raid autodetect

Disk /dev/sde: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sde1               1      121601   976760001   fd  Linux raid autodetect

Building the RAID Array

On my system, mdadm wasn't installed by default, so I installed it:

root@werewolf:~# apt-get install mdadm

It has a dependency on postfix (because mdadm can send email on drive failures), so I had to take a two-minute detour and configure that on Ubuntu as part of the mdadm install.
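
As a side note, the interactive postfix screens can be skipped entirely by letting debconf take its default answers (a sketch; postfix can be adjusted later with dpkg-reconfigure postfix):

# Non-interactive install: debconf uses default answers, so postfix
# gets a stock configuration that can be tuned afterwards.
DEBIAN_FRONTEND=noninteractive apt-get -y install mdadm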

The command I need to run is:

mdadm --create /dev/md0 --verbose --chunk=64 --level=raid5 --raid-devices=4 /dev/sd[bcde]1 

but I'm not sure about the chunk size…what's that? According to the Software RAID HowTo, it's the amount of data that gets written to a single disk before the write moves on to the next disk in the array.

Obviously a larger chunk size will minimize the number of disk writes, but it will increase the compute time needed to generate each parity block. There's probably a happy medium in there somewhere, but it's going to be affected by the type of data being written to the array. In my case, I know the array will be used mostly for A/V files (photos, music, and movies), so I'm willing to try out a large chunk size. The HowTo recommends 128K, so I'll go with that instead of the default of 64.

root@werewolf:~# mdadm --create /dev/md0 --verbose --chunk=128 --level=raid5 --raid-devices=4 /dev/sd[bcde]1
mdadm: layout defaults to left-symmetric
mdadm: /dev/sdb1 appears to contain an ext2fs file system
    size=970735624K  mtime=Sat Aug 22 16:31:34 2009
mdadm: /dev/sde1 appears to contain an ext2fs file system
    size=970735624K  mtime=Sat Aug 22 16:31:34 2009
mdadm: size set to 976759936K
Continue creating array? y
mdadm: array /dev/md0 started.

Nice! The command gave my terminal back, rather than locking it up for untold hours. I was afraid I'd have to run it with nohup or put it in a background process. So…aside from the blinking lights on my hard drive bays, how can I tell when my array is created? Lucky for me, there's a file in the /proc directory that I can cat:

root@werewolf:~# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sde1[4] sdd1[2] sdc1[1] sdb1[0]
      2930279808 blocks level 5, 128k chunk, algorithm 2 [4/3] [UUU_]
      [>....................]  recovery =  0.0% (700032/976759936) finish=255.5min speed=63639K/sec

unused devices: <none>

Excellent! Notice the finish parameter…it tells me the array will be built in 255 minutes. After waiting that long, I ran this command and was greeted with a nice statistics page:

root@werewolf:/var/log$ mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90
  Creation Time : Mon Aug 24 02:24:20 2009
     Raid Level : raid5
     Array Size : 2930279808 (2794.53 GiB 3000.61 GB)
  Used Dev Size : 976759936 (931.51 GiB 1000.20 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Mon Aug 24 11:30:38 2009
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 128K

           UUID : 44062c56:b84201d7:bba6c3a9:4fef2f9c (local to host werewolf)
         Events : 0.4

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
       2       8       49        2      active sync   /dev/sdd1
       3       8       65        3      active sync   /dev/sde1

Formatting the Array

The command I need to run is:

mkfs.ext3 -b 4096 -E stride=32,stripe-width=96 /dev/md0

This command uses the ext3 filesystem. I got the parameters from a calculator here. It sets the block size to 4096 bytes (only 1024, 2048, and 4096 are available on my system). The stride is the chunk size divided by the filesystem block size (128K / 4K = 32), and the stripe-width is the stride times the number of data-bearing disks (32 × 3 = 96, since one disk's worth of every 4-disk RAID5 stripe holds parity).
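
Here's a tiny sketch of that arithmetic in shell form, handy if the chunk size or disk count changes later (the variable names are mine, not from any tool):

# stride/stripe-width for ext3 on RAID5, from chunk size and block size
chunk_kb=128                            # mdadm --chunk value
block_kb=4                              # mkfs.ext3 -b 4096
data_disks=3                            # 4 drives minus 1 parity disk's worth
stride=$((chunk_kb / block_kb))         # 128 / 4 = 32
stripe_width=$((stride * data_disks))   # 32 * 3 = 96
echo "stride=$stride stripe-width=$stripe_width"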

root@werewolf:~# mkfs.ext3 -b 4096 -E stride=32,stripe-width=96 /dev/md0
mke2fs 1.41.4 (27-Jan-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
183148544 inodes, 732569952 blocks
36628497 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=0
22357 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
    32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
    4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
    102400000, 214990848, 512000000, 550731776, 644972544

Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 23 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.

Adding the Array to System Startup

All the above details are for setting up an array the first time. But so far we haven't told the OS how to reassemble the array at boot time. We do that with the mdadm.conf file (on my Ubuntu system it lives at /etc/mdadm/mdadm.conf; some distributions use /etc/mdadm.conf). The file format is fully explained in the man page for mdadm.conf. In my case, I need to tell it about the 4 partitions that contain data, and about the array itself (RAID level, number of devices, etc.).

Below is my mdadm.conf file. The last two lines give the email address to notify and the name of a program to run when md-related events occur, so I'll hear about it if something goes wrong.

DEVICE /dev/sdb1
DEVICE /dev/sdc1
DEVICE /dev/sdd1
DEVICE /dev/sde1

ARRAY /dev/md0
    devices=/dev/sd[bcde]1
    num-devices=4
    level=5

MAILADDR chris@thefreyers.net

PROGRAM /usr/sbin/handle-mdadm-events
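
A handy shortcut: mdadm can print the running array in config-file syntax, so the ARRAY line doesn't have to be typed by hand. On Debian/Ubuntu it's also worth rebuilding the initramfs afterwards so the boot-time copy of the config matches the file you just edited.

# Print the active array in mdadm.conf syntax (review, then paste into the file):
mdadm --detail --scan
# Debian/Ubuntu: refresh the initramfs so early boot sees the new config
update-initramfs -u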

Deleting an Array

If you need to completely delete an array (perhaps because you built it with the wrong parameters and want to start over), here's how:

root@werewolf:~# mdadm /dev/md0 --fail /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md0
root@werewolf:~# mdadm /dev/md0 --fail /dev/sdc1
mdadm: set /dev/sdc1 faulty in /dev/md0
root@werewolf:~# mdadm /dev/md0 --fail /dev/sdd1
mdadm: set /dev/sdd1 faulty in /dev/md0
root@werewolf:~# mdadm /dev/md0 --fail /dev/sde1
mdadm: set /dev/sde1 faulty in /dev/md0
root@werewolf:~# mdadm /dev/md0 --remove /dev/sdb1
mdadm: hot removed /dev/sdb1
root@werewolf:~# mdadm /dev/md0 --remove /dev/sdc1
mdadm: hot removed /dev/sdc1
root@werewolf:~# mdadm /dev/md0 --remove /dev/sdd1
mdadm: hot removed /dev/sdd1
root@werewolf:~# mdadm /dev/md0 --remove /dev/sde1
mdadm: hot removed /dev/sde1
root@werewolf:~# mdadm --stop /dev/md0
mdadm: stopped /dev/md0
root@werewolf:~# mdadm --zero-superblock /dev/sdb1
root@werewolf:~# mdadm --zero-superblock /dev/sdc1
root@werewolf:~# mdadm --zero-superblock /dev/sdd1
root@werewolf:~# mdadm --zero-superblock /dev/sde1
cfdisk /dev/sdb  (delete partition, write partition table, then recreate)
cfdisk /dev/sdc  (delete partition, write partition table, then recreate)
cfdisk /dev/sdd  (delete partition, write partition table, then recreate)
cfdisk /dev/sde  (delete partition, write partition table, then recreate)
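
Failing and removing each member first is the cautious route. If the filesystem is already unmounted, the same end state can usually be reached more directly by stopping the array and zeroing the superblocks (a sketch; adjust the device names to your system):

umount /data                # if the array was mounted
mdadm --stop /dev/md0
for part in /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1; do
    mdadm --zero-superblock "$part"
done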

Replacing a bad partition/device

Suppose you forget to update your /etc/mdadm/mdadm.conf file after changing an array with mdadm, and then you reboot (like I did). You'll end up with an array that looks like this…

root@werewolf:~# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90
  Creation Time : Mon Aug 24 02:24:20 2009
     Raid Level : raid5
     Array Size : 2930279808 (2794.53 GiB 3000.61 GB)
  Used Dev Size : 976759936 (931.51 GiB 1000.20 GB)
   Raid Devices : 4
  Total Devices : 3
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Tue Aug 25 20:45:20 2009
          State : clean, degraded
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 128K

           UUID : 44062c56:b84201d7:bba6c3a9:4fef2f9c (local to host werewolf)
         Events : 0.24

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
       2       8       49        2      active sync   /dev/sdd1
       3       0        0        3      removed

This info tells me there are 4 RAID partitions on my system, but only 3 are associated with my /dev/md0 array. Those 3 are active and working, but the array is in a “clean, degraded” state (the data is consistent, but the array is running with one disk missing). The fourth partition isn't even part of the array. How do I add it back to the array? It's pretty simple.

root@werewolf:~# mdadm --re-add /dev/md0 /dev/sde1
mdadm: re-added /dev/sde1
root@werewolf:~#

That's it…the partition is added back because it was once part of the array and mdadm can recover it (i.e. bring it up to date on any changes that have been applied since it was disconnected from the array).
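
If --re-add is ever refused (say, because the partition's superblock is too far out of date for a quick catch-up), my understanding is that a plain --add still works; it just treats the partition as a brand-new member and forces a full rebuild:

mdadm /dev/md0 --add /dev/sde1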

So looking at the statistics, I can see the partition is back, and the array status has changed to “clean, degraded, recovering”.

root@werewolf:~# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90
  Creation Time : Mon Aug 24 02:24:20 2009
     Raid Level : raid5
     Array Size : 2930279808 (2794.53 GiB 3000.61 GB)
  Used Dev Size : 976759936 (931.51 GiB 1000.20 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Tue Aug 25 20:49:29 2009
          State : clean, degraded, recovering
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 128K

 Rebuild Status : 0% complete

           UUID : 44062c56:b84201d7:bba6c3a9:4fef2f9c (local to host werewolf)
         Events : 0.154

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
       2       8       49        2      active sync   /dev/sdd1
       4       8       65        3      spare rebuilding   /dev/sde1

My last point of interest is to see what's going on *right now* while the array is recovering…

root@werewolf:~# cat /proc/mdstat

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sde1[4] sdb1[0] sdc1[1] sdd1[2]
      2930279808 blocks level 5, 128k chunk, algorithm 2 [4/3] [UUU_]
      [========>............]  recovery = 40.1% (392130328/976759936) finish=146.4min speed=66540K/sec

unused devices: <none>
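
Rather than re-running that cat command by hand, watch will refresh it every few seconds so you can see the percentage climb (Ctrl-C to exit):

watch -n 5 cat /proc/mdstat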

Benchmarking for Performance

One thing I've learned by experience is that you should benchmark a filesystem before you start using it. This isn't such a big deal on regular desktop systems where the I/O load is fairly light. But on I/O-bound servers like a database or a media server, it really matters.

The Test

Wikipedia's Comparison of File Systems led me to 3 candidates for my media server: EXT3, JFS, and XFS. EXT3 is the default filesystem on most Linux distributions (as of summer 2009), and JFS and XFS get really good reviews on various forums. But which one performs best on a media server? I decided to run a common set of tests on all 3 filesystems to find out. I wrote a script that:

  1. formats the RAID array and captures statistics about the process
  2. mounts the array
  3. runs a comprehensive iozone benchmark suite (while collecting statistics)
  4. times the creation of a 5GB file
  5. times the deletion of the 5GB file

This should give me enough data to make an informed decision about the filesystem.

#! /bin/bash
# dotest.sh
# Chris Freyer (chris@thefreyers.net)
# Sept 8, 2009

outputdir=/root/raidtest

# GNU time reads its output format from the TIME environment variable,
# so this string controls what each /usr/bin/time call below records.
export TIME="time:%E, IOfaults:%F, #fs inputs:%I, #fs outputs:%O, CPU:%P, CPU sec:%S, #signals:%k"

# ----------------------------
# EXT3
# ----------------------------
umount /data
/usr/bin/time -o $outputdir/mkfs_ext3.txt mkfs.ext3 -q -b 4096 -E stride=32,stripe-width=96 /dev/md0
mount -t ext3 /dev/md0 /data
/usr/bin/time -o $outputdir/iozone_ext3.txt iozone -a -f /data/testfile.tmp -R -b $outputdir/iozone_ext3.xls
/usr/bin/time -o $outputdir/dd_ext3.txt dd if=/dev/zero of=/data/file.out bs=1MB count=5000
/usr/bin/time -o $outputdir/rm_ext3.txt rm /data/file.out

# ----------------------------
# JFS
# ----------------------------
umount /data
/usr/bin/time -o $outputdir/mkfs_jfs.txt mkfs.jfs -f -q  /dev/md0
mount -t jfs /dev/md0 /data
/usr/bin/time -o $outputdir/iozone_jfs.txt iozone -a -f /data/testfile.tmp -R -b $outputdir/iozone_jfs.xls
/usr/bin/time -o $outputdir/dd_jfs.txt dd if=/dev/zero of=/data/file.out bs=1MB count=5000
/usr/bin/time -o $outputdir/rm_jfs.txt rm /data/file.out

# ----------------------------
# XFS
# ----------------------------
umount /data
/usr/bin/time -o $outputdir/mkfs_xfs.txt mkfs.xfs -f -q  /dev/md0
mount -t xfs /dev/md0 /data
/usr/bin/time -o $outputdir/iozone_xfs.txt iozone -a -f /data/testfile.tmp -R -b $outputdir/iozone_xfs.xls
/usr/bin/time -o $outputdir/dd_xfs.txt dd if=/dev/zero of=/data/file.out bs=1MB count=5000
/usr/bin/time -o $outputdir/rm_xfs.txt rm /data/file.out
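
If you want to reproduce the tests, one way to run the script (assuming it's saved as dotest.sh, as in the header comment) is to create the output directory, make it executable, and capture the console output alongside the files it writes:

mkdir -p /root/raidtest
chmod +x dotest.sh
./dotest.sh 2>&1 | tee /root/raidtest/dotest.log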

Test Results

The script produced some really interesting statistics, which I'll summarize here.

FILESYSTEM CREATION
Measure               EXT3        JFS         XFS
Elapsed Time          16:01.71    0:03.93     0:08.42
Faults needing I/O    7           1           0
# filesystem inputs   944         128         280
# filesystem outputs  92,232,352  781,872     264,760
% CPU use (avg)       11%         29%         1%
# CPU seconds used    109.01      1.10        0.08

IOZONE EXECUTION
Measure               EXT3        JFS         XFS
Elapsed Time          13:27.74    10:23.15    10:57.16
Faults needing I/O    3           3           0
# filesystem inputs   576         656         992
# filesystem outputs  95,812,576  95,845,872  95,812,568
% CPU use (avg)       29%         27%         29%
# CPU seconds used    230.32      165.07      187.96

5GB FILE CREATION
Measure               EXT3        JFS         XFS
Elapsed Time          1:01.54     1:07.51     00:56.08
Faults needing I/O    0           0           5
# filesystem inputs   312         1200        560
# filesystem outputs  9,765,640   9,785,920   9,765,672
% CPU use (avg)       38%         20%         24%
# CPU seconds used    23.88       13.72       14.00

5GB FILE DELETION
Measure               EXT3        JFS         XFS
Elapsed Time          00:00.96    00:00.05    00:00.06
Faults needing I/O    0           0           2
# filesystem inputs   0           0           320
# filesystem outputs  0           0           0
% CPU use (avg)       98%         8%          0%
# CPU seconds used    0.95        0.00        0.00