Diff: HowToSoftwareRAID0.4xHOWTO

Differences between current version and previous revision of HowToSoftwareRAID0.4xHOWTO.


Newer page: version 3 Last edited on Tuesday, October 26, 2004 11:07:56 am by AristotlePagaltzis
Older page: version 2 Last edited on Friday, June 7, 2002 1:07:36 am by perry
@@ -1,3506 +1 @@
-  
-  
-  
-Software-RAID HOWTO  
-  
-  
-  
-----  
-  
-!!!Software-RAID HOWTO  
-  
-!!Linas Vepstas, linas@linas.org, v0.54, 21 November 1998  
-  
-  
-----  
-''RAID stands for ''Redundant Array of Inexpensive Disks'', and  
-is meant to be a way of creating a fast and reliable disk-drive  
-subsystem out of individual disks. RAID can guard against disk failure, and can also improve performance over that of a single disk drive.  
-This document is a tutorial/HOWTO/FAQ for users of  
-the Linux MD kernel extension, the associated tools, and their use.  
-The MD extension implements RAID-0 (striping), RAID-1 (mirroring),  
-RAID-4 and RAID-5 in software. That is, with MD, no special hardware  
-or disk controllers are required to get many of the benefits of RAID.''  
-----  
-  
-  
-  
-  
-; __Preamble__:  
-  
-This document is copyrighted and GPL'ed by Linas Vepstas (linas@linas.org).  
-Permission to use, copy, distribute this document for any purpose is  
-hereby granted, provided that the author's / editor's name and  
-this notice appear in all copies and/or supporting documents; and  
-that an unmodified version of this document is made freely available.  
-This document is distributed in the hope that it will be useful, but  
-WITHOUT ANY WARRANTY, either expressed or implied. While every effort  
-has been taken to ensure the accuracy of the information documented  
-herein, the author / editor / maintainer assumes NO RESPONSIBILITY  
-for any errors, or for any damages, direct or consequential, as a  
-result of the use of the information documented herein.  
-  
-  
-  
-  
-  
-__RAID, although designed to improve system reliability by adding  
-redundancy, can also lead to a false sense of security and confidence  
-when used improperly. This false confidence can lead to even greater  
-disasters. In particular, note that RAID is designed to protect against  
-*disk* failures, and not against *power* failures or *operator*  
-mistakes. Power failures, buggy development kernels, or operator/admin  
-errors can lead to damaged data that is not recoverable!  
-RAID is *not* a substitute for proper backup of your system.  
-Know what you are doing, test, be knowledgeable and aware!__  
-  
-  
-  
-  
-  
-!!1. Introduction  
-  
-  
-  
-  
-!!2. Understanding RAID  
-  
-  
-  
-  
-!!3. Setup & Installation Considerations  
-  
-  
-  
-  
-!!4. Error Recovery  
-  
-  
-  
-  
-!!5. Troubleshooting Install Problems  
-  
-  
-  
-  
-!!6. Supported Hardware & Software  
-  
-  
-  
-  
-!!7. Modifying an Existing Installation  
-  
-  
-  
-  
-!!8. Performance, Tools & General Bone-headed Questions  
-  
-  
-  
-  
-!!9. High Availability RAID  
-  
-  
-  
-  
-!!10. Questions Waiting for Answers  
-  
-  
-  
-  
-!!11. Wish List of Enhancements to MD and Related Software  
-----  
-  
-!!1. Introduction  
-  
-  
-  
-  
-  
-***#__Q__:  
-What is RAID?  
-  
-__A__:  
-RAID stands for "Redundant Array of Inexpensive Disks",  
-and is meant to be a way of creating a fast and reliable disk-drive  
-subsystem out of individual disks. In the PC world, "I" has come to  
-stand for "Independent", where marketing forces continue to  
-differentiate IDE and SCSI. In its original meaning, "I" meant  
-"Inexpensive as compared to refrigerator-sized mainframe  
-3380 DASD", monster drives which made nice houses look cheap,  
-and diamond rings look like trinkets.  
-  
-  
-***#  
-  
-***#__Q__:  
-What is this document?  
-  
-__A__:  
-This document is a tutorial/HOWTO/FAQ for users of the Linux MD  
-kernel extension, the associated tools, and their use.  
-The MD extension implements RAID-0 (striping), RAID-1 (mirroring),  
-RAID-4 and RAID-5 in software. That is, with MD, no special  
-hardware or disk controllers are required to get many of the  
-benefits of RAID.  
-  
-  
-This document is __NOT__ an introduction to RAID;  
-you must find this elsewhere.  
-  
-  
-***#  
-  
-***#__Q__:  
-What levels of RAID does the Linux kernel implement?  
-  
-__A__:  
-Striping (RAID-0) and linear concatenation are a part  
-of the stock 2.x series of kernels. This code is  
-of production quality; it is well understood and well  
-maintained. It is being used in some very large USENET  
-news servers.  
-  
-  
-RAID-1, RAID-4 & RAID-5 are a part of the 2.1.63 and greater  
-kernels. For earlier 2.0.x and 2.1.x kernels, patches exist  
-that will provide this function. Don't feel obligated to  
-upgrade to 2.1.63; upgrading the kernel is hard; it is *much*  
-easier to patch an earlier kernel. Most of the RAID user  
-community is running 2.0.x kernels, and that's where most  
-of the historic RAID development has focused. The current  
-snapshots should be considered near-production quality; that  
-is, there are no known bugs but there are some rough edges and  
-untested system setups. There are a large number of people  
-using Software RAID in a production environment.  
-  
-  
-  
-  
-  
-RAID-1 hot reconstruction has been recently introduced  
-(August 1997) and should be considered alpha quality.  
-RAID-5 hot reconstruction will be alpha quality any day now.  
-  
-  
-  
-  
-  
-A word of caution about the 2.1.x development kernels:  
-these are less than stable in a variety of ways. Some of  
-the newer disk controllers (e.g. the Promise Ultra's) are  
-supported only in the 2.1.x kernels. However, the 2.1.x  
-kernels have seen frequent changes in the block device driver,  
-in the DMA and interrupt code, in the PCI, IDE and SCSI code,  
-and in the disk controller drivers. The combination of  
-these factors, coupled to cheapo hard drives and/or  
-low-quality ribbon cables can lead to considerable  
-heartbreak. The ckraid tool, as well as  
-fsck and mount put considerable stress  
-on the RAID subsystem. This can lead to hard lockups  
-during boot, where even the magic alt-!SysReq key sequence  
-won't save the day. Use caution with the 2.1.x kernels,  
-and expect trouble. Or stick to the 2.0.34 kernel.  
-  
-  
-***#  
-  
-***#__Q__:  
-I'm running an older kernel. Where do I get patches?  
-  
-__A__:  
-Software RAID-0 and linear mode are a stock part of  
-all current Linux kernels. Patches for Software RAID-1,4,5  
-are available from  
-http://luthien.nuclecu.unam.mx/~miguel/raid.  
-See also the quasi-mirror  
-ftp://linux.kernel.org/pub/linux/daemons/raid/  
-for patches, tools and other goodies.  
-  
-  
-***#  
-  
-***#__Q__:  
-Are there other Linux RAID references?  
-  
-__A__:  
-  
-  
-***#*Generic RAID overview:  
-http://www.dpt.com/uraiddoc.html.  
-***#*  
-  
-***#*General Linux RAID options:  
-http://linas.org/linux/raid.html.  
-***#*  
-  
-***#*Latest version of this document:  
-http://linas.org/linux/Software-RAID/Software-RAID.html.  
-***#*  
-  
-***#*Linux-RAID mailing list archive:  
-http://www.linuxhq.com/lnxlists/.  
-***#*  
-  
-***#*Linux Software RAID Home Page:  
-http://luthien.nuclecu.unam.mx/~miguel/raid.  
-***#*  
-  
-***#*Linux Software RAID tools:  
-ftp://linux.kernel.org/pub/linux/daemons/raid/.  
-***#*  
-  
-***#*Setting up linear/striped Software RAID:  
-http://www.ssc.com/lg/issue17/raid.html.  
-***#*  
-  
-***#*Bootable RAID mini-HOWTO:  
-ftp://ftp.bizsystems.com/pub/raid/bootable-raid.  
-***#*  
-  
-***#*Root RAID HOWTO:  
-ftp://ftp.bizsystems.com/pub/raid/Root-RAID-HOWTO.  
-***#*  
-  
-***#*Linux RAID-Geschichten:  
-http://www.infodrom.north.de/~joey/Linux/raid/.  
-***#*  
-  
-  
-  
-***#  
-  
-***#__Q__:  
-Who do I blame for this document?  
-  
-__A__:  
-Linas Vepstas slapped this thing together.  
-However, most of the information,  
-and some of the words were supplied by  
-  
-  
-***#*Bradley Ward Allen <ulmo@Q.Net>  
-***#*  
-  
-***#*Luca Berra <bluca@comedia.it>  
-***#*  
-  
-***#*Brian Candler <B.Candler@pobox.com>  
-***#*  
-  
-***#*Bohumil Chalupa <bochal@apollo.karlov.mff.cuni.cz>  
-***#*  
-  
-***#*Rob Hagopian <hagopiar@vu.union.edu>  
-***#*  
-  
-***#*Anton Hristozov <anton@intransco.com>  
-***#*  
-  
-***#*Miguel de Icaza <miguel@luthien.nuclecu.unam.mx>  
-***#*  
-  
-***#*Marco Meloni <tonno@stud.unipg.it>  
-***#*  
-  
-***#*Ingo Molnar <mingo@pc7537.hil.siemens.at>  
-***#*  
-  
-***#*Alvin Oga <alvin@planet.fef.com>  
-***#*  
-  
-***#*Gadi Oxman <gadio@netvision.net.il>  
-***#*  
-  
-***#*Vaughan Pratt <pratt@cs.Stanford.EDU>  
-***#*  
-  
-***#*Steven A. Reisman <sar@pressenter.com>  
-***#*  
-  
-***#*Michael Robinton <michael@bzs.org>  
-***#*  
-  
-***#*Martin Schulze <joey@finlandia.infodrom.north.de>  
-***#*  
-  
-***#*Geoff Thompson <geofft@cs.waikato.ac.nz>  
-***#*  
-  
-***#*Edward Welbon <welbon@bga.com>  
-***#*  
-  
-***#*Rod Wilkens <rwilkens@border.net>  
-***#*  
-  
-***#*Johan Wiltink <j.m.wiltink@pi.net>  
-***#*  
-  
-***#*Leonard N. Zubkoff <lnz@dandelion.com>  
-***#*  
-  
-***#*Marc ZYNGIER <zyngier@ufr-info-p7.ibp.fr>  
-***#*  
-  
-  
-  
-__Copyrights__  
-  
-  
-***#*Copyright (C) 1994-96 Marc ZYNGIER  
-***#*  
-  
-***#*Copyright (C) 1997 Gadi Oxman, Ingo Molnar, Miguel de Icaza  
-***#*  
-  
-***#*Copyright (C) 1997, 1998 Linas Vepstas  
-***#*  
-  
-***#*By copyright law, additional copyrights are implicitly held  
-by the contributors listed above.  
-***#*  
-  
-  
-  
-Thanks all for being there!  
-  
-  
-***#  
-  
-----  
-  
-!!2. Understanding RAID  
-  
-  
-  
-  
-  
-***#__Q__:  
-What is RAID? Why would I ever use it?  
-  
-__A__:  
-RAID is a way of combining multiple disk drives into a single  
-entity to improve performance and/or reliability. There are  
-a variety of different types and implementations of RAID, each  
-with its own advantages and disadvantages. For example, by  
-putting a copy of the same data on two disks (called  
-__disk mirroring__, or RAID level 1), read performance can be  
-improved by reading alternately from each disk in the mirror.  
-On average, each disk is less busy, as it is handling only  
-1/2 the reads (for two disks), or 1/3 (for three disks), etc.  
-In addition, a mirror can improve reliability: if one disk  
-fails, the other disk(s) have a copy of the data. Different  
-ways of combining the disks into one, referred to as  
-__RAID levels__, can provide greater storage efficiency  
-than simple mirroring, or can alter latency (access-time)  
-performance, or throughput (transfer rate) performance, for  
-reading or writing, while still retaining redundancy that  
-is useful for guarding against failures.  
-  
-  
-__Although RAID can protect against disk failure, it does  
-not protect against operator and administrator (human)  
-error, or against loss due to programming bugs (possibly  
-due to bugs in the RAID software itself). The net abounds with  
-tragic tales of system administrators who have bungled a RAID  
-installation, and have lost all of their data. RAID is not a  
-substitute for frequent, regularly scheduled backup.__  
-  
-  
-RAID can be implemented  
-in hardware, in the form of special disk controllers, or in  
-software, as a kernel module that is layered in between the  
-low-level disk driver, and the file system which sits above it.  
-RAID hardware is always a "disk controller", that is, a device  
-to which one can cable up the disk drives. Usually it comes  
-in the form of an adapter card that will plug into a  
-ISA/EISA/PCI/S-Bus/!MicroChannel slot. However, some RAID  
-controllers are in the form of a box that connects into  
-the cable in between the usual system disk controller, and  
-the disk drives. Small ones may fit into a drive bay; large  
-ones may be built into a storage cabinet with its own drive  
-bays and power supply. The latest RAID hardware used with  
-the latest & fastest CPU will usually provide the best overall  
-performance, although at a significant price. This is because  
-most RAID controllers come with on-board DSP's and memory  
-cache that can off-load a considerable amount of processing  
-from the main CPU, as well as allow high transfer rates into  
-the large controller cache. Old RAID hardware can act as  
-a "de-accelerator" when used with newer CPU's: yesterday's  
-fancy DSP and cache can act as a bottleneck, and it's  
-performance is often beaten by pure-software RAID and new  
-but otherwise plain, run-of-the-mill disk controllers.  
-RAID hardware can offer an advantage over pure-software  
-RAID, if it can makes use of disk-spindle synchronization  
-and its knowledge of the disk-platter position with  
-regard to the disk head, and the desired disk-block.  
-However, most modern (low-cost) disk drives do not offer  
-this information and level of control anyway, and thus,  
-most RAID hardware does not take advantage of it.  
-RAID hardware is usually  
-not compatible across different brands, makes and models:  
-if a RAID controller fails, it must be replaced by another  
-controller of the same type. As of this writing (June 1998),  
-a broad variety of hardware controllers will operate under Linux;  
-however, none of them currently come with configuration  
-and management utilities that run under Linux.  
-  
-  
-Software-RAID is a set of kernel modules, together with  
-management utilities that implement RAID purely in software,  
-and require no extraordinary hardware. The Linux RAID subsystem  
-is implemented as a layer in the kernel that sits above the  
-low-level disk drivers (for IDE, SCSI and Paraport drives),  
-and the block-device interface. The filesystem, be it ext2fs,  
-DOS-FAT, or other, sits above the block-device interface.  
-Software-RAID, by its very software nature, tends to be more  
-flexible than a hardware solution. The downside is that it  
-of course requires more CPU cycles and power to run well  
-than a comparable hardware system. Of course, the cost  
-can't be beat. Software RAID has one further important  
-distinguishing feature: it operates on a partition-by-partition  
-basis, where a number of individual disk partitions are  
-ganged together to create a RAID partition. This is in  
-contrast to most hardware RAID solutions, which gang together  
-entire disk drives into an array. With hardware, the fact that  
-there is a RAID array is transparent to the operating system,  
-which tends to simplify management. With software, there  
-are far more configuration options and choices, tending to  
-complicate matters.  
-  
-  
-__As of this writing (June 1998), the administration of RAID  
-under Linux is far from trivial, and is best attempted by  
-experienced system administrators. The theory of operation  
-is complex. The system tools require modification to startup  
-scripts. And recovery from disk failure is non-trivial,  
-and prone to human error. RAID is not for the novice,  
-and any benefits it may bring to reliability and performance  
-can be easily outweighed by the extra complexity. Indeed,  
-modern disk drives are incredibly reliable and modern  
-CPU's and controllers are quite powerful. You might more  
-easily obtain the desired reliability and performance levels  
-by purchasing higher-quality and/or faster hardware.__  
-  
-  
-***#  
-  
-***#__Q__:  
-What are RAID levels? Why so many? What distinguishes them?  
-  
-__A__:  
-The different RAID levels have different performance,  
-redundancy, storage capacity, reliability and cost  
-characteristics. Most, but not all levels of RAID  
-offer redundancy against disk failure. Of those that  
-offer redundancy, RAID-1 and RAID-5 are the most popular.  
-RAID-1 offers better performance, while RAID-5 provides  
-for more efficient use of the available storage space.  
-However, tuning for performance is an entirely different  
-matter, as performance depends strongly on a large variety  
-of factors, from the type of application, to the sizes of  
-stripes, blocks, and files. The more difficult aspects of  
-performance tuning are deferred to a later section of this HOWTO.  
-  
-  
-The following describes the different RAID levels in the  
-context of the Linux software RAID implementation.  
-  
-  
-  
-  
-  
-***#*__RAID-linear__  
-is a simple concatenation of partitions to create  
-a larger virtual partition. It is handy if you have a number  
-small drives, and wish to create a single, large partition.  
-This concatenation offers no redundancy, and in fact  
-decreases the overall reliability: if any one disk  
-fails, the combined partition will fail.  
-  
-  
-  
-  
-  
-  
-  
-***#*  
-  
-***#*__RAID-1__ is also referred to as "mirroring".  
-Two (or more) partitions, all of the same size, each store  
-an exact copy of all data, disk-block by disk-block.  
-Mirroring gives strong protection against disk failure:  
-if one disk fails, there is another with an exact copy  
-of the same data. Mirroring can also help improve  
-performance in I/O-laden systems, as read requests can  
-be divided up between several disks. Unfortunately,  
-mirroring is also the least efficient in terms of storage:  
-two mirrored partitions can store no more data than a  
-single partition.  
-  
-  
-  
-  
-  
-  
-  
-***#*  
-  
-***#*__Striping__ is the underlying concept behind all of  
-the other RAID levels. A stripe is a contiguous sequence  
-of disk blocks. A stripe may be as short as a single disk  
-block, or may consist of thousands. The RAID drivers  
-split up their component disk partitions into stripes;  
-the different RAID levels differ in how they organize the  
-stripes, and what data they put in them. The interplay  
-between the size of the stripes, the typical size of files  
-in the file system, and their location on the disk is what  
-determines the overall performance of the RAID subsystem.  
-  
-  
-  
-  
-  
-  
-  
-***#*  
-  
-***#*__RAID-0__ is much like RAID-linear, except that  
-the component partitions are divided into stripes and  
-then interleaved. Like RAID-linear, the result is a single  
-larger virtual partition. Also like RAID-linear, it offers  
-no redundancy, and therefore decreases overall reliability:  
-a single disk failure will knock out the whole thing.  
-RAID-0 is often claimed to improve performance over the  
-simpler RAID-linear. However, this may or may not be true,  
-depending on the characteristics of the file system, the  
-typical size of the file as compared to the size of the  
-stripe, and the type of workload. The ext2fs  
-file system already scatters files throughout a partition,  
-in an effort to minimize fragmentation. Thus, at the  
-simplest level, any given access may go to one of several  
-disks, and thus, the interleaving of stripes across multiple  
-disks offers no apparent additional advantage. However,  
-there are performance differences, and they are data,  
-workload, and stripe-size dependent.  
-  
-  
-  
-  
-  
-  
-  
-***#*  
-  
-***#*__RAID-4__ interleaves stripes like RAID-0, but  
-it requires an additional partition to store parity  
-information. The parity is used to offer redundancy:  
-if any one of the disks fail, the data on the remaining disks  
-can be used to reconstruct the data that was on the failed  
-disk. Given N data disks, and one parity disk, the  
-parity stripe is computed by taking one stripe from each  
-of the data disks, and XOR'ing them together (a small worked example follows this list). Thus,  
-the storage capacity of an (N+1)-disk RAID-4 array  
-is N, which is a lot better than mirroring (N+1) drives,  
-and is almost as good as a RAID-0 setup for large N.  
-Note that for N=1, where there is one data drive, and one  
-parity drive, RAID-4 is a lot like mirroring, in that  
-each of the two disks is a copy of each other. However,  
-RAID-4 does __NOT__ offer the read-performance  
-of mirroring, and offers considerably degraded write  
-performance. In brief, this is because updating the  
-parity requires a read of the old parity, before the new  
-parity can be calculated and written out. In an  
-environment with lots of writes, the parity disk can become  
-a bottleneck, as each write must access the parity disk.  
-  
-  
-  
-  
-  
-  
-  
-***#*  
-  
-***#*__RAID-5__ avoids the write-bottleneck of RAID-4  
-by alternately storing the parity stripe on each of the  
-drives. However, write performance is still not as good  
-as for mirroring, as the parity stripe must still be read  
-and XOR'ed before it is written. Read performance is  
-also not as good as it is for mirroring, as, after all,  
-there is only one copy of the data, not two or more.  
-RAID-5's principle advantage over mirroring is that it  
-offers redundancy and protection against single-drive  
-failure, while offering far more storage capacity when  
-used with three or more drives.  
-  
-  
-  
-  
-  
-  
-  
-***#*  
-  
-***#*__RAID-2 and RAID-3__ are seldom used anymore, and  
-to some degree have been made obsolete by modern disk  
-technology. RAID-2 is similar to RAID-4, but stores  
-ECC information instead of parity. Since all modern disk  
-drives incorporate ECC under the covers, this offers  
-little additional protection. RAID-2 can offer greater  
-data consistency if power is lost during a write; however,  
-battery backup and a clean shutdown can offer the same  
-benefits. RAID-3 is similar to RAID-4, except that it  
-uses the smallest possible stripe size. As a result, any  
-given read will involve all disks, making overlapping  
-I/O requests difficult/impossible. In order to avoid  
-delay due to rotational latency, RAID-3 requires that  
-all disk drive spindles be synchronized. Most modern  
-disk drives lack spindle-synchronization ability, or,  
-if capable of it, lack the needed connectors, cables,  
-and manufacturer documentation. Neither RAID-2 nor RAID-3  
-are supported by the Linux Software-RAID drivers.  
-  
-  
-  
-  
-  
-  
-  
-***#*  
-  
-***#*__Other RAID levels__ have been defined by various  
-researchers and vendors. Many of these represent the  
-layering of one type of raid on top of another. Some  
-require special hardware, and others are protected by  
-patent. There is no commonly accepted naming scheme  
-for these other levels. Sometimes the advantages of these  
-other systems are minor, or at least not apparent  
-until the system is highly stressed. Except for the  
-layering of RAID-1 over RAID-0/linear, Linux Software  
-RAID does not support any of the other variations.  
-  
-  
-  
-  
-***#*  
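-  
-To make the parity arithmetic concrete, here is a tiny worked example in shell (a sketch only: the byte values are invented, and a real array of course computes parity over whole stripes, not single bytes):  
-  
-# XOR parity across three hypothetical data bytes  
-d1=$(( 0x0F )) ; d2=$(( 0x33 )) ; d3=$(( 0x55 ))  
-parity=$(( d1 ^ d2 ^ d3 ))  
-# if the disk holding d2 fails, its contents can be rebuilt  
-# from the parity and the surviving data:  
-rebuilt=$(( parity ^ d1 ^ d3 ))  
-printf 'parity=0x%02X  rebuilt d2=0x%02X\n' $parity $rebuilt  
-  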
-  
-  
-  
-***#  
-  
-----  
-  
-!!3. Setup & Installation Considerations  
-  
-  
-  
-  
-  
-***#__Q__:  
-What is the best way to configure Software RAID?  
-  
-__A__:  
-I keep rediscovering that file-system planning is one  
-of the more difficult Unix configuration tasks.  
-To answer your question, I can describe what we did.  
-We planned the following setup:  
-  
-  
-***#*two EIDE disks, 2.1 gig each.  
-  
-  
-disk  partition  mount pt.  size  device  
-1     1          /          300M  /dev/hda1  
-1     2          swap       64M   /dev/hda2  
-1     3          /home      800M  /dev/hda3  
-1     4          /var       900M  /dev/hda4  
-2     1          /root      300M  /dev/hdc1  
-2     2          swap       64M   /dev/hdc2  
-2     3          /home      800M  /dev/hdc3  
-2     4          /var       900M  /dev/hdc4  
-  
-  
-  
-***#*  
-  
-***#*Each disk is on a separate controller (& ribbon cable).  
-The theory is that a controller failure and/or  
-ribbon failure won't disable both disks.  
-Also, we might possibly get a performance boost  
-from parallel operations over two controllers/cables.  
-  
-***#*  
-  
-***#*Install the Linux kernel on the root (/)  
-partition /dev/hda1. Mark this partition as  
-bootable.  
-  
-***#*  
-  
-***#*/dev/hdc1 will contain a ``cold'' copy of  
-/dev/hda1. This is NOT a raid copy,  
-just a plain old copy-copy. It's there just in  
-case the first disk fails; we can use a rescue disk,  
-mark /dev/hdc1 as bootable, and use that to  
-keep going without having to reinstall the system.  
-You may even want to put /dev/hdc1's copy  
-of the kernel into LILO to simplify booting in case of  
-failure.  
-The theory here is that in case of severe failure,  
-I can still boot the system without worrying about  
-raid superblock-corruption or other raid failure modes  
-& gotchas that I don't understand.  
-  
-***#*  
-  
-***#*/dev/hda3 and /dev/hdc3 will be mirrored as  
-/dev/md0.  
-***#*  
-  
-***#*/dev/hda4 and /dev/hdc4 will be mirrored as  
-/dev/md1.  
-  
-***#*  
-  
-***#*we picked /var and /home to be mirrored,  
-and in separate partitions, using the following logic:  
-  
-  
-***#**/ (the root partition) will contain  
-relatively static, non-changing data:  
-for all practical purposes, it will be  
-read-only without actually being marked &  
-mounted read-only.  
-***#**  
-  
-***#**/home will contain ''slowly'' changing  
-data.  
-***#**  
-  
-***#**/var will contain rapidly changing data,  
-including mail spools, database contents and  
-web server logs.  
-***#**  
-  
-The idea behind using multiple, distinct partitions is  
-that __if__, for some bizarre reason,  
-whether it is human error, power loss, or an operating  
-system gone wild, corruption is limited to one partition.  
-In one typical case, power is lost while the  
-system is writing to disk. This will almost certainly  
-lead to a corrupted filesystem, which will be repaired  
-by fsck during the next boot. Although  
-fsck does its best to make the repairs  
-without creating additional damage during those repairs,  
-it can be comforting to know that any such damage has been  
-limited to one partition. In another typical case,  
-the sysadmin makes a mistake during rescue operations,  
-leading to erased or destroyed data. Partitions can  
-help limit the repercussions of the operator's errors.  
-***#*  
-  
-***#*Other reasonable choices for partitions might be  
-/usr or /opt. In fact, /opt  
-and /home would make great choices for RAID-5  
-partitions if we had more disks. A word of caution:  
-__DO NOT__ put /usr in a RAID-5  
-partition. If a serious fault occurs, you may find  
-that you cannot mount /usr, and that  
-you want some of the tools on it (e.g. the networking  
-tools, or the compiler.) With RAID-1, if a fault has  
-occurred, and you can't get RAID to work, you can at  
-least mount one of the two mirrors. You can't do this  
-with any of the other RAID levels (RAID-5, striping, or  
-linear append).  
-  
-***#*  
-  
-  
-  
-So, to complete the answer to the question:  
-  
-  
-***#*install the OS on disk 1, partition 1.  
-do NOT mount any of the other partitions.  
-***#*  
-  
-***#*install RAID per instructions.  
-***#*  
-  
-***#*configure md0 and md1.  
-***#*  
-  
-***#*convince yourself that you know  
-what to do in case of a disk failure!  
-Discover sysadmin mistakes now,  
-and not during an actual crisis.  
-Experiment!  
-(we turned off power during disk activity;  
-this proved to be ugly but informative).  
-***#*  
-  
-***#*do some ugly mount/copy/unmount/rename/reboot scheme to  
-move /var over to /dev/md1 (a sketch of one such sequence follows this list).  
-Done carefully, this is not dangerous.  
-***#*  
-  
-***#*enjoy!  
-***#*  
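-  
-One way to perform that copy step, sketched under the assumption that /dev/md1 is already running and that the machine has been dropped to single-user mode so nothing is writing to /var:  
-  
-mke2fs /dev/md1              # create a file system on the new mirror  
-mount /dev/md1 /mnt  
-cp -a /var/. /mnt/           # copy everything, preserving permissions and dotfiles  
-umount /mnt  
-mv /var /var.old             # keep the original until the mirror has proven itself  
-mkdir /var  
-mount /dev/md1 /var          # and add /dev/md1 /var to /etc/fstab before rebooting  
-  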
-  
-  
-  
-***#  
-  
-***#__Q__:  
-What is the difference between the mdadd, mdrun,  
-''etc.'' commands, and the raidadd, raidrun  
-commands?  
-  
-__A__:  
-The names of the tools have changed as of the 0.5 release of the  
-raidtools package. The md naming convention was used  
-in the 0.43 and older versions, while raid is used in  
-0.5 and newer versions.  
-  
-  
-***#  
-  
-***#__Q__:  
-I want to run RAID-linear/RAID-0 in the stock 2.0.34 kernel.  
-I don't want to apply the raid patches, since these are not  
-needed for RAID-0/linear. Where can I get the raid-tools  
-to manage this?  
-  
-__A__:  
-This is a tough question, indeed, as the newest raid tools  
-package needs to have the RAID-1,4,5 kernel patches installed  
-in order to compile. I am not aware of any pre-compiled, binary  
-version of the raid tools that is available at this time.  
-However, experiments show that the raid-tools binaries, when  
-compiled against kernel 2.1.100, seem to work just fine  
-in creating a RAID-0/linear partition under 2.0.34. A brave  
-soul has asked for these, and I've __temporarily__  
-placed the binaries mdadd, mdcreate, etc.  
-at http://linas.org/linux/Software-RAID/  
-You must get the man pages, etc. from the usual raid-tools  
-package.  
-  
-  
-***#  
-  
-***#__Q__:  
-Can I stripe/mirror the root partition (/)?  
-Why can't I boot Linux directly from the md disks?  
-  
-__A__:  
-Both LILO and Loadlin need a non-striped/mirrored partition  
-to read the kernel image from. If you want to stripe/mirror  
-the root partition (/),  
-then you'll want to create an unstriped/mirrored partition  
-to hold the kernel(s).  
-Typically, this partition is named /boot.  
-Then you either use the initial ramdisk support (initrd),  
-or patches from Harald Hoyer <HarryH@Royal.Net>  
-that allow a striped partition to be used as the root  
-device. (These patches are now a standard part of recent  
-2.1.x kernels)  
-  
-  
-There are several approaches that can be used.  
-One approach is documented in detail in the  
-Bootable RAID mini-HOWTO:  
-ftp://ftp.bizsystems.com/pub/raid/bootable-raid.  
-  
-  
-  
-  
-  
-Alternately, use mkinitrd to build the ramdisk image,  
-see below.  
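-  
-For example, on distributions that ship a mkinitrd script, something along these lines builds a suitable image (the exact options and paths vary between distributions, so treat this purely as a sketch):  
-  
-mkinitrd /boot/initrd-raid.img 2.0.35   # build an initrd for the 2.0.35 kernel  
-# then add initrd=/boot/initrd-raid.img to the kernel stanza in /etc/lilo.conf  
-lilo  
-  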
-  
-  
-  
-  
-  
-Edward Welbon <welbon@bga.com>  
-writes:  
-  
-  
-***#*... all that is needed is a script to manage the boot setup.  
-To mount an md filesystem as root,  
-the main thing is to build an initial file system image  
-that has the needed modules and md tools to start md.  
-I have a simple script that does this.  
-***#*  
-  
-  
-  
-***#*For boot media, I have a small __cheap__ SCSI disk  
-(170MB; I got it used for $20).  
-This disk runs on an AHA1452, but it could just as well be an  
-inexpensive IDE disk on the native IDE.  
-The disk need not be very fast since it is mainly for boot.  
-***#*  
-  
-  
-  
-***#*This disk has a small file system which contains the kernel and  
-the file system image for initrd.  
-The initial file system image has just enough stuff to allow me  
-to load the raid SCSI device driver module and start the  
-raid partition that will become root.  
-I then do an  
-  
-  
-echo 0x900 > /proc/sys/kernel/real-root-dev  
-  
-  
-(0x900 is for /dev/md0)  
-and exit linuxrc.  
-The boot proceeds normally from there.  
-***#*  
-  
-  
-  
-***#*I have built most support as a module except for the AHA1452  
-driver that brings in the initrd filesystem.  
-So I have a fairly small kernel. The method is perfectly  
-reliable, I have been doing this since before 2.1.26 and  
-have never had a problem that I could not easily recover from.  
-The file systems even survived several 2.1.44/45 hard  
-crashes with no real problems.  
-***#*  
-  
-  
-  
-***#*At one time I had partitioned the raid disks so that the initial  
-cylinders of the first raid disk held the kernel and the initial  
-cylinders of the second raid disk held the initial file system  
-image; instead, I made the initial cylinders of the raid disks  
-swap, since they are the fastest cylinders  
-(why waste them on boot?).  
-***#*  
-  
-  
-  
-***#*The nice thing about having an inexpensive device dedicated to  
-boot is that it is easy to boot from and can also serve as  
-a rescue disk if necessary. If you are interested,  
-you can take a look at the script that builds my initial  
-ram disk image and then runs LILO.  
-  
-http://www.realtime.net/~welbon/initrd.md.tar.gz  
-It is current enough to show the picture.  
-It isn't especially pretty and it could certainly build  
-a much smaller filesystem image for the initial ram disk.  
-It would be easy to make it more efficient.  
-But it uses LILO as is.  
-If you make any improvements, please forward a copy to me. 8-)  
-***#*  
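-  
-A minimal linuxrc along the lines Edward describes might look roughly as follows. This is only a sketch: the module path, device names and RAID level are assumptions, and the md tools and module must have been copied into the initrd image beforehand.  
-  
-#!/bin/sh  
-# load the RAID personality and assemble the array that will become root  
-/sbin/insmod /lib/raid1.o  
-/sbin/mdadd /dev/md0 /dev/sda1 /dev/sdb1  
-/sbin/mdrun -p1 /dev/md0  
-# 0x900 is the device number of /dev/md0; the kernel mounts it as root on exit  
-echo 0x900 > /proc/sys/kernel/real-root-dev  
-  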
-  
-  
-  
-***#  
-  
-***#__Q__:  
-I have heard that I can run mirroring over striping. Is this true?  
-Can I run mirroring over the loopback device?  
-  
-__A__:  
-Yes, but not the reverse. That is, you can put a stripe over  
-several disks, and then build a mirror on top of this. However,  
-striping cannot be put on top of mirroring.  
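-  
-A sketch of such a layered setup with four disks, assuming that -p0 selects the striping personality (by analogy with the -p1 and -p5 flags used elsewhere in this document):  
-  
-# two stripe sets...  
-mdadd /dev/md0 /dev/sda1 /dev/sdb1  
-mdrun -p0 /dev/md0  
-mdadd /dev/md1 /dev/sdc1 /dev/sdd1  
-mdrun -p0 /dev/md1  
-# ...mirrored against each other  
-mdadd /dev/md2 /dev/md0 /dev/md1  
-mdrun -p1 /dev/md2  
-mke2fs /dev/md2  
-  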
-  
-  
-A brief technical explanation is that the linear and stripe  
-personalities use the ll_rw_blk routine for access.  
-The ll_rw_blk routine  
-maps disk devices and sectors, not blocks. Block devices can be  
-layered one on top of the other; but devices that do raw, low-level  
-disk accesses, such as ll_rw_blk, cannot.  
-  
-  
-  
-  
-  
-Currently (November 1997) RAID cannot be run over the  
-loopback devices, although this should be fixed shortly.  
-  
-  
-***#  
-  
-***#__Q__:  
-I have two small disks and three larger disks. Can I  
-concatenate the two smaller disks with RAID-0, and then create  
-a RAID-5 out of that and the larger disks?  
-  
-__A__:  
-Currently (November 1997), for a RAID-5 array, no.  
-Currently, one can do this only for a RAID-1 on top of the  
-concatenated drives.  
-  
-  
-***#  
-  
-***#__Q__:  
-What is the difference between RAID-1 and RAID-5 for a two-disk  
-configuration (i.e. the difference between a RAID-1 array built  
-out of two disks, and a RAID-5 array built out of two disks)?  
-  
-__A__:  
-There is no difference in storage capacity. Nor can disks be  
-added to either array to increase capacity (see the question below for  
-details).  
-  
-  
-RAID-1 offers a performance advantage for reads: the RAID-1  
-driver uses distributed-read technology to simultaneously read  
-two sectors, one from each drive, thus doubling read performance.  
-  
-  
-  
-  
-  
-The RAID-5 driver, although it contains many optimizations, does not  
-currently (September 1997) realize that the parity disk is actually  
-a mirrored copy of the data disk. Thus, it serializes data reads.  
-  
-  
-***#  
-  
-***#__Q__:  
-How can I guard against a two-disk failure?  
-  
-__A__:  
-Some of the RAID algorithms do guard against multiple disk  
-failures, but these are not currently implemented for Linux.  
-However, the Linux Software RAID can guard against multiple  
-disk failures by layering an array on top of an array. For  
-example, nine disks can be used to create three raid-5 arrays.  
-Then these three arrays can in turn be hooked together into  
-a single RAID-5 array on top. In fact, this kind of a  
-configuration will guard against a three-disk failure. Note that  
-a large amount of disk space is ''wasted'' on the redundancy  
-information.  
-  
-  
-For an NxN raid-5 array,  
-N=3, 5 out of 9 disks are used for parity (=55%)  
-N=4, 7 out of 16 disks  
-N=5, 9 out of 25 disks  
-...  
-N=9, 17 out of 81 disks (=~20%)  
-  
-  
-In general, an MxN array will use M+N-1 disks for parity.  
-The least amount of space is "wasted" when M=N.  
-  
-  
-Another alternative is to create a RAID-1 array with  
-three disks. Note that since all three disks contain  
-identical data, that 2/3's of the space is ''wasted''.  
-  
-  
-  
-  
-  
-***#  
-  
-***#__Q__:  
-I'd like to understand how it'd be possible to have something  
-like fsck: if the partition hasn't been cleanly unmounted,  
-fsck runs and fixes the filesystem by itself more than  
-90% of the time. Since the machine is capable of fixing it  
-by itself with ckraid --fix, why not make it automatic?  
-  
-__A__:  
-This can be done by adding lines like the following to  
-/etc/rc.d/rc.sysinit:  
-  
-mdadd /dev/md0 /dev/hda1 /dev/hdc1 || {  
-ckraid --fix /etc/raid.usr.conf  
-mdadd /dev/md0 /dev/hda1 /dev/hdc1  
-}  
-  
-or  
-  
-mdrun -p1 /dev/md0  
-if [[ $? -gt 0 ] ; then  
-ckraid --fix /etc/raid1.conf  
-mdrun -p1 /dev/md0  
-fi  
-  
-Before presenting a more complete and reliable script,  
-lets review the theory of operation.  
-Gadi Oxman writes:  
-In an unclean shutdown, Linux might be in one of the following states:  
-  
-  
-***#*The in-memory disk cache was in sync with the RAID set when  
-the unclean shutdown occurred; no data was lost.  
-  
-***#*  
-  
-***#*The in-memory disk cache was newer than the RAID set contents  
-when the crash occurred; this results in a corrupted filesystem  
-and potentially in data loss.  
-This state can be further divided to the following two states:  
-  
-  
-***#**Linux was writing data when the unclean shutdown occurred.  
-***#**  
-  
-***#**Linux was not writing data when the crash occurred.  
-***#**  
-  
-  
-***#*  
-  
-Suppose we were using a RAID-1 array. In (2a), it might happen that  
-before the crash, a small number of data blocks were successfully  
-written only to some of the mirrors, so that on the next reboot,  
-the mirrors will no longer contain the same data.  
-If we were to ignore the mirror differences, the raidtools-0.36.3  
-read-balancing code  
-might choose to read the above data blocks from any of the mirrors,  
-which will result in inconsistent behavior (for example, the output  
-of e2fsck -n /dev/md0 can differ from run to run).  
-  
-  
-Since RAID doesn't protect against unclean shutdowns, usually  
-there isn't any ''obviously correct'' way to fix the mirror  
-differences and the filesystem corruption.  
-  
-  
-For example, by default ckraid --fix will choose  
-the first operational mirror and update the other mirrors  
-with its contents. However, depending on the exact timing  
-at the crash, the data on another mirror might be more recent,  
-and we might want to use it as the source  
-mirror instead, or perhaps use another method for recovery.  
-  
-  
-The following script provides one of the more robust  
-boot-up sequences. In particular, it guards against  
-long, repeated ckraid's in the presence  
-of uncooperative disks, controllers, or controller device  
-drivers. Modify it to reflect your config,  
-and copy it to rc.raid.init. Then invoke  
-rc.raid.init after the root partition has been  
-fsck'ed and mounted rw, but before the remaining partitions  
-are fsck'ed. Make sure the current directory is in the search  
-path.  
-  
-mdadd /dev/md0 /dev/hda1 /dev/hdc1 || {  
-rm -f /fastboot # force an fsck to occur  
-ckraid --fix /etc/raid.usr.conf  
-mdadd /dev/md0 /dev/hda1 /dev/hdc1  
-}  
-# if a crash occurs later in the boot process,  
-# we at least want to leave this md in a clean state.  
-/sbin/mdstop /dev/md0  
-mdadd /dev/md1 /dev/hda2 /dev/hdc2 || {  
-rm -f /fastboot # force an fsck to occur  
-ckraid --fix /etc/raid.home.conf  
-mdadd /dev/md1 /dev/hda2 /dev/hdc2  
-}  
-# if a crash occurs later in the boot process,  
-# we at least want to leave this md in a clean state.  
-/sbin/mdstop /dev/md1  
-mdadd /dev/md0 /dev/hda1 /dev/hdc1  
-mdrun -p1 /dev/md0  
-if [[ $? -gt 0 ] ; then  
-rm -f /fastboot # force an fsck to occur  
-ckraid --fix /etc/raid.usr.conf  
-mdrun -p1 /dev/md0  
-fi  
-# if a crash occurs later in the boot process,  
-# we at least want to leave this md in a clean state.  
-/sbin/mdstop /dev/md0  
-mdadd /dev/md1 /dev/hda2 /dev/hdc2  
-mdrun -p1 /dev/md1  
-if [[ $? -gt 0 ] ; then  
-rm -f /fastboot # force an fsck to occur  
-ckraid --fix /etc/raid.home.conf  
-mdrun -p1 /dev/md1  
-fi  
-# if a crash occurs later in the boot process,  
-# we at least want to leave this md in a clean state.  
-/sbin/mdstop /dev/md1  
-# OK, just blast through the md commands now. If there were  
-# errors, the above checks should have fixed things up.  
-/sbin/mdadd /dev/md0 /dev/hda1 /dev/hdc1  
-/sbin/mdrun -p1 /dev/md0  
-/sbin/mdadd /dev/md1 /dev/hda2 /dev/hdc2  
-/sbin/mdrun -p1 /dev/md1  
-  
-In addition to the above, you'll want to create a  
-rc.raid.halt which should look like the following:  
-  
-/sbin/mdstop /dev/md0  
-/sbin/mdstop /dev/md1  
-  
-Be sure to modify both rc.sysinit and  
-init.d/halt to include this everywhere that  
-filesystems get unmounted before a halt/reboot. (Note  
-that rc.sysinit unmounts and reboots if fsck  
-returned with an error.)  
-  
-  
-  
-  
-  
-***#  
-  
-***#__Q__:  
-Can I set up one-half of a RAID-1 mirror with the one disk I have  
-now, and then later get the other disk and just drop it in?  
-  
-__A__:  
-With the current tools, no, not in any easy way. In particular,  
-you cannot just copy the contents of one disk onto another,  
-and then pair them up. This is because the RAID drivers  
-use a glob of space at the end of the partition to store the  
-superblock. This decreases the amount of space available to  
-the file system slightly; if you just naively try to force  
-a RAID-1 arrangement onto a partition with an existing  
-filesystem, the  
-raid superblock will overwrite a portion of the file system  
-and mangle data. Since the ext2fs filesystem scatters  
-files randomly throughout the partition (in order to avoid  
-fragmentation), there is a very good chance that some file will  
-land at the very end of a partition long before the disk is  
-full.  
-  
-  
-If you are clever, I suppose you can calculate how much room  
-the RAID superblock will need, and make your filesystem  
-slightly smaller, leaving room for it when you add it later.  
-But then, if you are this clever, you should also be able to  
-modify the tools to do this automatically for you.  
-(The tools are not terribly complex).  
-  
-  
-  
-  
-  
-__Note:__ A careful reader has pointed out that the  
-following trick may work; I have not tried or verified this:  
-Do the mkraid with /dev/null as one of the  
-devices. Then mdadd -r with only the single, true  
-disk (do not mdadd /dev/null). The mkraid  
-should have successfully built the raid array, while the  
-mdadd step just forces the system to run in "degraded" mode,  
-as if one of the disks had failed.  
-  
-  
-***#  
-  
-----  
-  
-!!4. Error Recovery  
-  
-  
-  
-  
-  
-***#__Q__:  
-I have a RAID-1 (mirroring) setup, and lost power  
-while there was disk activity. Now what do I do?  
-  
-__A__:  
-The redundancy of RAID levels is designed to protect against a  
-__disk__ failure, not against a __power__ failure.  
-There are several ways to recover from this situation.  
-  
-  
-***#*Method (1): Use the raid tools. These can be used to sync  
-the raid arrays. They do not fix file-system damage; after  
-the raid arrays are sync'ed, then the file-system still has  
-to be fixed with fsck. Raid arrays can be checked with  
-ckraid /etc/raid1.conf (for RAID-1, else,  
-/etc/raid5.conf, etc.)  
-Calling ckraid /etc/raid1.conf --fix will pick one of the  
-disks in the array (usually the first), and use that as the  
-master copy, and copy its blocks to the others in the mirror.  
-To designate which of the disks should be used as the master,  
-you can use the --force-source flag: for example,  
-ckraid /etc/raid1.conf --fix --force-source /dev/hdc3  
-The ckraid command can be safely run without the --fix  
-option  
-to verify the inactive RAID array without making any changes.  
-When you are comfortable with the proposed changes, supply  
-the --fix option.  
-  
-***#*  
-  
-***#*Method (2): Paranoid, time-consuming, not much better than the  
-first way. Lets assume a two-disk RAID-1 array, consisting of  
-partitions /dev/hda3 and /dev/hdc3. You can  
-try the following:  
-  
-  
-***#*#fsck /dev/hda3  
-***#*#  
-  
-***#*#fsck /dev/hdc3  
-***#*#  
-  
-***#*#decide which of the two partitions had fewer errors,  
-or were more easily recovered, or recovered the data  
-that you wanted. Pick one, either one, to be your new  
-``master'' copy. Say you picked /dev/hdc3.  
-***#*#  
-  
-***#*#dd if=/dev/hdc3 of=/dev/hda3  
-***#*#  
-  
-***#*#mkraid raid1.conf -f --only-superblock  
-***#*#  
-  
-Instead of the last two steps, you can instead run  
-ckraid /etc/raid1.conf --fix --force-source /dev/hdc3  
-which should be a bit faster.  
-  
-***#*  
-  
-***#*Method (3): Lazy man's version of above. If you don't want to  
-wait for long fsck's to complete, it is perfectly fine to skip  
-the first three steps above, and move directly to the last  
-two steps.  
-Just be sure to run fsck /dev/md0 after you are done.  
-Method (3) is actually just method (1) in disguise.  
-***#*  
-  
-In any case, the above steps will only sync up the raid arrays.  
-The file system probably needs fixing as well: for this,  
-fsck needs to be run on the active, unmounted md device.  
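-  
-For example, for a RAID-1 array built from /dev/hda3 and /dev/hdc3 (a sketch; substitute your own devices and mount point):  
-  
-mdadd /dev/md0 /dev/hda3 /dev/hdc3  
-mdrun -p1 /dev/md0  
-fsck /dev/md0     # repair the file system on the assembled, still-unmounted array  
-mount /dev/md0 /home  
-  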
-  
-  
-With a three-disk RAID-1 array, there are more possibilities,  
-such as using two disks to ''vote'' a majority answer. Tools  
-to automate this do not currently (September 97) exist.  
-  
-  
-***#  
-  
-***#__Q__:  
-I have a RAID-4 or a RAID-5 (parity) setup, and lost power while  
-there was disk activity. Now what do I do?  
-  
-__A__:  
-The redundancy of RAID levels is designed to protect against a  
-__disk__ failure, not against a __power__ failure.  
-Since the disks in a RAID-4 or RAID-5 array do not contain a file  
-system that fsck can read, there are fewer repair options. You  
-cannot use fsck to do preliminary checking and/or repair; you must  
-use ckraid first.  
-  
-  
-The ckraid command can be safely run without the  
---fix option  
-to verify the inactive RAID array without making any changes.  
-When you are comfortable with the proposed changes, supply  
-the --fix option.  
-  
-  
-  
-  
-  
-If you wish, you can try designating one of the disks as a ''failed  
-disk''. Do this with the --suggest-failed-disk-mask flag.  
-  
-  
-Only one bit should be set in the flag: RAID-5 cannot recover two  
-failed disks.  
-The mask is a binary bit mask: thus:  
-  
-0x1 == first disk  
-0x2 == second disk  
-0x4 == third disk  
-0x8 == fourth disk, etc.  
-  
-  
-  
-Alternately, you can choose to modify the parity sectors, by using  
-the --suggest-fix-parity flag. This will recompute the  
-parity from the other sectors.  
-  
-  
-  
-  
-  
-The flags --suggest-failed-disk-mask and  
---suggest-fix-parity  
-can be safely used for verification. No changes are made if the  
---fix flag is not specified. Thus, you can experiment with  
-different possible repair schemes.  
-  
-  
-  
-  
-  
-***#  
-  
-***#__Q__:  
-My RAID-1 device, /dev/md0 consists of two hard drive  
-partitions: /dev/hda3 and /dev/hdc3.  
-Recently, the disk with /dev/hdc3 failed,  
-and was replaced with a new disk. My best friend,  
-who doesn't understand RAID, said that the correct thing to do now  
-is to ''dd if=/dev/hda3 of=/dev/hdc3''.  
-I tried this, but things still don't work.  
-  
-__A__:  
-You should keep your best friend away from your computer.  
-Fortunately, no serious damage has been done.  
-You can recover from this by running:  
-  
-  
-mkraid raid1.conf -f --only-superblock  
-  
-  
-By using dd, two identical copies of the partition  
-were created. This is almost correct, except that the RAID-1  
-kernel extension expects the RAID superblocks to be different.  
-Thus, when you try to reactivate RAID, the software will notice  
-the problem, and deactivate one of the two partitions.  
-By re-creating the superblock, you should have a fully usable  
-system.  
-  
-  
-***#  
-  
-***#__Q__:  
-My version of mkraid doesn't have a  
---only-superblock flag. What do I do?  
-  
-__A__:  
-The newer tools drop support for this flag, replacing it with  
-the --force-resync flag. It has been reported  
-that the following sequence appears to work with the latest tools  
-and software:  
-  
-  
-umount /web (or wherever /dev/md0 was mounted)  
-raidstop /dev/md0  
-mkraid /dev/md0 --force-resync --really-force  
-raidstart /dev/md0  
-  
-  
-After doing this, a cat /proc/mdstat should report  
-resync in progress, and one should be able to  
-mount /dev/md0 at this point.  
-  
-  
-***#  
-  
-***#__Q__:  
-My RAID-1 device, /dev/md0 consists of two hard drive  
-partitions: /dev/hda3 and /dev/hdc3.  
-My best (girl?)friend, who doesn't understand RAID,  
-ran fsck on /dev/hda3 while I wasn't looking,  
-and now the RAID won't work. What should I do?  
-  
-__A__:  
-You should re-examine your concept of ``best friend''.  
-In general, fsck should never be run on the individual  
-partitions that compose a RAID array.  
-Assuming that neither of the partitions are/were heavily damaged,  
-no data loss has occurred, and the RAID-1 device can be recovered  
-as follows:  
-  
-  
-***##make a backup of the file system on /dev/hda3  
-***##  
-  
-***##dd if=/dev/hda3 of=/dev/hdc3  
-***##  
-  
-***##mkraid raid1.conf -f --only-superblock  
-***##  
-  
-This should leave you with a working disk mirror.  
-  
-  
-***#  
-  
-***#__Q__:  
-Why does the above work as a recovery procedure?  
-  
-__A__:  
-Because each of the component partitions in a RAID-1 mirror  
-is a perfectly valid copy of the file system. In a pinch,  
-mirroring can be disabled, and one of the partitions  
-can be mounted and safely run as an ordinary, non-RAID  
-file system. When you are ready to restart using RAID-1,  
-then unmount the partition, and follow the above  
-instructions to restore the mirror. Note that the above  
-works ONLY for RAID-1, and not for any of the other levels.  
-  
-  
-It may make you feel more comfortable to reverse the direction  
-of the copy above: copy __from__ the disk that was untouched  
-__to__ the one that was. Just be sure to fsck the final md.  
-  
-  
-***#  
-  
-***#__Q__:  
-I am confused by the above questions, but am not yet bailing out.  
-Is it safe to run fsck /dev/md0 ?  
-  
-__A__:  
-Yes, it is safe to run fsck on the md devices.  
-In fact, this is the __only__ safe place to run fsck.  
-  
-  
-***#  
-  
-***#__Q__:  
-If a disk is slowly failing, will it be obvious which one it is?  
-I am concerned that it won't be, and this confusion could lead to  
-some dangerous decisions by a sysadmin.  
-  
-__A__:  
-Once a disk fails, an error code will be returned from  
-the low level driver to the RAID driver.  
-The RAID driver will mark it as ``bad'' in the RAID superblocks  
-of the ``good'' disks (so we will later know which mirrors are  
-good and which aren't), and continue RAID operation  
-on the remaining operational mirrors.  
-  
-  
-This, of course, assumes that the disk and the low level driver  
-can detect a read/write error, and will not silently corrupt data,  
-for example. This is true of current drives  
-(error detection schemes are being used internally),  
-and is the basis of RAID operation.  
-  
-  
-***#  
-  
-***#__Q__:  
-What about hot-repair?  
-  
-__A__:  
-Work is underway to complete ``hot reconstruction''.  
-With this feature, one can add several ``spare'' disks to  
-the RAID set (be it level 1 or 4/5), and once a disk fails,  
-it will be reconstructed on one of the spare disks in run time,  
-without ever needing to shut down the array.  
-  
-  
-However, to use this feature, the spare disk must have  
-been declared at boot time, or it must be hot-added,  
-which requires the use of special cabinets and connectors  
-that allow a disk to be added while the electrical power is  
-on.  
-  
-  
-  
-  
-  
-As of October 97, there is a beta version of MD that  
-allows:  
-  
-  
-***#*RAID 1 and 5 reconstruction on spare drives  
-***#*  
-  
-***#*RAID-5 parity reconstruction after an unclean  
-shutdown  
-***#*  
-  
-***#*spare disk to be hot-added to an already running  
-RAID 1 or 4/5 array  
-***#*  
-  
-Automatic reconstruction is currently (Dec 97)  
-disabled by default, due to the preliminary nature of this  
-work. It can be enabled by changing the value of  
-SUPPORT_RECONSTRUCTION in  
-include/linux/md.h.  
-  
-  
-  
-  
-  
-If spare drives were configured into the array when it  
-was created and kernel-based reconstruction is enabled,  
-the spare drive will already contain a RAID superblock  
-(written by mkraid), and the kernel will  
-reconstruct its contents automatically (without needing  
-the usual mdstop, replace drive, ckraid,  
-mdrun steps).  
-  
-  
-  
-  
-  
-If you are not running automatic reconstruction, and have  
-not configured a hot-spare disk, the procedure described by  
-Gadi Oxman <gadio@netvision.net.il>  
-is recommended:  
-  
-  
-***#*Currently, once the first disk is removed, the RAID set will be  
-running in degraded mode. To restore full operation mode,  
-you need to:  
-  
-  
-***#**stop the array (mdstop /dev/md0)  
-***#**  
-  
-***#**replace the failed drive  
-***#**  
-  
-***#**run ckraid raid.conf to reconstruct its contents  
-***#**  
-  
-***#**run the array again (mdadd, mdrun).  
-***#**  
-  
-At this point, the array will be running with all the drives,  
-and again protects against a failure of a single drive.  
-***#*  
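-  
-Spelled out as commands, that sequence might look like this (a sketch only, assuming a three-disk RAID-5 array described by /etc/raid5.conf and mounted on /home; adjust devices and paths to your own setup):  
-  
-umount /home  
-mdstop /dev/md0  
-# physically replace the failed drive and partition it as before, then:  
-ckraid --fix /etc/raid5.conf  
-mdadd /dev/md0 /dev/hda3 /dev/hdc3 /dev/hdd3  
-mdrun -p5 /dev/md0  
-fsck /dev/md0 && mount /dev/md0 /home  
-  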
-  
-  
-  
-Currently, it is not possible to assign a single hot-spare disk  
-to several arrays. Each array requires its own hot-spare.  
-  
-  
-***#  
-  
-***#__Q__:  
-I would like to have an audible alarm for  
-``you schmuck, one disk in the mirror is down'',  
-so that the novice sysadmin knows that there is a problem.  
-  
-__A__:  
-The kernel is logging the event with a  
-``KERN_ALERT'' priority in syslog.  
-There are several software packages that will monitor the  
-syslog files, and beep the PC speaker, call a pager, send e-mail,  
-etc. automatically.  
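-  
-If nothing fancier is at hand, even a crude shell loop can serve as an alarm (a sketch only; the log file name and the strings to match depend on your syslog configuration and kernel version):  
-  
-#!/bin/sh  
-# watch the system log and beep the console whenever an md/RAID message appears  
-tail -f /var/log/messages | while read line ; do  
-    case "$line" in  
-        *raid*|*md0*) printf '\a' > /dev/console ;;  
-    esac  
-done  
-  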
-  
-  
-***#  
-  
-***#__Q__:  
-How do I run RAID-5 in degraded mode  
-(with one disk failed, and not yet replaced)?  
-  
-__A__:  
-Gadi Oxman <gadio@netvision.net.il>  
-writes:  
-Normally, to run a RAID-5 set of n drives you have to:  
-  
-  
-mdadd /dev/md0 /dev/disk1 ... /dev/disk(n)  
-mdrun -p5 /dev/md0  
-  
-  
-Even if one of the disks has failed,  
-you still have to mdadd it as you would in a normal setup.  
-(?? try using /dev/null in place of the failed disk ???  
-watch out)  
-Then the array will be active in degraded mode with (n - 1) drives.  
-If ``mdrun'' fails, the kernel has noticed an error  
-(for example, several faulty drives, or an unclean shutdown).  
-Use ``dmesg'' to display the kernel error messages from  
-``mdrun''.  
-If the raid-5 set is corrupted due to a power loss,  
-rather than a disk crash, one can try to recover by  
-creating a new RAID superblock:  
-  
-  
-mkraid -f --only-superblock raid5.conf  
-  
-  
-A RAID array doesn't provide protection against a power failure or  
-a kernel crash, and can't guarantee correct recovery.  
-Rebuilding the superblock will simply cause the system to ignore  
-the condition by marking all the drives as ``OK'',  
-as if nothing happened.  
-  
-  
-***#  
-  
-***#__Q__:  
-How does RAID-5 work when a disk fails?  
-  
-__A__:  
-The typical operating scenario is as follows:  
-  
-  
-***#*A RAID-5 array is active.  
-  
-***#*  
-  
-***#*One drive fails while the array is active.  
-  
-***#*  
-  
-***#*The drive firmware and the low-level Linux disk/controller  
-drivers detect the failure and report an error code to the  
-MD driver.  
-  
-***#*  
-  
-***#*The MD driver continues to provide an error-free  
-/dev/md0  
-device to the higher levels of the kernel (with a performance  
-degradation) by using the remaining operational drives.  
-  
-***#*  
-  
-***#*The sysadmin can umount /dev/md0 and  
-mdstop /dev/md0 as usual.  
-  
-***#*  
-  
-***#*If the failed drive is not replaced, the sysadmin can still  
-start the array in degraded mode as usual, by running  
-mdadd and mdrun.  
-***#*  
-  
-  
-  
-***#  
-  
-***#__Q__:  
-  
-__A__:  
-  
-  
-***#  
-  
-***#__Q__:  
-Why is there no question 13?  
-  
-__A__:  
-If you are concerned about RAID, High Availability, and UPS,  
-then it's probably a good idea to be superstitious as well.  
-It can't hurt, can it?  
-  
-  
-***#  
-  
-***#__Q__:  
-I just replaced a failed disk in a RAID-5 array. After  
-rebuilding the array, fsck is reporting many, many  
-errors. Is this normal?  
-  
-__A__:  
-No. And, unless you ran fsck in "verify only; do not update"  
-mode, it's quite possible that you have corrupted your data.  
-Unfortunately, a not-uncommon scenario is one of  
-accidentally changing the disk order in a RAID-5 array,  
-after replacing a hard drive. Although the RAID superblock  
-stores the proper order, not all tools use this information.  
-In particular, the current version of ckraid  
-will use the information specified with the -f  
-flag (typically, the file /etc/raid5.conf)  
-instead of the data in the superblock. If the specified  
-order is incorrect, then the replaced disk will be  
-reconstructed incorrectly. The symptom of this  
-kind of mistake seems to be heavy & numerous fsck  
-errors.  
-  
-  
-And, in case you are wondering, __yes__, someone lost  
-__all__ of their data by making this mistake. Making  
-a tape backup of __all__ data before reconfiguring a  
-RAID array is __strongly recommended__.  
-  
-  
-***#  
-  
-***#__Q__:  
-The !QuickStart says that mdstop is just to make sure that the  
-disks are sync'ed. Is this REALLY necessary? Isn't unmounting the  
-file systems enough?  
-  
-__A__:  
-The command mdstop /dev/md0 will do the following (a short usage example appears after the list):  
-  
-  
-***#*mark it ''clean''. This allows us to detect unclean shutdowns, for  
-example due to a power failure or a kernel crash.  
-  
-***#*  
-  
-***#*sync the array. This is less important after unmounting a  
-filesystem, but is important if the /dev/md0 is  
-accessed directly rather than through a filesystem (for  
-example, by e2fsck).  
-***#*  
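-  
-A minimal sketch of the clean shutdown sequence described above:  
-  
-umount /dev/md0  
-mdstop /dev/md0  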
-  
-  
-  
-***#  
-  
-----  
-  
-!!5. Troubleshooting Install Problems  
-  
-  
-  
-  
-  
-***#__Q__:  
-What is the current best known-stable patch for RAID in the  
-2.0.x series kernels?  
-  
-__A__:  
-As of 18 Sept 1997, it is  
-"2..30 + pre-9 2..31 + Werner Fink's swapping patch  
-+ the alpha RAID patch". As of November 1997, it is  
-2..31 + ... !?  
-  
-  
-***#  
-  
-***#__Q__:  
-The RAID patches will not install cleanly for me. What's wrong?  
-  
-__A__:  
-Make sure that /usr/include/linux is a symbolic link to  
-/usr/src/linux/include/linux.  
-Make sure that the new files raid5.c, etc.  
-have been copied to their correct locations. Sometimes  
-the patch command will not create new files. Try the  
--f flag on patch.  
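-  
-A sketch of the checks described above (the patch file name and the  
-patch level -p1 are placeholders; use whatever matches your patch):  
-  
-ls -l /usr/include/linux     # should be a symlink into the kernel source tree  
-rm /usr/include/linux        # safe only if it is a stale symlink  
-ln -s /usr/src/linux/include/linux /usr/include/linux  
-cd /usr/src/linux  
-patch -p1 -f < ../raid-patch # -f, as suggested above, when new files are not created  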
-  
-  
-***#  
-  
-***#__Q__:  
-While compiling raidtools 0.42, compilation stops trying to  
-include <pthread.h> but it doesn't exist in my system.  
-How do I fix this?  
-  
-__A__:  
-raidtools-0.42 requires linuxthreads-0.6 from:  
-ftp://ftp.inria.fr/INRIA/Projects/cristal/Xavier.Leroy  
-Alternately, use glibc v2.0.  
-  
-  
-***#  
-  
-***#__Q__:  
-I get the message: mdrun -a /dev/md0: Invalid argument  
-  
-__A__:  
-Use mkraid to initialize the RAID set prior to the first use.  
-mkraid ensures that the RAID array is initially in a  
-consistent state by erasing the RAID partitions. In addition,  
-mkraid will create the RAID superblocks.  
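-  
-For example, for a RAID-5 set described by /etc/raid5.conf (adjust  
-the config file name for your setup):  
-  
-mkraid /etc/raid5.conf      # writes the superblocks; destroys existing data  
-mdrun -p5 /dev/md0  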
-  
-  
-***#  
-  
-***#__Q__:  
-I get the message: mdrun -a /dev/md0: Invalid argument  
-The setup was:  
-  
-  
-***#*raid build as a kernel module  
-***#*  
-  
-***#*normal install procedure followed ... mdcreate, mdadd, etc.  
-***#*  
-  
-***#*cat /proc/mdstat shows  
-  
-Personalities :  
-read_ahead not set  
-md0 : inactive sda1 sdb1 6313482 blocks  
-md1 : inactive  
-md2 : inactive  
-md3 : inactive  
-  
-  
-***#*  
-  
-***#*mdrun -a generates the error message  
-/dev/md0: Invalid argument  
-***#*  
-  
-  
-__A__:  
-Try lsmod (or, alternately, cat  
-/proc/modules) to see if the raid modules are loaded.  
-If they are not, you can load them explicitly with  
-the modprobe raid1 or modprobe raid5  
-command. Alternately, if you are using the autoloader,  
-and expected kerneld to load them but it didn't,  
-this is probably because your loader is missing the info to  
-load the modules. Edit /etc/conf.modules and add  
-the following lines:  
-  
-alias md-personality-3 raid1  
-alias md-personality-4 raid5  
-  
-  
-  
-***#  
-  
-***#__Q__:  
-While doing mdadd -a I get the error:  
-/dev/md0: No such file or directory. Indeed, there  
-seems to be no /dev/md0 anywhere. Now what do I do?  
-  
-__A__:  
-The raid-tools package will create these devices when  
-you run make install as root. Alternately,  
-you can do the following:  
-  
-cd /dev  
-./MAKEDEV md  
-  
-  
-  
-***#  
-  
-***#__Q__:  
-After creating a raid array on /dev/md0,  
-I try to mount it and get the following error:  
- mount: wrong fs type, bad option, bad superblock on /dev/md0,  
-or too many mounted file systems. What's wrong?  
-  
-__A__:  
-You need to create a file system on /dev/md0  
-before you can mount it. Use mke2fs.  
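-  
-For example (the mount point is illustrative):  
-  
-mke2fs /dev/md0  
-mount /dev/md0 /mnt  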
-  
-  
-***#  
-  
-***#__Q__:  
-Truxton Fulton wrote:  
-  
-On my Linux 2.0.30 system, while doing a mkraid for a  
-RAID-1 device,  
-during the clearing of the two individual partitions, I got  
-"Cannot allocate free page" errors appearing on the console,  
-and "Unable to handle kernel paging request at virtual address ..."  
-errors in the system log. At this time, the system became quite  
-unusable, but it appears to recover after a while. The operation  
-appears to have completed with no other errors, and I am  
-successfully using my RAID-1 device. The errors are disconcerting  
-though. Any ideas?  
-  
-  
-__A__:  
-This was a well-known bug in the 2.0.30 kernels. It is fixed in  
-the 2.0.31 kernel; alternately, fall back to 2.0.29.  
-  
-  
-***#  
-  
-***#__Q__:  
-I'm not able to mdrun a RAID-1, RAID-4 or RAID-5 device.  
-If I try to mdrun a mdadd'ed device I get  
-the message ''invalid raid superblock magic''.  
-  
-__A__:  
-Make sure that you've run the mkraid part of the install  
-procedure.  
-  
-  
-***#  
-  
-***#__Q__:  
-When I access /dev/md0, the kernel spits out a  
-lot of errors like md0: device not running, giving up !  
-and I/O error.... I've successfully added my devices to  
-the virtual device.  
-  
-__A__:  
-To be usable, the device must be running. Use  
-mdrun -px /dev/md0 where x is l for linear, 0 for  
-RAID-0 or 1 for RAID-1, etc.  
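-  
-For example, for a two-disk mirror (the device names are illustrative):  
-  
-mdadd /dev/md0 /dev/sda1 /dev/sdb1  
-mdrun -p1 /dev/md0  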
-  
-  
-***#  
-  
-***#__Q__:  
-I've created a linear md-dev with 2 devices.  
-cat /proc/mdstat shows  
-the total size of the device, but df only shows the size of the first  
-physical device.  
-  
-__A__:  
-You must mkfs your new md-dev before using it  
-the first time, so that the filesystem will cover the whole device.  
-  
-  
-***#  
-  
-***#__Q__:  
-I've set up /etc/mdtab using mdcreate, I've  
-mdadd'ed, mdrun and fsck'ed  
-my two /dev/mdX partitions. Everything looks  
-okay before a reboot. As soon as I reboot, I get an  
-fsck error on both partitions: fsck.ext2: Attempt to read block from filesystem resulted in short  
-read while trying to open /dev/md0. Why?! How do  
-I fix it?!  
-  
-__A__:  
-During the boot process, the RAID partitions must be started  
-before they can be fsck'ed. This must be done  
-in one of the boot scripts. For some distributions,  
-fsck is called from /etc/rc.d/rc.S, for others,  
-it is called from /etc/rc.d/rc.sysinit. Change this  
-file so that mdadd -ar is run *before* fsck -A  
-is executed. Better yet, it is suggested that  
-ckraid be run if mdadd returns with an  
-error. How to do this is discussed in greater detail in  
-question 14 of the section ''Error Recovery''.  
-  
-  
-***#  
-  
-***#__Q__:  
-I get the message invalid raid superblock magic while  
-trying to run an array which consists of partitions which are  
-bigger than 4GB.  
-  
-__A__:  
-This bug is now fixed. (September 97) Make sure you have the latest  
-raid code.  
-  
-  
-***#  
-  
-***#__Q__:  
-I get the message Warning: could not write 8 blocks in inode table starting at 2097175 while trying to run mke2fs on  
-a partition which is larger than 2GB.  
-  
-__A__:  
-This seems to be a problem with mke2fs  
-(November 97). A temporary work-around is to get the mke2fs  
-code, and add #undef HAVE_LLSEEK to  
-e2fsprogs-1.10/lib/ext2fs/llseek.c just before the  
-first #ifdef HAVE_LLSEEK and recompile mke2fs.  
-  
-  
-***#  
-  
-***#__Q__:  
-ckraid currently isn't able to read /etc/mdtab  
-  
-__A__:  
-The RAID0/linear configuration file format used in  
-/etc/mdtab is obsolete, although it will be supported  
-for a while more. The current, up-to-date config files  
-are currently named /etc/raid1.conf, etc.  
-  
-  
-***#  
-  
-***#__Q__:  
-The personality modules (raid1.o) are not loaded automatically;  
-they have to be manually modprobe'd before mdrun. How can this  
-be fixed?  
-  
-__A__:  
-To autoload the modules, we can add the following to  
-/etc/conf.modules:  
-  
-alias md-personality-3 raid1  
-alias md-personality-4 raid5  
-  
-  
-  
-***#  
-  
-***#__Q__:  
-I've mdadd'ed 13 devices, and now I'm trying to  
-mdrun -p5 /dev/md0 and get the message:  
-/dev/md0: Invalid argument  
-  
-__A__:  
-The default configuration for software RAID is 8 real  
-devices. Edit linux/md.h, change  
-#define MAX_REAL 8 to a larger number, and  
-rebuild the kernel.  
-  
-  
-***#  
-  
-***#__Q__:  
-I can't make md work with partitions on our  
-latest SPARCstation 5. I suspect that this has something  
-to do with disk-labels.  
-  
-__A__:  
-Sun disk-labels sit in the first 1K of a partition.  
-For RAID-1, the Sun disk-label is not an issue since  
-ext2fs will skip the label on every mirror.  
-For other raid levels (0, linear and 4/5), this  
-appears to be a problem; it has not yet (Dec 97) been  
-addressed.  
-  
-  
-***#  
-  
-----  
-  
-!!6. Supported Hardware & Software  
-  
-  
-  
-  
-  
-***#__Q__:  
-I have SCSI adapter brand XYZ (with or without several channels),  
-and disk brand(s) PQR and LMN, will these work with md to create  
-a linear/striped/mirrored personality?  
-  
-__A__:  
-Yes! Software RAID will work with any disk controller (IDE  
-or SCSI) and any disks. The disks do not have to be identical,  
-nor do the controllers. For example, a RAID mirror can be  
-created with one half the mirror being a SCSI disk, and the  
-other an IDE disk. The disks do not even have to be the same  
-size. There are no restrictions on the mixing & matching of  
-disks and controllers.  
-  
-  
-This is because Software RAID works with disk partitions, not  
-with the raw disks themselves. The only recommendation is that  
-for RAID levels 1 and 5, the disk partitions that are used as part  
-of the same set be the same size. If the partitions used to make  
-up the RAID 1 or 5 array are not the same size, then the excess  
-space in the larger partitions is wasted (not used).  
-  
-  
-***#  
-  
-***#__Q__:  
-I have a twin channel BT-952, and the box states that it supports  
-hardware RAID 0, 1 and 0+1. I have made a RAID set with two  
-drives, the card apparently recognizes them when it's doing its  
-BIOS startup routine. I've been reading in the driver source code,  
-but found no reference to the hardware RAID support. Anybody out  
-there working on that?  
-  
-__A__:  
-The Mylex/!BusLogic !FlashPoint boards with RAIDPlus are  
-actually software RAID, not hardware RAID at all. RAIDPlus  
-is only supported on Windows 95 and Windows NT, not on  
-Netware or any of the Unix platforms. Aside from booting and  
-configuration, the RAID support is actually in the OS drivers.  
-  
-  
-While in theory Linux support for RAIDPlus is possible, the  
-implementation of RAID-0/1/4/5 in the Linux kernel is much  
-more flexible and should have superior performance, so  
-there's little reason to support RAIDPlus directly.  
-  
-  
-***#  
-  
-***#__Q__:  
-I want to run RAID with an SMP box. Is RAID SMP-safe?  
-  
-__A__:  
-"I think so" is the best answer available at the time I write  
-this (April 98). A number of users report that they have been  
-using RAID with SMP for nearly a year, without problems.  
-However, as of April 98 (circa kernel 2.1.9x), the following  
-problems have been noted on the mailing list:  
-  
-  
-***#*Adaptec AIC7xxx SCSI drivers are not SMP safe  
-(General note: Adaptec adapters have a long  
-& lengthly history  
-of problems & flakiness in general. Although  
-they seem to be the most easily available, widespread  
-and cheapest SCSI adapters, they should be avoided.  
-After factoring for time lost, frustration, and  
-corrupted data, Adaptecs will prove to be the  
-costliest mistake you'll ever make. That said,  
-if you have SMP problems with 2.1.88, try the patch  
-ftp://ftp.bero-online.ml.org/pub/linux/aic7xxx-5..7-linux21.tar.gz  
-I am not sure if this patch has been pulled into later  
-2.1.x kernels.  
-For further info, take a look at the mail archives for  
-March 98 at  
-http://www.linuxhq.com/lnxlists/linux-raid/lr_9803_01/  
-As usual, due to the rapidly changing nature of the  
-latest experimental 2.1.x kernels, the problems  
-described in these mailing lists may or may not have  
-been fixed by the time you read this. Caveat Emptor.  
-)  
-  
-***#*  
-  
-***#*IO-APIC with RAID-0 on SMP has been reported  
-to crash in 2.1.90  
-***#*  
-  
-  
-  
-***#  
-  
-----  
-  
-!!7. Modifying an Existing Installation  
-  
-  
-  
-  
-  
-***#__Q__:  
-Are linear MD's expandable?  
-Can a new hard-drive/partition be added,  
-and the size of the existing file system expanded?  
-  
-__A__:  
-Miguel de Icaza  
-<  
-miguel@luthien.nuclecu.unam.mx>  
-writes:  
-  
-I changed the ext2fs code to be aware of multiple-devices  
-instead of the regular one device per file system assumption.  
-  
-  
-So, when you want to extend a file system,  
-you run a utility program that makes the appropriate changes  
-on the new device (your extra partition) and then you just tell  
-the system to extend the fs using the specified device.  
-  
-  
-  
-  
-  
-You can extend a file system with new devices at system operation  
-time, no need to bring the system down  
-(and whenever I get some extra time, you will be able to remove  
-devices from the ext2 volume set, again without even having  
-to go to single-user mode or any hack like that).  
-  
-  
-  
-  
-  
-You can get the patch for 2.1.x kernel from my web page:  
-  
-http://www.nuclecu.unam.mx/~miguel/ext2-volume  
-  
-  
-  
-***#  
-  
-***#__Q__:  
-Can I add disks to a RAID-5 array?  
-  
-__A__:  
-Currently, (September 1997) no, not without erasing all  
-data. A conversion utility to allow this does not yet exist.  
-The problem is that the actual structure and layout  
-of a RAID-5 array depends on the number of disks in the array.  
-Of course, one can add drives by backing up the array to tape,  
-deleting all data, creating a new array, and restoring from  
-tape.  
-  
-  
-***#  
-  
-***#__Q__:  
-What would happen to my RAID1/RAID0 sets if I shift one  
-of the drives from being /dev/hdb to /dev/hdc?  
-Because of cabling/case size/stupidity issues, I had to  
-make my RAID sets on the same IDE controller (/dev/hda  
-and /dev/hdb). Now that I've fixed some stuff, I want  
-to move /dev/hdb to /dev/hdc.  
-What would happen if I just change the /etc/mdtab and  
-/etc/raid1.conf files to reflect the new location?  
-  
-__A__:  
-For RAID-0/linear, one must be careful to specify the  
-drives in exactly the same order. Thus, in the above  
-example, if the original config is  
-  
-  
-mdadd /dev/md0 /dev/hda /dev/hdb  
-  
-  
-Then the new config *must* be  
-  
-  
-mdadd /dev/md0 /dev/hda /dev/hdc  
-  
-  
-For RAID-1/4/5, the drive's ''RAID number'' is stored in  
-its RAID superblock, and therefore the order in which the  
-disks are specified is not important.  
-RAID-0/linear does not have a superblock due to its older  
-design, and the desire to maintain backwards compatibility  
-with this older design.  
-  
-  
-***#  
-  
-***#__Q__:  
-Can I convert a two-disk RAID-1 mirror to a three-disk RAID-5 array?  
-  
-__A__:  
-Yes. Michael at !BizSystems has come up with a clever,  
-sneaky way of doing this. However, like virtually all  
-manipulations of RAID arrays once they have data on  
-them, it is dangerous and prone to human error.  
-__Make a backup before you start__.  
-  
-I will make the following assumptions:  
----------------------------------------------  
-disks  
-original: hda - hdc  
-raid1 partitions hda3 - hdc3  
-array name /dev/md0  
-new hda - hdc - hdd  
-raid5 partitions hda3 - hdc3 - hdd3  
-array name: /dev/md1  
-You must substitute the appropriate disk and partition numbers for  
-your system configuration. This will hold true for all config file  
-examples.  
---------------------------------------------  
-DO A BACKUP BEFORE YOU DO ANYTHING  
-1) recompile kernel to include both raid1 and raid5  
-2) install new kernel and verify that raid personalities are present  
-3) disable the redundant partition on the raid 1 array. If this is a  
-root mounted partition (mine was) you must be more careful.  
-Reboot the kernel without starting raid devices or boot from rescue  
-system ( raid tools must be available )  
-start non-redundant raid1  
-mdadd -r -p1 /dev/md0 /dev/hda3  
-4) configure raid5 but with a 'funny' config file; note that there is  
-no hda3 entry and hdc3 is repeated. This trick is needed because the  
-raid tools would not otherwise let you do this.  
--------------------------------  
-# raid-5 configuration  
-raiddev /dev/md1  
-raid-level 5  
-nr-raid-disks 3  
-chunk-size 32  
-# Parity placement algorithm  
-parity-algorithm left-symmetric  
-# Spare disks for hot reconstruction  
-nr-spare-disks 0  
-device /dev/hdc3  
-raid-disk 0  
-device /dev/hdc3  
-raid-disk 1  
-device /dev/hdd3  
-raid-disk 2  
----------------------------------------  
-mkraid /etc/raid5.conf  
-5) activate the raid5 array in non-redundant mode  
-mdadd -r -p5 -c32k /dev/md1 /dev/hdc3 /dev/hdd3  
-6) make a file system on the array  
-mke2fs -b {blocksize} /dev/md1  
-The blocksize recommended by some is 4096 rather than the default 1024.  
-This improves the memory utilization for the kernel raid routines and  
-matches the blocksize to the page size. I compromised and used 2048  
-since I have a relatively high number of small files on my system.  
-7) mount the two raid devices somewhere  
-mount -t ext2 /dev/md0 mnt0  
-mount -t ext2 /dev/md1 mnt1  
-8) move the data  
-cp -a mnt0 mnt1  
-9) verify that the data sets are identical  
-10) stop both arrays  
-11) correct the information for the raid5.conf file  
-change /dev/md1 to /dev/md0  
-change the first disk to read /dev/hda3  
-12) upgrade the new array to full redundant status  
-(THIS DESTROYS REMAINING raid1 INFORMATION)  
-ckraid --fix /etc/raid5.conf  
-  
-  
-  
-***#  
-  
-----  
-  
-!!8. Performance, Tools & General Bone-headed Questions  
-  
-  
-  
-  
-  
-***#__Q__:  
-I've created a RAID-0 device on /dev/sda2 and  
-/dev/sda3. The device is a lot slower than a  
-single partition. Isn't md a pile of junk?  
-  
-__A__:  
-To have a RAID-0 device running at full speed, you must  
-have partitions from different disks. Besides, putting  
-the two halves of the mirror on the same disk fails to  
-give you any protection whatsoever against disk failure.  
-  
-  
-***#  
-  
-***#__Q__:  
-What's the use of having RAID-linear when RAID-0 will do the  
-same thing, but provide higher performance?  
-  
-__A__:  
-It's not obvious that RAID-0 will always provide better  
-performance; in fact, in some cases, it could make things  
-worse.  
-The ext2fs file system scatters files all over a partition,  
-while it tries to keep all of the blocks of each file  
-contiguous, basically in an attempt to prevent fragmentation.  
-Thus, ext2fs behaves "as if" there were a (variable-sized)  
-stripe per file. If there are several disks concatenated  
-into a single RAID-linear, this will result in files being  
-statistically distributed on each of the disks. Thus,  
-at least for ext2fs, RAID-linear will behave a lot like  
-RAID-0 with large stripe sizes. Conversely, RAID-0  
-with small stripe sizes can cause excessive disk activity  
-leading to severely degraded performance if several large files  
-are accessed simultaneously.  
-  
-  
-In many cases, RAID-0 can be an obvious win. For example,  
-imagine a large database file. Since ext2fs attempts to  
-cluster together all of the blocks of a file, chances  
-are good that it will end up on only one drive if RAID-linear  
-is used, but will get chopped into lots of stripes if RAID-0 is  
-used. Now imagine a number of (kernel) threads all trying  
-to randomly access this database. Under RAID-linear, all  
-accesses would go to one disk, which would not be as efficient  
-as the parallel accesses that RAID-0 entails.  
-  
-  
-***#  
-  
-***#__Q__:  
-How does RAID-0 handle a situation where the different stripe  
-partitions are different sizes? Are the stripes uniformly  
-distributed?  
-  
-__A__:  
-To understand this, let's look at an example with three  
-partitions; one that is 50MB, one 90MB and one 125MB.  
-Let's call D0 the 50MB disk, D1 the 90MB disk and D2 the 125MB  
-disk. When you start the device, the driver calculates 'strip  
-zones'. In this case, it finds 3 zones, defined like this:  
-  
-Z0 : (D0/D1/D2) 3 x 50 = 150MB total in this zone  
-Z1 : (D1/D2) 2 x 40 = 80MB total in this zone  
-Z2 : (D2) 125-50-40 = 35MB total in this zone.  
-  
-You can see that the total size of the zones is the size of the  
-virtual device, but, depending on the zone, the striping is  
-different. Z2 is rather inefficient, since there's only one  
-disk.  
-Since ext2fs and most other Unix  
-file systems distribute files all over the disk, you  
-have a 35/265 = 13% chance that a file will end up  
-on Z2, and not get any of the benefits of striping.  
-(DOS tries to fill a disk from beginning to end, and thus,  
-the oldest files would end up on Z0. However, this  
-strategy leads to severe filesystem fragmentation,  
-which is why no one besides DOS does it this way.)  
-  
-  
-***#  
-  
-***#__Q__:  
-I have some Brand X hard disks and a Brand Y controller,  
-and am considering using md.  
-Does it significantly increase the throughput?  
-Is the performance really noticeable?  
-  
-__A__:  
-The answer depends on the configuration that you use.  
-  
-  
-  
-  
-; __Linux MD RAID-0 and RAID-linear performance:__:  
-  
-If the system is heavily loaded with lots of I/O,  
-statistically, some of it will go to one disk, and  
-some to the others. Thus, performance will improve  
-over a single large disk. The actual improvement  
-depends a lot on the actual data, stripe sizes, and  
-other factors. In a system with low I/O usage,  
-the performance is equal to that of a single disk.  
-  
-  
-  
-  
-  
-  
-; __Linux MD RAID-1 (mirroring) read performance:__:  
-  
-MD implements read balancing. That is, the RAID-1  
-code will alternate between each of the (two or more)  
-disks in the mirror, making alternate reads to each.  
-In a low-I/O situation, this won't change performance  
-at all: you will have to wait for one disk to complete  
-the read.  
-But, with two disks in a high-I/O environment,  
-this could as much as double the read performance,  
-since reads can be issued to each of the disks in parallel.  
-For N disks in the mirror, this could improve performance  
-N-fold.  
-  
-  
-  
-; __Linux MD RAID-1 (mirroring) write performance:__:  
-  
-A write must wait until it has completed on all of the disks  
-in the mirror. This is because a copy of the data  
-must be written to each of the disks in the mirror.  
-Thus, performance will be roughly equal to the write  
-performance to a single disk.  
-  
-  
-  
-; __Linux MD RAID-4/5 read performance:__:  
-  
-Statistically, a given block can be on any one of a number  
-of disk drives, and thus RAID-4/5 read performance is  
-a lot like that for RAID-0. It will depend on the data, the  
-stripe size, and the application. It will not be as good  
-as the read performance of a mirrored array.  
-  
-  
-  
-; __Linux MD RAID-4/5 write performance:__:  
-  
-This will in general be considerably slower than that for  
-a single disk. This is because the parity must be written  
-out to one drive as well as the data to another. However,  
-in order to compute the new parity, the old parity and  
-the old data must be read first. The old data, new data and  
-old parity must all be XOR'ed together to determine the new  
-parity: this requires considerable CPU cycles in addition  
-to the numerous disk accesses.  
-  
-  
-  
-***#  
-  
-***#__Q__:  
-What RAID configuration should I use for optimal performance?  
-  
-__A__:  
-Is the goal to maximize throughput, or to minimize latency?  
-There is no easy answer, as there are many factors that  
-affect performance:  
-  
-  
-***#*operating system - will one process/thread, or many  
-be performing disk access?  
-***#*  
-  
-***#*application - is it accessing data in a  
-sequential fashion, or random access?  
-***#*  
-  
-***#*file system - clusters files or spreads them out  
-(the ext2fs clusters together the blocks of a file,  
-and spreads out files)  
-***#*  
-  
-***#*disk driver - number of blocks to read ahead  
-(this is a tunable parameter)  
-***#*  
-  
-***#*CEC hardware - one drive controller, or many?  
-***#*  
-  
-***#*hd controller - able to queue multiple requests or not?  
-Does it provide a cache?  
-***#*  
-  
-***#*hard drive - buffer cache memory size -- is it big  
-enough to handle the write sizes and rate you want?  
-***#*  
-  
-***#*physical platters - blocks per cylinder -- accessing  
-blocks on different cylinders will lead to seeks.  
-***#*  
-  
-  
-  
-***#  
-  
-***#__Q__:  
-What is the optimal RAID-5 configuration for performance?  
-  
-__A__:  
-Since RAID-5 experiences an I/O load that is equally  
-distributed  
-across several drives, the best performance will be  
-obtained when the RAID set is balanced by using  
-identical drives, identical controllers, and the  
-same (low) number of drives on each controller.  
-Note, however, that using identical components will  
-raise the probability of multiple simultaneous failures,  
-for example due to a sudden jolt or drop, overheating,  
-or a power surge during an electrical storm. Mixing  
-brands and models helps reduce this risk.  
-  
-  
-***#  
-  
-***#__Q__:  
-What is the optimal block size for a RAID-4/5 array?  
-  
-__A__:  
-When using the current (November 1997) RAID-4/5  
-implementation, it is strongly recommended that  
-the file system be created with mke2fs -b 4096  
-instead of the default 1024 byte filesystem block size.  
-  
-  
-This is because the current RAID-5 implementation  
-allocates one 4K memory page per disk block;  
-if a disk block were just 1K in size, then  
-75% of the memory which RAID-5 is allocating for  
-pending I/O would not be used. If the disk block  
-size matches the memory page size, then the  
-driver can (potentially) use all of the page.  
-Thus, for a filesystem with a 4096 block size as  
-opposed to a 1024 byte block size, the RAID driver  
-will potentially queue 4 times as much  
-pending I/O to the low level drivers without  
-allocating additional memory.  
-  
-  
-  
-  
-  
-__Note__: the above remarks do NOT apply to Software  
-RAID-0/1/linear driver.  
-  
-  
-  
-  
-  
-__Note:__ the statements about 4K memory page size apply to the  
-Intel x86 architecture. The page size on Alpha, Sparc, and other  
-CPUs is different; I believe it's 8K on Alpha/Sparc (????).  
-Adjust the above figures accordingly.  
-  
-  
-  
-  
-  
-__Note:__ if your file system has a lot of small  
-files (files less than 10KBytes in size), a considerable  
-fraction of the disk space might be wasted. This is  
-because the file system allocates disk space in multiples  
-of the block size. Allocating large blocks for small files  
-clearly results in a waste of disk space: thus, you may  
-want to stick to small block sizes, get a larger effective  
-storage capacity, and not worry about the "wasted" memory  
-due to the block-size/page-size mismatch.  
-  
-  
-  
-  
-  
-__Note:__ most ''typical'' systems do not have that many  
-small files. That is, although there might be thousands  
-of small files, this would lead to only some 10 to 100MB  
-wasted space, which is probably an acceptable tradeoff for  
-performance on a multi-gigabyte disk.  
-  
-  
-However, for news servers, there might be tens or hundreds  
-of thousands of small files. In such cases, the smaller  
-block size, and thus the improved storage capacity,  
-may be more important than the more efficient I/O  
-scheduling.  
-  
-  
-  
-  
-  
-__Note:__ there exists an experimental file system for Linux  
-which packs small files and file chunks onto a single block.  
-It apparently has some very positive performance  
-implications when the average file size is much smaller than  
-the block size.  
-  
-  
-  
-  
-  
-Note: Future versions may implement schemes that obsolete  
-the above discussion. However, this is difficult to  
-implement, since dynamic run-time allocation can lead to  
-dead-locks; the current implementation performs a static  
-pre-allocation.  
-  
-  
-***#  
-  
-***#__Q__:  
-How does the chunk size (stripe size) influence the speed of  
-my RAID-0, RAID-4 or RAID-5 device?  
-  
-__A__:  
-The chunk size is the amount of data contiguous on the  
-virtual device that is also contiguous on the physical  
-device. In this HOWTO, "chunk" and "stripe" refer to  
-the same thing: what is commonly called the "stripe"  
-in other RAID documentation is called the "chunk"  
-in the MD man pages. Stripes or chunks apply only to  
-RAID-0, 4 and 5, since stripes are not used in  
-mirroring (RAID-1) and simple concatenation (RAID-linear).  
-The stripe size affects both read and write latency (delay),  
-throughput (bandwidth), and contention between independent  
-operations (ability to simultaneously service overlapping I/O  
-requests).  
-  
-  
-Assuming the use of the ext2fs file system, and the current  
-kernel policies about read-ahead, large stripe sizes are almost  
-always better than small stripe sizes, and stripe sizes  
-from about a fourth to a full disk cylinder in size  
-may be best. To understand this claim, let us consider the  
-effects of large stripes on small files, and small stripes  
-on large files. The stripe size does  
-not affect the read performance of small files: For an  
-array of N drives, the file has a 1/N probability of  
-being entirely within one stripe on any one of the drives.  
-Thus, both the read latency and bandwidth will be comparable  
-to that of a single drive. Assuming that the small files  
-are statistically well distributed around the filesystem,  
-(and, with the ext2fs file system, they should be), roughly  
-N times more overlapping, concurrent reads should be possible  
-without significant collision between them. Conversely, if  
-very small stripes are used, and a large file is read sequentially,  
-then a read will be issued to all of the disks in the array.  
-For the read of a single large file, the latency will almost  
-double, as the probability of a block being 3/4'ths of a  
-revolution or farther away will increase. Note, however,  
-the trade-off: the bandwidth could improve almost N-fold  
-for reading a single, large file, as N drives can be reading  
-simultaneously (that is, if read-ahead is used so that all  
-of the disks are kept active). But there is another,  
-counter-acting trade-off: if all of the drives are already busy  
-reading one file, then attempting to read a second or third  
-file at the same time will cause significant contention,  
-ruining performance as the disk ladder algorithms lead to  
-seeks all over the platter. Thus, large stripes will almost  
-always lead to the best performance. The sole exception is  
-the case where one is streaming a single, large file at a  
-time, and one requires the top possible bandwidth, and one  
-is also using a good read-ahead algorithm, in which case small  
-stripes are desired.  
-  
-  
-  
-  
-  
-Note that this HOWTO previously recommended small stripe  
-sizes for news spools or other systems with lots of small  
-files. This was bad advice, and here's why: news spools  
-contain not only many small files, but also large summary  
-files, as well as large directories. If the summary file  
-is larger than the stripe size, reading it will cause  
-many disks to be accessed, slowing things down as each  
-disk performs a seek. Similarly, the current ext2fs  
-file system searches directories in a linear, sequential  
-fashion. Thus, to find a given file or inode, on average  
-half of the directory will be read. If this directory is  
-spread across several stripes (several disks), the  
-directory read (e.g. due to the ls command) could get  
-very slow. Thanks to Steven A. Reisman  
-<  
-sar@pressenter.com> for this correction.  
-Steve also adds:  
-  
-I found that using a 256k stripe gives much better performance.  
-I suspect that the optimum size would be the size of a disk  
-cylinder (or maybe the size of the disk drive's sector cache).  
-However, disks nowadays have recording zones with different  
-sector counts (and sector caches vary among different disk  
-models). There's no way to guarantee stripes won't cross a  
-cylinder boundary.  
-  
-  
-  
-  
-  
-  
-  
-  
-  
-The tools accept the stripe size specified in KBytes.  
-You'll want to specify a multiple of the page size  
-for your CPU (4KB on the x86).  
-  
-  
-  
-  
-  
-***#  
-  
-***#__Q__:  
-What is the correct stride factor to use when creating the  
-ext2fs file system on the RAID partition? By stride, I mean  
-the -R flag on the mke2fs command:  
-  
-mke2fs -b 4096 -R stride=nnn ...  
-  
-What should the value of nnn be?  
-  
-__A__:  
-The -R stride flag is used to tell the file system  
-about the size of the RAID stripes. Since only RAID-0, 4 and 5  
-use stripes, and RAID-1 (mirroring) and RAID-linear do not,  
-this flag is applicable only for RAID-0, 4 and 5.  
-Knowledge of the size of a stripe allows mke2fs  
-to allocate the block and inode bitmaps so that they don't  
-all end up on the same physical drive. An unknown contributor  
-wrote:  
-  
-I noticed last spring that one drive in a pair always had a  
-larger I/O count, and tracked it down to the these meta-data  
-blocks. Ted added the -R stride= option in response  
-to my explanation and request for a workaround.  
-  
-For a 4KB block file system, with stripe size 256KB, one would  
-use -R stride=64.  
-  
-  
-If you don't trust the -R flag, you can get a similar  
-effect in a different way. Steven A. Reisman  
-<  
-sar@pressenter.com> writes:  
-  
-Another consideration is the filesystem used on the RAID-0 device.  
-The ext2 filesystem allocates 8192 blocks per group. Each group  
-has its own set of inodes. If there are 2, 4 or 8 drives, these  
-inodes cluster on the first disk. I've distributed the inodes  
-across all drives by telling mke2fs to allocate only 7932 blocks  
-per group.  
-  
-Some mke2fs pages do not describe the [[-g blocks-per-group]  
-flag used in this operation.  
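-  
-A sketch combining the two suggestions above (the 256KB stripe size  
-and the device name are illustrative):  
-  
-mke2fs -b 4096 -R stride=64 /dev/md0    # 256KB stripe / 4KB blocks = 64  
-mke2fs -b 4096 -g 7932 /dev/md0         # alternative: 7932 blocks per group  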
-  
-  
-***#  
-  
-***#__Q__:  
-Where can I put the md commands in the startup scripts,  
-so that everything will start automatically at boot time?  
-  
-__A__:  
-Rod Wilkens  
-<  
-rwilkens@border.net>  
-writes:  
-  
-What I did is put ``mdadd -ar'' in  
-the ``/etc/rc.d/rc.sysinit'' right after the kernel  
-loads the modules, and before the ``fsck'' disk check.  
-This way, you can put the ``/dev/md?'' device in the  
-``/etc/fstab''. Then I put the ``mdstop -a''  
-right after the ``umount -a'' unmounting the disks,  
-in the ``/etc/rc.d/init.d/halt'' file.  
-  
-For raid-5, you will want to look at the return code  
-for mdadd, and if it failed, do a  
-  
-  
-ckraid --fix /etc/raid5.conf  
-  
-  
-to repair any damage.  
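-  
-A sketch of the fragments described above (file locations vary by  
-distribution; the fsck line is the one already in the script, shown  
-only to make the ordering clear):  
-  
-# in /etc/rc.d/rc.sysinit, after the modules are loaded:  
-mdadd -ar  
-if [ $? -gt 0 ] ; then  
-ckraid --fix /etc/raid5.conf  
-fi  
-fsck -A  
-  
-# in /etc/rc.d/init.d/halt, right after the ``umount -a'':  
-mdstop -a  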
-  
-  
-***#  
-  
-***#__Q__:  
-I was wondering if it's possible to set up striping with more  
-than 2 devices in md0? This is for a news server,  
-and I have 9 drives... Needless to say I need much more than two.  
-Is this possible?  
-  
-__A__:  
-Yes. (describe how to do this)  
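-  
-A rough sketch, assuming nine partitions /dev/sda1 through /dev/sdi1  
-(the names are illustrative). Remember that the default limit of 8  
-real devices per MD device (MAX_REAL) must be raised first, as  
-described in the ''Troubleshooting'' section above:  
-  
-mdadd /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1  
-mdrun -p0 /dev/md0  
-mke2fs /dev/md0  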
-  
-  
-***#  
-  
-***#__Q__:  
-When is Software RAID superior to Hardware RAID?  
-  
-__A__:  
-Normally, Hardware RAID is considered superior to Software  
-RAID, because hardware controllers often have a large cache,  
-and can do a better job of scheduling operations in parallel.  
-However, integrated Software RAID can (and does) gain certain  
-advantages from being close to the operating system.  
-  
-  
-For example, ... ummm. Opaque description of caching of  
-reconstructed blocks in buffer cache elided ...  
-  
-  
-  
-  
-  
-On a dual PPro SMP system, it has been reported that  
-Software-RAID performance exceeds the performance of a  
-well-known hardware-RAID board vendor by a factor of  
-2 to 5.  
-  
-  
-  
-  
-  
-Software RAID is also a very interesting option for  
-high-availability redundant server systems. In such  
-a configuration, two CPUs are attached to one set  
-of SCSI disks. If one server crashes or fails to  
-respond, then the other server can mdadd,  
-mdrun and mount the software RAID  
-array, and take over operations. This sort of dual-ended  
-operation is not always possible with many hardware  
-RAID controllers, because of the state configuration that  
-the hardware controllers maintain.  
-  
-  
-***#  
-  
-***#__Q__:  
-If I upgrade my version of raidtools, will it have trouble  
-manipulating older raid arrays? In short, should I recreate my  
-RAID arrays when upgrading the raid utilities?  
-  
-__A__:  
-No, not unless the major version number changes.  
-An MD version x.y.z consists of three sub-versions:  
-  
-x: Major version.  
-y: Minor version.  
-z: Patchlevel version.  
-  
-Version x1.y1.z1 of the RAID driver supports a RAID array with  
-version x2.y2.z2 in case (x1 == x2) and (y1 >= y2).  
-Different patchlevel (z) versions for the same (x.y) version are  
-designed to be mostly compatible.  
-  
-  
-The minor version number is increased whenever the RAID array layout  
-is changed in a way which is incompatible with older versions of the  
-driver. New versions of the driver will maintain compatibility with  
-older RAID arrays.  
-  
-  
-The major version number will be increased if it will no longer make  
-sense to support old RAID arrays in the new kernel code.  
-  
-  
-  
-  
-  
-For RAID-1, it's not likely that either the disk layout or the  
-superblock structure will change anytime soon.  
-Optimizations and new features (reconstruction, multithreaded  
-tools, hot-plug, etc.) do not affect the physical layout.  
-  
-  
-***#  
-  
-***#__Q__:  
-The command mdstop /dev/md0 says that the device is busy.  
-  
-__A__:  
-There's a process that has a file open on /dev/md0, or  
-/dev/md0 is still mounted. Terminate the process or  
-umount /dev/md0.  
-  
-  
-***#  
-  
-***#__Q__:  
-Are there performance tools?  
-  
-__A__:  
-There is a new utility called iotrace in the  
-linux/iotrace  
-directory. It reads /proc/io-trace and analyses/plots its  
-output. If you feel your system's block IO performance is too  
-low, just look at the iotrace output.  
-  
-  
-***#  
-  
-***#__Q__:  
-I was reading the RAID source, and saw the value  
-SPEED_LIMIT defined as 1024K/sec. What does this mean?  
-Does this limit performance?  
-  
-__A__:  
-SPEED_LIMIT is used to limit RAID reconstruction  
-speed during automatic reconstruction. Basically, automatic  
-reconstruction allows you to e2fsck and  
-mount immediately after an unclean shutdown,  
-without first running ckraid. Automatic  
-reconstruction is also used after a failed hard drive  
-has been replaced.  
-  
-  
-In order to avoid overwhelming the system while  
-reconstruction is occurring, the reconstruction thread  
-monitors the reconstruction speed and slows it down if  
-it's too fast. The 1M/sec limit was arbitrarily chosen  
-as a reasonable rate which allows the reconstruction to  
-finish reasonably rapidly, while creating only a light load  
-on the system so that other processes are not interfered with.  
-  
-  
-***#  
-  
-***#__Q__:  
-What about ''spindle synchronization'' or ''disk  
-synchronization''?  
-  
-__A__:  
-Spindle synchronization is used to keep multiple hard drives  
-spinning at exactly the same speed, so that their disk  
-platters are always perfectly aligned. This is used by some  
-hardware controllers to better organize disk writes.  
-However, for software RAID, this information is not used,  
-and spindle synchronization might even hurt performance.  
-  
-  
-***#  
-  
-***#__Q__:  
-How can I set up swap spaces using raid?  
-Wouldn't striped swap areas over 4+ drives be really fast?  
-  
-__A__:  
-Leonard N. Zubkoff replies:  
-It is really fast, but you don't need to use MD to get striped  
-swap. The kernel automatically stripes across equal priority  
-swap spaces. For example, the following entries from  
-/etc/fstab stripe swap space across five drives in  
-three groups:  
-  
-/dev/sdg1 swap swap pri=3  
-/dev/sdk1 swap swap pri=3  
-/dev/sdd1 swap swap pri=3  
-/dev/sdh1 swap swap pri=3  
-/dev/sdl1 swap swap pri=3  
-/dev/sdg2 swap swap pri=2  
-/dev/sdk2 swap swap pri=2  
-/dev/sdd2 swap swap pri=2  
-/dev/sdh2 swap swap pri=2  
-/dev/sdl2 swap swap pri=2  
-/dev/sdg3 swap swap pri=1  
-/dev/sdk3 swap swap pri=1  
-/dev/sdd3 swap swap pri=1  
-/dev/sdh3 swap swap pri=1  
-/dev/sdl3 swap swap pri=1  
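-  
-After adding entries like these, a sketch for activating and checking  
-them (assuming your util-linux provides these options):  
-  
-swapon -a           # activate every swap entry in /etc/fstab  
-cat /proc/swaps     # verify the priorities, on kernels that provide it  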
-  
-  
-  
-***#  
-  
-***#__Q__:  
-I want to maximize performance. Should I use multiple  
-controllers?  
-  
-__A__:  
-In many cases, the answer is yes. Using several  
-controllers to perform disk access in parallel will  
-improve performance. However, the actual improvement  
-depends on your actual configuration. For example,  
-it has been reported (Vaughan Pratt, January 98) that  
-a single 4.3GB Cheetah attached to an Adaptec 2940UW  
-can achieve a rate of 14MB/sec (without using RAID).  
-Installing two disks on one controller, and using  
-a RAID-0 configuration results in a measured performance  
-of 27 MB/sec.  
-  
-  
-Note that the 2940UW controller is an "Ultra-Wide"  
-SCSI controller, capable of a theoretical burst rate  
-of 40MB/sec, and so the above measurements are not  
-surprising. However, a slower controller attached  
-to two fast disks would be the bottleneck. Note also,  
-that most out-board SCSI enclosures (e.g. the kind  
-with hot-pluggable trays) cannot be run at the 40MB/sec  
-rate, due to cabling and electrical noise problems.  
-  
-  
-  
-  
-  
-If you are designing a multiple controller system,  
-remember that most disks and controllers typically  
-run at 70-85% of their rated max speeds.  
-  
-  
-  
-  
-  
-Note also that using one controller per disk  
-can reduce the likelihood of system outage  
-due to a controller or cable failure (In theory --  
-only if the device driver for the controller can  
-gracefully handle a broken controller. Not all  
-SCSI device drivers seem to be able to handle such  
-a situation without panicking or otherwise locking up).  
-  
-  
-***#  
-  
-----  
-  
-!!9. High Availability RAID  
-  
-  
-  
-  
-  
-***#__Q__:  
-RAID can help protect me against data loss. But how can I also  
-ensure that the system is up as long as possible, and not prone  
-to breakdown? Ideally, I want a system that is up 24 hours a  
-day, 7 days a week, 365 days a year.  
-  
-__A__:  
-High-Availability is difficult and expensive. The harder  
-you try to make a system be fault tolerant, the harder  
-and more expensive it gets. The following hints, tips,  
-ideas and unsubstantiated rumors may help you with this  
-quest.  
-  
-  
-***#*IDE disks can fail in such a way that the failed disk  
-on an IDE ribbon can also prevent the good disk on the  
-same ribbon from responding, thus making it look as  
-if two disks have failed. Since RAID does not  
-protect against two-disk failures, one should either  
-put only one disk on an IDE cable, or if there are two  
-disks, they should belong to different RAID sets.  
-***#*  
-  
-***#*SCSI disks can fail in such a way that the failed disk  
-on a SCSI chain can prevent any device on the chain  
-from being accessed. The failure mode involves a  
-short of the common (shared) device ready pin;  
-since this pin is shared, no arbitration can occur  
-until the short is removed. Thus, no two disks on the  
-same SCSI chain should belong to the same RAID array.  
-***#*  
-  
-***#*Similar remarks apply to the disk controllers.  
-Don't load up the channels on one controller; use  
-multiple controllers.  
-***#*  
-  
-***#*Don't use the same brand or model number for all of  
-the disks. It is not uncommon for severe electrical  
-storms to take out two or more disks. (Yes, we  
-all use surge suppressors, but these are not perfect  
-either). Heat & poor ventilation of the disk  
-enclosure are other disk killers. Cheap disks  
-often run hot.  
-Using different brands of disk & controller  
-decreases the likelihood that whatever took out one disk  
-(heat, physical shock, vibration, electrical surge)  
-will also damage the others on the same date.  
-***#*  
-  
-***#*To guard against controller or CPU failure,  
-it should be possible to build a SCSI disk enclosure  
-that is "twin-tailed": i.e. is connected to two  
-computers. One computer will mount the file-systems  
-read-write, while the second computer will mount them  
-read-only, and act as a hot spare. When the hot-spare  
-is able to determine that the master has failed (e.g.  
-through a watchdog), it will cut the power to the  
-master (to make sure that it's really off), and then  
-fsck & remount read-write. If anyone gets  
-this working, let me know.  
-***#*  
-  
-***#*Always use a UPS, and perform clean shutdowns.  
-Although an unclean shutdown may not damage the disks,  
-running ckraid on even small-ish arrays is painfully  
-slow. You want to avoid running ckraid as much as  
-possible. Or you can hack on the kernel and get the  
-hot-reconstruction code debugged ...  
-***#*  
-  
-***#*SCSI cables are well-known to be very temperamental  
-creatures, and prone to cause all sorts of problems.  
-Use the highest quality cabling that you can find for  
-sale. Use e.g. bubble-wrap to make sure that ribbon  
-cables do not get too close to one another and  
-cross-talk. Rigorously observe cable-length  
-restrictions.  
-***#*  
-  
-***#*Take a look at SSA (Serial Storage Architecture).  
-Although it is rather expensive, it is rumored  
-to be less prone to the failure modes that SCSI  
-exhibits.  
-***#*  
-  
-***#*Enjoy yourself, it's later than you think.  
-***#*  
-  
-  
-  
-***#  
-  
-----  
-  
-!!10. Questions Waiting for Answers  
-  
-  
-  
-  
-  
-***#__Q__:  
-If, for cost reasons, I try to mirror a slow disk with a fast disk,  
-is the S/W smart enough to balance the reads accordingly or will it  
-all slow down to the speed of the slowest?  
-  
-  
-  
-  
-***#  
-  
-***#__Q__:  
-For testing the raw disk throughput...  
-is there a character device for raw read/raw writes instead of  
-/dev/sdaxx that we can use to measure performance  
-on the raid drives??  
-is there a GUI-based tool to use to watch the disk throughput??  
-  
-  
-  
-  
-***#  
-  
-----  
-  
-!!11. Wish List of Enhancements to MD and Related Software  
-  
-  
-Bradley Ward Allen  
-<  
-ulmo@Q.Net>  
-wrote:  
-  
-Ideas include:  
-  
-  
-****Boot-up parameters to tell the kernel which devices are  
-to be MD devices (no more ``mdadd'')  
-****  
-  
-****Making MD transparent to ``mount''/``umount''  
-such that there is no ``mdrun'' and ``mdstop''  
-****  
-  
-****Integrating ``ckraid'' entirely into the kernel,  
-and letting it run as needed  
-****  
-  
-(So far, all I've done is suggest getting rid of the tools and putting  
-them into the kernel; that's how I feel about it,  
-this is a filesystem, not a toy.)  
-  
-  
-****Deal with arrays that can easily survive N disks going out  
-simultaneously or at separate moments,  
-where N is a whole number > 0 settable by the administrator  
-****  
-  
-****Handle kernel freezes, power outages,  
-and other abrupt shutdowns better  
-****  
-  
-****Don't disable a whole disk if only parts of it have failed,  
-e.g., if the sector errors are confined to less than 50% of  
-access over the attempts of 20 dissimilar requests,  
-then it continues just ignoring those sectors of that particular  
-disk.  
-****  
-  
-****Bad sectors:  
-  
-  
-*****A mechanism for saving which sectors are bad,  
-someplace onto the disk.  
-*****  
-  
-*****If there is a generalized mechanism for marking degraded  
-bad blocks that upper filesystem levels can recognize,  
-use that. Program it if not.  
-*****  
-  
-*****Perhaps alternatively a mechanism for telling the upper  
-layer that the size of the disk got smaller,  
-even arranging for the upper layer to move out stuff from  
-the areas being eliminated.  
-This would help with degraded blocks as well.  
-*****  
-  
-*****Failing the above ideas, keeping a small (admin settable)  
-amount of space aside for bad blocks (distributed evenly  
-across disk?), and using them (nearby if possible)  
-instead of the bad blocks when it does happen.  
-Of course, this is inefficient.  
-Furthermore, the kernel ought to log every time the RAID  
-array starts each bad sector and what is being done about  
-it with a ``crit'' level warning, just to get  
-the administrator to realize that his disk has a piece of  
-dust burrowing into it (or a head with platter sickness).  
-*****  
-  
-  
-****  
-  
-****Software-switchable disks:  
-  
-; __``disable this disk''__:  
-  
-would block until kernel has completed making sure  
-there is no data on the disk being shut down  
-that is needed (e.g., to complete an XOR/ECC/other error  
-correction), then release the disk from use  
-(so it could be removed, etc.);  
-; __``enable this disk''__:  
-  
-would mkraid a new disk if appropriate  
-and then start using it for ECC/whatever operations,  
-enlarging the RAID5 array as it goes;  
-; __``resize array''__:  
-  
-would respecify the total number of disks  
-and the number of redundant disks, and the result  
-would often be to resize the size of the array;  
-where no data loss would result,  
-doing this as needed would be nice,  
-but I have a hard time figuring out how it would do that;  
-in any case, a mode where it would block  
-(for possibly hours (kernel ought to log something every  
-ten seconds if so)) would be necessary;  
-; __``enable this disk while saving data''__:  
-  
-which would save the data on a disk as-is and move it  
-to the RAID5 system as needed, so that a horrific save  
-and restore would not have to happen every time someone  
-brings up a RAID5 system (instead, it may be simpler to  
-only save one partition instead of two,  
-it might fit onto the first as a gzip'd file even);  
-finally,  
-; __``re-enable disk''__:  
-  
-would be an operator's hint to the OS to try out  
-a previously failed disk (it would simply call disable  
-then enable, I suppose).  
-  
-  
-****  
-  
-  
-  
-  
-Other ideas off the net:  
-  
-  
-  
-****finalrd analog to initrd, to simplify root raid.  
-****  
-  
-****a read-only raid mode, to simplify the above  
-****  
-  
-****Mark the RAID set as clean whenever there are no  
-"half writes" done. -- That is, whenever there are no write  
-transactions that were committed on one disk but still  
-unfinished on another disk.  
-Add a "write inactivity" timeout (to avoid frequent seeks  
-to the RAID superblock when the RAID set is relatively  
-busy).  
-  
-****  
-  
-  
-  
-  
-  
-----  
+Describe [HowToSoftwareRAID0 .4xHOWTO ] here.