Diff: HowToSoftwareRAID0.4xHOWTO

Differences between current version and previous revision of HowToSoftwareRAID0.4xHOWTO.


Newer page: version 3 Last edited on Tuesday, October 26, 2004 11:07:56 am by AristotlePagaltzis
Older page: version 2 Last edited on Friday, June 7, 2002 1:07:36 am by perry
@@ -1,3506 +1 @@
-  
-  
-  
-Software-RAID HOWTO  
-  
-  
-  
-----  
-  
-!!!Software-RAID HOWTO  
-  
-!!Linas Vepstas, linas@linas.org, v0.54, 21 November 1998  
-  
-  
-----  
-''RAID stands for ''Redundant Array of Inexpensive Disks'', and  
-is meant to be a way of creating a fast and reliable disk-drive  
-subsystem out of individual disks. RAID can guard against disk failure, and can also improve performance over that of a single disk drive.  
-This document is a tutorial/HOWTO/FAQ for users of  
-the Linux MD kernel extension, the associated tools, and their use.  
-The MD extension implements RAID-0 (striping), RAID-1 (mirroring),  
-RAID-4 and RAID-5 in software. That is, with MD, no special hardware  
-or disk controllers are required to get many of the benefits of RAID.''  
-----  
-  
-  
-  
-  
-; __Preamble__:  
-  
-This document is copyrighted and GPL'ed by Linas Vepstas (linas@linas.org).  
-Permission to use, copy, distribute this document for any purpose is  
-hereby granted, provided that the author's / editor's name and  
-this notice appear in all copies and/or supporting documents; and  
-that an unmodified version of this document is made freely available.  
-This document is distributed in the hope that it will be useful, but  
-WITHOUT ANY WARRANTY, either expressed or implied. While every effort  
-has been taken to ensure the accuracy of the information documented  
-herein, the author / editor / maintainer assumes NO RESPONSIBILITY  
-for any errors, or for any damages, direct or consequential, as a  
-result of the use of the information documented herein.  
-  
-  
-  
-  
-  
-__RAID, although designed to improve system reliability by adding  
-redundancy, can also lead to a false sense of security and confidence  
-when used improperly. This false confidence can lead to even greater  
-disasters. In particular, note that RAID is designed to protect against  
-*disk* failures, and not against *power* failures or *operator*  
-mistakes. Power failures, buggy development kernels, or operator/admin  
-errors can lead to damaged data that is not recoverable!  
-RAID is *not* a substitute for proper backup of your system.  
-Know what you are doing, test, be knowledgeable and aware!__  
-  
-  
-  
-  
-  
-!!1. Introduction  
-  
-  
-  
-  
-!!2. Understanding RAID  
-  
-  
-  
-  
-!!3. Setup & Installation Considerations  
-  
-  
-  
-  
-!!4. Error Recovery  
-  
-  
-  
-  
-!!5. Troubleshooting Install Problems  
-  
-  
-  
-  
-!!6. Supported Hardware & Software  
-  
-  
-  
-  
-!!7. Modifying an Existing Installation  
-  
-  
-  
-  
-!!8. Performance, Tools & General Bone-headed Questions  
-  
-  
-  
-  
-!!9. High Availability RAID  
-  
-  
-  
-  
-!!10. Questions Waiting for Answers  
-  
-  
-  
-  
-!!11. Wish List of Enhancements to MD and Related Software  
-----  
-  
-!!1. Introduction  
-  
-  
-  
-  
-  
-***#__Q__:  
-What is RAID?  
-  
-__A__:  
-RAID stands for "Redundant Array of Inexpensive Disks",  
-and is meant to be a way of creating a fast and reliable disk-drive  
-subsystem out of individual disks. In the PC world, "I" has come to  
-stand for "Independent", where marketing forces continue to  
-differentiate IDE and SCSI. In its original meaning, "I" meant  
-"Inexpensive as compared to refrigerator-sized mainframe  
-3380 DASD", monster drives which made nice houses look cheap,  
-and diamond rings look like trinkets.  
-  
-  
-***#  
-  
-***#__Q__:  
-What is this document?  
-  
-__A__:  
-This document is a tutorial/HOWTO/FAQ for users of the Linux MD  
-kernel extension, the associated tools, and their use.  
-The MD extension implements RAID-0 (striping), RAID-1 (mirroring),  
-RAID-4 and RAID-5 in software. That is, with MD, no special  
-hardware or disk controllers are required to get many of the  
-benefits of RAID.  
-  
-  
-This document is __NOT__ an introduction to RAID;  
-you must find this elsewhere.  
-  
-  
-***#  
-  
-***#__Q__:  
-What levels of RAID does the Linux kernel implement?  
-  
-__A__:  
-Striping (RAID-0) and linear concatenation are a part  
-of the stock 2.x series of kernels. This code is  
-of production quality; it is well understood and well  
-maintained. It is being used in some very large USENET  
-news servers.  
-  
-  
-RAID-1, RAID-4 & RAID-5 are a part of the 2.1.63 and greater  
-kernels. For earlier 2.0.x and 2.1.x kernels, patches exist  
-that will provide this function. Don't feel obligated to  
-upgrade to 2.1.63; upgrading the kernel is hard; it is *much*  
-easier to patch an earlier kernel. Most of the RAID user  
-community is running 2.0.x kernels, and that's where most  
-of the historic RAID development has focused. The current  
-snapshots should be considered near-production quality; that  
-is, there are no known bugs but there are some rough edges and  
-untested system setups. There are a large number of people  
-using Software RAID in a production environment.  
-  
-  
-  
-  
-  
-RAID-1 hot reconstruction has been recently introduced  
-(August 1997) and should be considered alpha quality.  
-RAID-5 hot reconstruction will be alpha quality any day now.  
-  
-  
-  
-  
-  
-A word of caution about the 2.1.x development kernels:  
-these are less than stable in a variety of ways. Some of  
-the newer disk controllers (e.g. the Promise Ultra's) are  
-supported only in the 2.1.x kernels. However, the 2.1.x  
-kernels have seen frequent changes in the block device driver,  
-in the DMA and interrupt code, in the PCI, IDE and SCSI code,  
-and in the disk controller drivers. The combination of  
-these factors, coupled to cheapo hard drives and/or  
-low-quality ribbon cables can lead to considerable  
-heartbreak. The ckraid tool, as well as  
-fsck and mount put considerable stress  
-on the RAID subsystem. This can lead to hard lockups  
-during boot, where even the magic alt-!SysReq key sequence  
-won't save the day. Use caution with the 2.1.x kernels,  
-and expect trouble. Or stick to the 2.0.34 kernel.  
-  
-  
-***#  
-  
-***#__Q__:  
-I'm running an older kernel. Where do I get patches?  
-  
-__A__:  
-Software RAID-0 and linear mode are a stock part of  
-all current Linux kernels. Patches for Software RAID-1,4,5  
-are available from  
-http://luthien.nuclecu.unam.mx/~miguel/raid.  
-See also the quasi-mirror  
-ftp://linux.kernel.org/pub/linux/daemons/raid/  
-for patches, tools and other goodies.  
-  
-  
-***#  
-  
-***#__Q__:  
-Are there other Linux RAID references?  
-  
-__A__:  
-  
-  
-***#*Generic RAID overview:  
-http://www.dpt.com/uraiddoc.html.  
-***#*  
-  
-***#*General Linux RAID options:  
-http://linas.org/linux/raid.html.  
-***#*  
-  
-***#*Latest version of this document:  
-http://linas.org/linux/Software-RAID/Software-RAID.html.  
-***#*  
-  
-***#*Linux-RAID mailing list archive:  
-http://www.linuxhq.com/lnxlists/.  
-***#*  
-  
-***#*Linux Software RAID Home Page:  
-http://luthien.nuclecu.unam.mx/~miguel/raid.  
-***#*  
-  
-***#*Linux Software RAID tools:  
-ftp://linux.kernel.org/pub/linux/daemons/raid/.  
-***#*  
-  
-***#*Setting up linear/striped Software RAID:  
-http://www.ssc.com/lg/issue17/raid.html.  
-***#*  
-  
-***#*Bootable RAID mini-HOWTO:  
-ftp://ftp.bizsystems.com/pub/raid/bootable-raid.  
-***#*  
-  
-***#*Root RAID HOWTO:  
-ftp://ftp.bizsystems.com/pub/raid/Root-RAID-HOWTO.  
-***#*  
-  
-***#*Linux RAID-Geschichten:  
-http://www.infodrom.north.de/~joey/Linux/raid/.  
-***#*  
-  
-  
-  
-***#  
-  
-***#__Q__:  
-Who do I blame for this document?  
-  
-__A__:  
-Linas Vepstas slapped this thing together.  
-However, most of the information,  
-and some of the words were supplied by  
-  
-  
-***#*Bradley Ward Allen <ulmo@Q.Net>  
-***#*  
-  
-***#*Luca Berra <bluca@comedia.it>  
-***#*  
-  
-***#*Brian Candler <B.Candler@pobox.com>  
-***#*  
-  
-***#*Bohumil Chalupa <bochal@apollo.karlov.mff.cuni.cz>  
-***#*  
-  
-***#*Rob Hagopian <hagopiar@vu.union.edu>  
-***#*  
-  
-***#*Anton Hristozov <anton@intransco.com>  
-***#*  
-  
-***#*Miguel de Icaza <miguel@luthien.nuclecu.unam.mx>  
-***#*  
-  
-***#*Marco Meloni <tonno@stud.unipg.it>  
-***#*  
-  
-***#*Ingo Molnar <mingo@pc7537.hil.siemens.at>  
-***#*  
-  
-***#*Alvin Oga <alvin@planet.fef.com>  
-***#*  
-  
-***#*Gadi Oxman <gadio@netvision.net.il>  
-***#*  
-  
-***#*Vaughan Pratt <pratt@cs.Stanford.EDU>  
-***#*  
-  
-***#*Steven A. Reisman <sar@pressenter.com>  
-***#*  
-  
-***#*Michael Robinton <michael@bzs.org>  
-***#*  
-  
-***#*Martin Schulze <joey@finlandia.infodrom.north.de>  
-***#*  
-  
-***#*Geoff Thompson <geofft@cs.waikato.ac.nz>  
-***#*  
-  
-***#*Edward Welbon <welbon@bga.com>  
-***#*  
-  
-***#*Rod Wilkens <rwilkens@border.net>  
-***#*  
-  
-***#*Johan Wiltink <j.m.wiltink@pi.net>  
-***#*  
-  
-***#*Leonard N. Zubkoff <lnz@dandelion.com>  
-***#*  
-  
-***#*Marc ZYNGIER <zyngier@ufr-info-p7.ibp.fr>  
-***#*  
-  
-  
-  
-__Copyrights__  
-  
-  
-***#*Copyright (C) 1994-96 Marc ZYNGIER  
-***#*  
-  
-***#*Copyright (C) 1997 Gadi Oxman, Ingo Molnar, Miguel de Icaza  
-***#*  
-  
-***#*Copyright (C) 1997, 1998 Linas Vepstas  
-***#*  
-  
-***#*By copyright law, additional copyrights are implicitly held  
-by the contributors listed above.  
-***#*  
-  
-  
-  
-Thanks all for being there!  
-  
-  
-***#  
-  
-----  
-  
-!!2. Understanding RAID  
-  
-  
-  
-  
-  
-***#__Q__:  
-What is RAID? Why would I ever use it?  
-  
-__A__:  
-RAID is a way of combining multiple disk drives into a single  
-entity to improve performance and/or reliability. There are  
-a variety of different types and implementations of RAID, each  
-with its own advantages and disadvantages. For example, by  
-putting a copy of the same data on two disks (called  
-__disk mirroring__, or RAID level 1), read performance can be  
-improved by reading alternately from each disk in the mirror.  
-On average, each disk is less busy, as it is handling only  
-1/2 the reads (for two disks), or 1/3 (for three disks), etc.  
-In addition, a mirror can improve reliability: if one disk  
-fails, the other disk(s) have a copy of the data. Different  
-ways of combining the disks into one, referred to as  
-__RAID levels__, can provide greater storage efficiency  
-than simple mirroring, or can alter latency (access-time)  
-performance, or throughput (transfer rate) performance, for  
-reading or writing, while still retaining redundancy that  
-is useful for guarding against failures.  
-  
-  
-__Although RAID can protect against disk failure, it does  
-not protect against operator and administrator (human)  
-error, or against loss due to programming bugs (possibly  
-due to bugs in the RAID software itself). The net abounds with  
-tragic tales of system administrators who have bungled a RAID  
-installation, and have lost all of their data. RAID is not a  
-substitute for frequent, regularly scheduled backup.__  
-  
-  
-RAID can be implemented  
-in hardware, in the form of special disk controllers, or in  
-software, as a kernel module that is layered in between the  
-low-level disk driver, and the file system which sits above it.  
-RAID hardware is always a "disk controller", that is, a device  
-to which one can cable up the disk drives. Usually it comes  
-in the form of an adapter card that will plug into a  
-ISA/EISA/PCI/S-Bus/!MicroChannel slot. However, some RAID  
-controllers are in the form of a box that connects into  
-the cable in between the usual system disk controller, and  
-the disk drives. Small ones may fit into a drive bay; large  
-ones may be built into a storage cabinet with its own drive  
-bays and power supply. The latest RAID hardware used with  
-the latest & fastest CPU will usually provide the best overall  
-performance, although at a significant price. This is because  
-most RAID controllers come with on-board DSP's and memory  
-cache that can off-load a considerable amount of processing  
-from the main CPU, as well as allow high transfer rates into  
-the large controller cache. Old RAID hardware can act as  
-a "de-accelerator" when used with newer CPU's: yesterday's  
-fancy DSP and cache can act as a bottleneck, and it's  
-performance is often beaten by pure-software RAID and new  
-but otherwise plain, run-of-the-mill disk controllers.  
-RAID hardware can offer an advantage over pure-software  
-RAID, if it can makes use of disk-spindle synchronization  
-and its knowledge of the disk-platter position with  
-regard to the disk head, and the desired disk-block.  
-However, most modern (low-cost) disk drives do not offer  
-this information and level of control anyway, and thus,  
-most RAID hardware does not take advantage of it.  
-RAID hardware is usually  
-not compatible across different brands, makes and models:  
-if a RAID controller fails, it must be replaced by another  
-controller of the same type. As of this writing (June 1998),  
-a broad variety of hardware controllers will operate under Linux;  
-however, none of them currently come with configuration  
-and management utilities that run under Linux.  
-  
-  
-Software-RAID is a set of kernel modules, together with  
-management utilities that implement RAID purely in software,  
-and require no extraordinary hardware. The Linux RAID subsystem  
-is implemented as a layer in the kernel that sits above the  
-low-level disk drivers (for IDE, SCSI and Paraport drives),  
-and the block-device interface. The filesystem, be it ext2fs,  
-DOS-FAT, or other, sits above the block-device interface.  
-Software-RAID, by its very software nature, tends to be more  
-flexible than a hardware solution. The downside is that it  
-of course requires more CPU cycles and power to run well  
-than a comparable hardware system. Of course, the cost  
-can't be beat. Software RAID has one further important  
-distinguishing feature: it operates on a partition-by-partition  
-basis, where a number of individual disk partitions are  
-ganged together to create a RAID partition. This is in  
-contrast to most hardware RAID solutions, which gang together  
-entire disk drives into an array. With hardware, the fact that  
-there is a RAID array is transparent to the operating system,  
-which tends to simplify management. With software, there  
-are far more configuration options and choices, tending to  
-complicate matters.  
-  
-  
-__As of this writing (June 1998), the administration of RAID  
-under Linux is far from trivial, and is best attempted by  
-experienced system administrators. The theory of operation  
-is complex. The system tools require modification to startup  
-scripts. And recovery from disk failure is non-trivial,  
-and prone to human error. RAID is not for the novice,  
-and any benefits it may bring to reliability and performance  
-can be easily outweighed by the extra complexity. Indeed,  
-modern disk drives are incredibly reliable and modern  
-CPU's and controllers are quite powerful. You might more  
-easily obtain the desired reliability and performance levels  
-by purchasing higher-quality and/or faster hardware.__  
-  
-  
-***#  
-  
-***#__Q__:  
-What are RAID levels? Why so many? What distinguishes them?  
-  
-__A__:  
-The different RAID levels have different performance,  
-redundancy, storage capacity, reliability and cost  
-characteristics. Most, but not all levels of RAID  
-offer redundancy against disk failure. Of those that  
-offer redundancy, RAID-1 and RAID-5 are the most popular.  
-RAID-1 offers better performance, while RAID-5 provides  
-for more efficient use of the available storage space.  
-However, tuning for performance is an entirely different  
-matter, as performance depends strongly on a large variety  
-of factors, from the type of application, to the sizes of  
-stripes, blocks, and files. The more difficult aspects of  
-performance tuning are deferred to a later section of this HOWTO.  
-  
-  
-The following describes the different RAID levels in the  
-context of the Linux software RAID implementation.  
-  
-  
-  
-  
-  
-***#*__RAID-linear__  
-is a simple concatenation of partitions to create  
-a larger virtual partition. It is handy if you have a number  
-small drives, and wish to create a single, large partition.  
-This concatenation offers no redundancy, and in fact  
-decreases the overall reliability: if any one disk  
-fails, the combined partition will fail.  
-  
-  
-  
-  
-  
-  
-  
-***#*  
-  
-***#*__RAID-1__ is also referred to as "mirroring".  
-Two (or more) partitions, all of the same size, each store  
-an exact copy of all data, disk-block by disk-block.  
-Mirroring gives strong protection against disk failure:  
-if one disk fails, there is another with an exact copy  
-of the same data. Mirroring can also help improve  
-performance in I/O-laden systems, as read requests can  
-be divided up between several disks. Unfortunately,  
-mirroring is also the least efficient in terms of storage:  
-two mirrored partitions can store no more data than a  
-single partition.  
-  
-  
-  
-  
-  
-  
-  
-***#*  
-  
-***#*__Striping__ is the underlying concept behind all of  
-the other RAID levels. A stripe is a contiguous sequence  
-of disk blocks. A stripe may be as short as a single disk  
-block, or may consist of thousands. The RAID drivers  
-split up their component disk partitions into stripes;  
-the different RAID levels differ in how they organize the  
-stripes, and what data they put in them. The interplay  
-between the size of the stripes, the typical size of files  
-in the file system, and their location on the disk is what  
-determines the overall performance of the RAID subsystem.  
-  
-  
-  
-  
-  
-  
-  
-***#*  
-  
-***#*__RAID-0__ is much like RAID-linear, except that  
-the component partitions are divided into stripes and  
-then interleaved. Like RAID-linear, the result is a single  
-larger virtual partition. Also like RAID-linear, it offers  
-no redundancy, and therefore decreases overall reliability:  
-a single disk failure will knock out the whole thing.  
-RAID-0 is often claimed to improve performance over the  
-simpler RAID-linear. However, this may or may not be true,  
-depending on the characteristics of the file system, the  
-typical size of the file as compared to the size of the  
-stripe, and the type of workload. The ext2fs  
-file system already scatters files throughout a partition,  
-in an effort to minimize fragmentation. Thus, at the  
-simplest level, any given access may go to one of several  
-disks, and thus, the interleaving of stripes across multiple  
-disks offers no apparent additional advantage. However,  
-there are performance differences, and they are data,  
-workload, and stripe-size dependent.  
-  
-  
-  
-  
-  
-  
-  
-***#*  
-  
-***#*__RAID-4__ interleaves stripes like RAID-0, but  
-it requires an additional partition to store parity  
-information. The parity is used to offer redundancy:  
-if any one of the disks fail, the data on the remaining disks  
-can be used to reconstruct the data that was on the failed  
-disk. Given N data disks, and one parity disk, the  
-parity stripe is computed by taking one stripe from each  
-of the data disks, and XOR'ing them together (a small worked example follows this list). Thus,  
-the storage capacity of an (N+1)-disk RAID-4 array  
-is N, which is a lot better than mirroring (N+1) drives,  
-and is almost as good as a RAID-0 setup for large N.  
-Note that for N=1, where there is one data drive, and one  
-parity drive, RAID-4 is a lot like mirroring, in that  
-each of the two disks is a copy of each other. However,  
-RAID-4 does __NOT__ offer the read-performance  
-of mirroring, and offers considerably degraded write  
-performance. In brief, this is because updating the  
-parity requires a read of the old parity, before the new  
-parity can be calculated and written out. In an  
-environment with lots of writes, the parity disk can become  
-a bottleneck, as each write must access the parity disk.  
-  
-  
-  
-  
-  
-  
-  
-***#*  
-  
-***#*__RAID-5__ avoids the write-bottleneck of RAID-4  
-by alternately storing the parity stripe on each of the  
-drives. However, write performance is still not as good  
-as for mirroring, as the parity stripe must still be read  
-and XOR'ed before it is written. Read performance is  
-also not as good as it is for mirroring, as, after all,  
-there is only one copy of the data, not two or more.  
-RAID-5's principle advantage over mirroring is that it  
-offers redundancy and protection against single-drive  
-failure, while offering far more storage capacity when  
-used with three or more drives.  
-  
-  
-  
-  
-  
-  
-  
-***#*  
-  
-***#*__RAID-2 and RAID-3__ are seldom used anymore, and  
-to some degree have been made obsolete by modern disk  
-technology. RAID-2 is similar to RAID-4, but stores  
-ECC information instead of parity. Since all modern disk  
-drives incorporate ECC under the covers, this offers  
-little additional protection. RAID-2 can offer greater  
-data consistency if power is lost during a write; however,  
-battery backup and a clean shutdown can offer the same  
-benefits. RAID-3 is similar to RAID-4, except that it  
-uses the smallest possible stripe size. As a result, any  
-given read will involve all disks, making overlapping  
-I/O requests difficult/impossible. In order to avoid  
-delay due to rotational latency, RAID-3 requires that  
-all disk drive spindles be synchronized. Most modern  
-disk drives lack spindle-synchronization ability, or,  
-if capable of it, lack the needed connectors, cables,  
-and manufacturer documentation. Neither RAID-2 nor RAID-3  
-are supported by the Linux Software-RAID drivers.  
-  
-  
-  
-  
-  
-  
-  
-***#*  
-  
-***#*__Other RAID levels__ have been defined by various  
-researchers and vendors. Many of these represent the  
-layering of one type of raid on top of another. Some  
-require special hardware, and others are protected by  
-patent. There is no commonly accepted naming scheme  
-for these other levels. Sometimes the advantages of these  
-other systems are minor, or at least not apparent  
-until the system is highly stressed. Except for the  
-layering of RAID-1 over RAID-0/linear, Linux Software  
-RAID does not support any of the other variations.  
-  
-  
-  
-  
-***#*  
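-  
-To make the parity arithmetic concrete, here is a tiny worked example in shell (a sketch only: the byte values are invented, and a real array of course computes parity over whole stripes, not single bytes):  
-  
-# XOR parity across three hypothetical data bytes  
-d1=$(( 0x0F )) ; d2=$(( 0x33 )) ; d3=$(( 0x55 ))  
-parity=$(( d1 ^ d2 ^ d3 ))  
-# if the disk holding d2 fails, its contents can be rebuilt  
-# from the parity and the surviving data:  
-rebuilt=$(( parity ^ d1 ^ d3 ))  
-printf 'parity=0x%02X  rebuilt d2=0x%02X\n' $parity $rebuilt  
-  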
-  
-  
-  
-***#  
-  
-----  
-  
-!!3. Setup & Installation Considerations  
-  
-  
-  
-  
-  
-***#__Q__:  
-What is the best way to configure Software RAID?  
-  
-__A__:  
-I keep rediscovering that file-system planning is one  
-of the more difficult Unix configuration tasks.  
-To answer your question, I can describe what we did.  
-We planned the following setup:  
-  
-  
-***#*two EIDE disks, 2.1 gig each.  
-  
-  
-disk  partition  mount pt.  size  device  
-1     1          /          300M  /dev/hda1  
-1     2          swap       64M   /dev/hda2  
-1     3          /home      800M  /dev/hda3  
-1     4          /var       900M  /dev/hda4  
-2     1          /root      300M  /dev/hdc1  
-2     2          swap       64M   /dev/hdc2  
-2     3          /home      800M  /dev/hdc3  
-2     4          /var       900M  /dev/hdc4  
-  
-  
-  
-***#*  
-  
-***#*Each disk is on a separate controller (& ribbon cable).  
-The theory is that a controller failure and/or  
-ribbon failure won't disable both disks.  
-Also, we might possibly get a performance boost  
-from parallel operations over two controllers/cables.  
-  
-***#*  
-  
-***#*Install the Linux kernel on the root (/)  
-partition /dev/hda1. Mark this partition as  
-bootable.  
-  
-***#*  
-  
-***#*/dev/hdc1 will contain a ``cold'' copy of  
-/dev/hda1. This is NOT a raid copy,  
-just a plain old copy-copy. It's there just in  
-case the first disk fails; we can use a rescue disk,  
-mark /dev/hdc1 as bootable, and use that to  
-keep going without having to reinstall the system.  
-You may even want to put /dev/hdc1's copy  
-of the kernel into LILO to simplify booting in case of  
-failure.  
-The theory here is that in case of severe failure,  
-I can still boot the system without worrying about  
-raid superblock-corruption or other raid failure modes  
-& gotchas that I don't understand.  
-  
-***#*  
-  
-***#*/dev/hda3 and /dev/hdc3 will be mirrored as  
-/dev/md0.  
-***#*  
-  
-***#*/dev/hda4 and /dev/hdc4 will be mirrored as  
-/dev/md1.  
-  
-***#*  
-  
-***#*we picked /var and /home to be mirrored,  
-and in separate partitions, using the following logic:  
-  
-  
-***#**/ (the root partition) will contain  
-relatively static, non-changing data:  
-for all practical purposes, it will be  
-read-only without actually being marked &  
-mounted read-only.  
-***#**  
-  
-***#**/home will contain ''slowly'' changing  
-data.  
-***#**  
-  
-***#**/var will contain rapidly changing data,  
-including mail spools, database contents and  
-web server logs.  
-***#**  
-  
-The idea behind using multiple, distinct partitions is  
-that __if__, for some bizarre reason,  
-whether it is human error, power loss, or an operating  
-system gone wild, corruption is limited to one partition.  
-In one typical case, power is lost while the  
-system is writing to disk. This will almost certainly  
-lead to a corrupted filesystem, which will be repaired  
-by fsck during the next boot. Although  
-fsck does its best to make the repairs  
-without creating additional damage during those repairs,  
-it can be comforting to know that any such damage has been  
-limited to one partition. In another typical case,  
-the sysadmin makes a mistake during rescue operations,  
-leading to erased or destroyed data. Partitions can  
-help limit the repercussions of the operator's errors.  
-***#*  
-  
-***#*Other reasonable choices for partitions might be  
-/usr or /opt. In fact, /opt  
-and /home would make great choices for RAID-5  
-partitions if we had more disks. A word of caution:  
-__DO NOT__ put /usr in a RAID-5  
-partition. If a serious fault occurs, you may find  
-that you cannot mount /usr, and that  
-you want some of the tools on it (e.g. the networking  
-tools, or the compiler.) With RAID-1, if a fault has  
-occurred, and you can't get RAID to work, you can at  
-least mount one of the two mirrors. You can't do this  
-with any of the other RAID levels (RAID-5, striping, or  
-linear append).  
-  
-***#*  
-  
-  
-  
-So, to complete the answer to the question:  
-  
-  
-***#*install the OS on disk 1, partition 1.  
-do NOT mount any of the other partitions.  
-***#*  
-  
-***#*install RAID per instructions.  
-***#*  
-  
-***#*configure md0 and md1.  
-***#*  
-  
-***#*convince yourself that you know  
-what to do in case of a disk failure!  
-Discover sysadmin mistakes now,  
-and not during an actual crisis.  
-Experiment!  
-(we turned off power during disk activity;  
-this proved to be ugly but informative).  
-***#*  
-  
-***#*do some ugly mount/copy/unmount/rename/reboot scheme to  
-move /var over to /dev/md1 (a sketch of one such sequence follows this list).  
-Done carefully, this is not dangerous.  
-***#*  
-  
-***#*enjoy!  
-***#*  
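-  
-One way to perform that copy step, sketched under the assumption that /dev/md1 is already running and that the machine has been dropped to single-user mode so nothing is writing to /var:  
-  
-mke2fs /dev/md1              # create a file system on the new mirror  
-mount /dev/md1 /mnt  
-cp -a /var/. /mnt/           # copy everything, preserving permissions and dotfiles  
-umount /mnt  
-mv /var /var.old             # keep the original until the mirror has proven itself  
-mkdir /var  
-mount /dev/md1 /var          # and add /dev/md1 /var to /etc/fstab before rebooting  
-  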
-  
-  
-  
-***#  
-  
-***#__Q__:  
-What is the difference between the mdadd, mdrun,  
-''etc.'' commands, and the raidadd, raidrun  
-commands?  
-  
-__A__:  
-The names of the tools have changed as of the 0.5 release of the  
-raidtools package. The md naming convention was used  
-in the 0.43 and older versions, while raid is used in  
-0.5 and newer versions.  
-  
-  
-***#  
-  
-***#__Q__:  
-I want to run RAID-linear/RAID-0 in the stock 2.0.34 kernel.  
-I don't want to apply the raid patches, since these are not  
-needed for RAID-0/linear. Where can I get the raid-tools  
-to manage this?  
-  
-__A__:  
-This is a tough question, indeed, as the newest raid tools  
-package needs to have the RAID-1,4,5 kernel patches installed  
-in order to compile. I am not aware of any pre-compiled, binary  
-version of the raid tools that is available at this time.  
-However, experiments show that the raid-tools binaries, when  
-compiled against kernel 2.1.100, seem to work just fine  
-in creating a RAID-0/linear partition under 2.0.34. A brave  
-soul has asked for these, and I've __temporarily__  
-placed the binaries mdadd, mdcreate, etc.  
-at http://linas.org/linux/Software-RAID/  
-You must get the man pages, etc. from the usual raid-tools  
-package.  
-  
-  
-***#  
-  
-***#__Q__:  
-Can I stripe/mirror the root partition (/)?  
-Why can't I boot Linux directly from the md disks?  
-  
-__A__:  
-Both LILO and Loadlin need a non-striped/mirrored partition  
-to read the kernel image from. If you want to stripe/mirror  
-the root partition (/),  
-then you'll want to create an unstriped/mirrored partition  
-to hold the kernel(s).  
-Typically, this partition is named /boot.  
-Then you either use the initial ramdisk support (initrd),  
-or patches from Harald Hoyer <HarryH@Royal.Net>  
-that allow a striped partition to be used as the root  
-device. (These patches are now a standard part of recent  
-2.1.x kernels)  
-  
-  
-There are several approaches that can be used.  
-One approach is documented in detail in the  
-Bootable RAID mini-HOWTO:  
-ftp://ftp.bizsystems.com/pub/raid/bootable-raid.  
-  
-  
-  
-  
-  
-Alternately, use mkinitrd to build the ramdisk image,  
-see below.  
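-  
-For example, on distributions that ship a mkinitrd script, something along these lines builds a suitable image (the exact options and paths vary between distributions, so treat this purely as a sketch):  
-  
-mkinitrd /boot/initrd-raid.img 2.0.35   # build an initrd for the 2.0.35 kernel  
-# then add initrd=/boot/initrd-raid.img to the kernel stanza in /etc/lilo.conf  
-lilo  
-  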
-  
-  
-  
-  
-  
-Edward Welbon <welbon@bga.com>  
-writes:  
-  
-  
-***#*... all that is needed is a script to manage the boot setup.  
-To mount an md filesystem as root,  
-the main thing is to build an initial file system image  
-that has the needed modules and md tools to start md.  
-I have a simple script that does this.  
-***#*  
-  
-  
-  
-***#*For boot media, I have a small __cheap__ SCSI disk  
-(170MB; I got it used for $20).  
-This disk runs on an AHA1452, but it could just as well be an  
-inexpensive IDE disk on the native IDE.  
-The disk need not be very fast since it is mainly for boot.  
-***#*  
-  
-  
-  
-***#*This disk has a small file system which contains the kernel and  
-the file system image for initrd.  
-The initial file system image has just enough stuff to allow me  
-to load the raid SCSI device driver module and start the  
-raid partition that will become root.  
-I then do an  
-  
-  
-echo 0x900 > /proc/sys/kernel/real-root-dev  
-  
-  
-(0x900 is for /dev/md0)  
-and exit linuxrc.  
-The boot proceeds normally from there.  
-***#*  
-  
-  
-  
-***#*I have built most support as a module except for the AHA1452  
-driver that brings in the initrd filesystem.  
-So I have a fairly small kernel. The method is perfectly  
-reliable, I have been doing this since before 2.1.26 and  
-have never had a problem that I could not easily recover from.  
-The file systems even survived several 2.1.44/45 hard  
-crashes with no real problems.  
-***#*  
-  
-  
-  
-***#*At one time I had partitioned the raid disks so that the initial  
-cylinders of the first raid disk held the kernel and the initial  
-cylinders of the second raid disk held the initial file system  
-image; instead, I made the initial cylinders of the raid disks  
-swap, since they are the fastest cylinders  
-(why waste them on boot?).  
-***#*  
-  
-  
-  
-***#*The nice thing about having an inexpensive device dedicated to  
-boot is that it is easy to boot from and can also serve as  
-a rescue disk if necessary. If you are interested,  
-you can take a look at the script that builds my initial  
-ram disk image and then runs LILO.  
-  
-http://www.realtime.net/~welbon/initrd.md.tar.gz  
-It is current enough to show the picture.  
-It isn't especially pretty and it could certainly build  
-a much smaller filesystem image for the initial ram disk.  
-It would be easy to make it more efficient.  
-But it uses LILO as is.  
-If you make any improvements, please forward a copy to me. 8-)  
-***#*  
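-  
-A minimal linuxrc along the lines Edward describes might look roughly as follows. This is only a sketch: the module path, device names and RAID level are assumptions, and the md tools and module must have been copied into the initrd image beforehand.  
-  
-#!/bin/sh  
-# load the RAID personality and assemble the array that will become root  
-/sbin/insmod /lib/raid1.o  
-/sbin/mdadd /dev/md0 /dev/sda1 /dev/sdb1  
-/sbin/mdrun -p1 /dev/md0  
-# 0x900 is the device number of /dev/md0; the kernel mounts it as root on exit  
-echo 0x900 > /proc/sys/kernel/real-root-dev  
-  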
-  
-  
-  
-***#  
-  
-***#__Q__:  
-I have heard that I can run mirroring over striping. Is this true?  
-Can I run mirroring over the loopback device?  
-  
-__A__:  
-Yes, but not the reverse. That is, you can put a stripe over  
-several disks, and then build a mirror on top of this. However,  
-striping cannot be put on top of mirroring.  
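-  
-A sketch of such a layered setup with four disks, assuming that -p0 selects the striping personality (by analogy with the -p1 and -p5 flags used elsewhere in this document):  
-  
-# two stripe sets...  
-mdadd /dev/md0 /dev/sda1 /dev/sdb1  
-mdrun -p0 /dev/md0  
-mdadd /dev/md1 /dev/sdc1 /dev/sdd1  
-mdrun -p0 /dev/md1  
-# ...mirrored against each other  
-mdadd /dev/md2 /dev/md0 /dev/md1  
-mdrun -p1 /dev/md2  
-mke2fs /dev/md2  
-  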
-  
-  
-A brief technical explanation is that the linear and stripe  
-personalities use the ll_rw_blk routine for access.  
-The ll_rw_blk routine  
-maps disk devices and sectors, not blocks. Block devices can be  
-layered one on top of the other; but devices that do raw, low-level  
-disk accesses, such as ll_rw_blk, cannot.  
-  
-  
-  
-  
-  
-Currently (November 1997) RAID cannot be run over the  
-loopback devices, although this should be fixed shortly.  
-  
-  
-***#  
-  
-***#__Q__:  
-I have two small disks and three larger disks. Can I  
-concatenate the two smaller disks with RAID-0, and then create  
-a RAID-5 out of that and the larger disks?  
-  
-__A__:  
-Currently (November 1997), for a RAID-5 array, no.  
-Currently, one can do this only for a RAID-1 on top of the  
-concatenated drives.  
-  
-  
-***#  
-  
-***#__Q__:  
-What is the difference between RAID-1 and RAID-5 for a two-disk  
-configuration (i.e. the difference between a RAID-1 array built  
-out of two disks, and a RAID-5 array built out of two disks)?  
-  
-__A__:  
-There is no difference in storage capacity. Nor can disks be  
-added to either array to increase capacity (see the question below for  
-details).  
-  
-  
-RAID-1 offers a performance advantage for reads: the RAID-1  
-driver uses distributed-read technology to simultaneously read  
-two sectors, one from each drive, thus doubling read performance.  
-  
-  
-  
-  
-  
-The RAID-5 driver, although it contains many optimizations, does not  
-currently (September 1997) realize that the parity disk is actually  
-a mirrored copy of the data disk. Thus, it serializes data reads.  
-  
-  
-***#  
-  
-***#__Q__:  
-How can I guard against a two-disk failure?  
-  
-__A__:  
-Some of the RAID algorithms do guard against multiple disk  
-failures, but these are not currently implemented for Linux.  
-However, the Linux Software RAID can guard against multiple  
-disk failures by layering an array on top of an array. For  
-example, nine disks can be used to create three raid-5 arrays.  
-Then these three arrays can in turn be hooked together into  
-a single RAID-5 array on top. In fact, this kind of a  
-configuration will guard against a three-disk failure. Note that  
-a large amount of disk space is ''wasted'' on the redundancy  
-information.  
-  
-  
-For an NxN raid-5 array,  
-N=3, 5 out of 9 disks are used for parity (=55%)  
-N=4, 7 out of 16 disks  
-N=5, 9 out of 25 disks  
-...  
-N=9, 17 out of 81 disks (=~20%)  
-  
-  
-In general, an MxN array will use M+N-1 disks for parity.  
-The least amount of space is "wasted" when M=N.  
-  
-  
-Another alternative is to create a RAID-1 array with  
-three disks. Note that since all three disks contain  
-identical data, that 2/3's of the space is ''wasted''.  
-  
-  
-  
-  
-  
-***#  
-  
-***#__Q__:  
-I'd like to understand how it'd be possible to have something  
-like fsck: if the partition hasn't been cleanly unmounted,  
-fsck runs and fixes the filesystem by itself more than  
-90% of the time. Since the machine is capable of fixing it  
-by itself with ckraid --fix, why not make it automatic?  
-  
-__A__:  
-This can be done by adding lines like the following to  
-/etc/rc.d/rc.sysinit:  
-  
-mdadd /dev/md0 /dev/hda1 /dev/hdc1 || {  
-ckraid --fix /etc/raid.usr.conf  
-mdadd /dev/md0 /dev/hda1 /dev/hdc1  
-}  
-  
-or  
-  
-mdrun -p1 /dev/md0  
-if [[ $? -gt 0 ] ; then  
-ckraid --fix /etc/raid1.conf  
-mdrun -p1 /dev/md0  
-fi  
-  
-Before presenting a more complete and reliable script,  
-lets review the theory of operation.  
-Gadi Oxman writes:  
-In an unclean shutdown, Linux might be in one of the following states:  
-  
-  
-***#*The in-memory disk cache was in sync with the RAID set when  
-the unclean shutdown occurred; no data was lost.  
-  
-***#*  
-  
-***#*The in-memory disk cache was newer than the RAID set contents  
-when the crash occurred; this results in a corrupted filesystem  
-and potentially in data loss.  
-This state can be further divided to the following two states:  
-  
-  
-***#**Linux was writing data when the unclean shutdown occurred.  
-***#**  
-  
-***#**Linux was not writing data when the crash occurred.  
-***#**  
-  
-  
-***#*  
-  
-Suppose we were using a RAID-1 array. In (2a), it might happen that  
-before the crash, a small number of data blocks were successfully  
-written only to some of the mirrors, so that on the next reboot,  
-the mirrors will no longer contain the same data.  
-If we were to ignore the mirror differences, the raidtools-0.36.3  
-read-balancing code  
-might choose to read the above data blocks from any of the mirrors,  
-which will result in inconsistent behavior (for example, the output  
-of e2fsck -n /dev/md0 can differ from run to run).  
-  
-  
-Since RAID doesn't protect against unclean shutdowns, usually  
-there isn't any ''obviously correct'' way to fix the mirror  
-differences and the filesystem corruption.  
-  
-  
-For example, by default ckraid --fix will choose  
-the first operational mirror and update the other mirrors  
-with its contents. However, depending on the exact timing  
-at the crash, the data on another mirror might be more recent,  
-and we might want to use it as the source  
-mirror instead, or perhaps use another method for recovery.  
-  
-  
-The following script provides one of the more robust  
-boot-up sequences. In particular, it guards against  
-long, repeated ckraid's in the presence  
-of uncooperative disks, controllers, or controller device  
-drivers. Modify it to reflect your config,  
-and copy it to rc.raid.init. Then invoke  
-rc.raid.init after the root partition has been  
-fsck'ed and mounted rw, but before the remaining partitions  
-are fsck'ed. Make sure the current directory is in the search  
-path.  
-  
-mdadd /dev/md0 /dev/hda1 /dev/hdc1 || {  
-rm -f /fastboot # force an fsck to occur  
-ckraid --fix /etc/raid.usr.conf  
-mdadd /dev/md0 /dev/hda1 /dev/hdc1  
-}  
-# if a crash occurs later in the boot process,  
-# we at least want to leave this md in a clean state.  
-/sbin/mdstop /dev/md0  
-mdadd /dev/md1 /dev/hda2 /dev/hdc2 || {  
-rm -f /fastboot # force an fsck to occur  
-ckraid --fix /etc/raid.home.conf  
-mdadd /dev/md1 /dev/hda2 /dev/hdc2  
-}  
-# if a crash occurs later in the boot process,  
-# we at least want to leave this md in a clean state.  
-/sbin/mdstop /dev/md1  
-mdadd /dev/md0 /dev/hda1 /dev/hdc1  
-mdrun -p1 /dev/md0  
-if [[ $? -gt 0 ] ; then  
-rm -f /fastboot # force an fsck to occur  
-ckraid --fix /etc/raid.usr.conf  
-mdrun -p1 /dev/md0  
-fi  
-# if a crash occurs later in the boot process,  
-# we at least want to leave this md in a clean state.  
-/sbin/mdstop /dev/md0  
-mdadd /dev/md1 /dev/hda2 /dev/hdc2  
-mdrun -p1 /dev/md1  
-if [[ $? -gt 0 ] ; then  
-rm -f /fastboot # force an fsck to occur  
-ckraid --fix /etc/raid.home.conf  
-mdrun -p1 /dev/md1  
-fi  
-# if a crash occurs later in the boot process,  
-# we at least want to leave this md in a clean state.  
-/sbin/mdstop /dev/md1  
-# OK, just blast through the md commands now. If there were  
-# errors, the above checks should have fixed things up.  
-/sbin/mdadd /dev/md0 /dev/hda1 /dev/hdc1  
-/sbin/mdrun -p1 /dev/md0  
-/sbin/mdadd /dev/md1 /dev/hda2 /dev/hdc2  
-/sbin/mdrun -p1 /dev/md1  
-  
-In addition to the above, you'll want to create a  
-rc.raid.halt which should look like the following:  
-  
-/sbin/mdstop /dev/md0  
-/sbin/mdstop /dev/md1  
-  
-Be sure to modify both rc.sysinit and  
-init.d/halt to include this everywhere that  
-filesystems get unmounted before a halt/reboot. (Note  
-that rc.sysinit unmounts and reboots if fsck  
-returned with an error.)  
-  
-  
-  
-  
-  
-***#  
-  
-***#__Q__:  
-Can I set up one-half of a RAID-1 mirror with the one disk I have  
-now, and then later get the other disk and just drop it in?  
-  
-__A__:  
-With the current tools, no, not in any easy way. In particular,  
-you cannot just copy the contents of one disk onto another,  
-and then pair them up. This is because the RAID drivers  
-use a glob of space at the end of the partition to store the  
-superblock. This decreases the amount of space available to  
-the file system slightly; if you just naively try to force  
-a RAID-1 arrangement onto a partition with an existing  
-filesystem, the  
-raid superblock will overwrite a portion of the file system  
-and mangle data. Since the ext2fs filesystem scatters  
-files randomly throughout the partition (in order to avoid  
-fragmentation), there is a very good chance that some file will  
-land at the very end of a partition long before the disk is  
-full.  
-  
-  
-If you are clever, I suppose you can calculate how much room  
-the RAID superblock will need, and make your filesystem  
-slightly smaller, leaving room for it when you add it later.  
-But then, if you are this clever, you should also be able to  
-modify the tools to do this automatically for you.  
-(The tools are not terribly complex).  
-  
-  
-  
-  
-  
-__Note:__ A careful reader has pointed out that the  
-following trick may work; I have not tried or verified this:  
-Do the mkraid with /dev/null as one of the  
-devices. Then mdadd -r with only the single, true  
-disk (do not mdadd /dev/null). The mkraid  
-should have successfully built the raid array, while the  
-mdadd step just forces the system to run in "degraded" mode,  
-as if one of the disks had failed.  
-  
-  
-***#  
-  
-----  
-  
-!!4. Error Recovery  
-  
-  
-  
-  
-  
-***#__Q__:  
-I have a RAID-1 (mirroring) setup, and lost power  
-while there was disk activity. Now what do I do?  
-  
-__A__:  
-The redundancy of RAID levels is designed to protect against a  
-__disk__ failure, not against a __power__ failure.  
-There are several ways to recover from this situation.  
-  
-  
-***#*Method (1): Use the raid tools. These can be used to sync  
-the raid arrays. They do not fix file-system damage; after  
-the raid arrays are sync'ed, then the file-system still has  
-to be fixed with fsck. Raid arrays can be checked with  
-ckraid /etc/raid1.conf (for RAID-1, else,  
-/etc/raid5.conf, etc.)  
-Calling ckraid /etc/raid1.conf --fix will pick one of the  
-disks in the array (usually the first), and use that as the  
-master copy, and copy its blocks to the others in the mirror.  
-To designate which of the disks should be used as the master,  
-you can use the --force-source flag: for example,  
-ckraid /etc/raid1.conf --fix --force-source /dev/hdc3  
-The ckraid command can be safely run without the --fix  
-option  
-to verify the inactive RAID array without making any changes.  
-When you are comfortable with the proposed changes, supply  
-the --fix option.  
-  
-***#*  
-  
-***#*Method (2): Paranoid, time-consuming, not much better than the  
-first way. Lets assume a two-disk RAID-1 array, consisting of  
-partitions /dev/hda3 and /dev/hdc3. You can  
-try the following:  
-  
-  
-***#*#fsck /dev/hda3  
-***#*#  
-  
-***#*#fsck /dev/hdc3  
-***#*#  
-  
-***#*#decide which of the two partitions had fewer errors,  
-or were more easily recovered, or recovered the data  
-that you wanted. Pick one, either one, to be your new  
-``master'' copy. Say you picked /dev/hdc3.  
-***#*#  
-  
-***#*#dd if=/dev/hdc3 of=/dev/hda3  
-***#*#  
-  
-***#*#mkraid raid1.conf -f --only-superblock  
-***#*#  
-  
-Instead of the last two steps, you can instead run  
-ckraid /etc/raid1.conf --fix --force-source /dev/hdc3  
-which should be a bit faster.  
-  
-***#*  
-  
-***#*Method (3): Lazy man's version of above. If you don't want to  
-wait for long fsck's to complete, it is perfectly fine to skip  
-the first three steps above, and move directly to the last  
-two steps.  
-Just be sure to run fsck /dev/md0 after you are done.  
-Method (3) is actually just method (1) in disguise.  
-***#*  
-  
-In any case, the above steps will only sync up the raid arrays.  
-The file system probably needs fixing as well: for this,  
-fsck needs to be run on the active, unmounted md device.  
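-  
-For example, for a RAID-1 array built from /dev/hda3 and /dev/hdc3 (a sketch; substitute your own devices and mount point):  
-  
-mdadd /dev/md0 /dev/hda3 /dev/hdc3  
-mdrun -p1 /dev/md0  
-fsck /dev/md0     # repair the file system on the assembled, still-unmounted array  
-mount /dev/md0 /home  
-  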
-  
-  
-With a three-disk RAID-1 array, there are more possibilities,  
-such as using two disks to ''vote'' a majority answer. Tools  
-to automate this do not currently (September 97) exist.  
-  
-  
-***#  
-  
-***#__Q__:  
-I have a RAID-4 or a RAID-5 (parity) setup, and lost power while  
-there was disk activity. Now what do I do?  
-  
-__A__:  
-The redundancy of RAID levels is designed to protect against a  
-__disk__ failure, not against a __power__ failure.  
-Since the disks in a RAID-4 or RAID-5 array do not contain a file  
-system that fsck can read, there are fewer repair options. You  
-cannot use fsck to do preliminary checking and/or repair; you must  
-use ckraid first.  
-  
-  
-The ckraid command can be safely run without the  
---fix option  
-to verify the inactive RAID array without making any changes.  
-When you are comfortable with the proposed changes, supply  
-the --fix option.  
-  
-  
-  
-  
-  
-If you wish, you can try designating one of the disks as a ''failed  
-disk''. Do this with the --suggest-failed-disk-mask flag.  
-  
-  
-Only one bit should be set in the flag: RAID-5 cannot recover two  
-failed disks.  
-The mask is a binary bit mask: thus:  
-  
-0x1 == first disk  
-0x2 == second disk  
-0x4 == third disk  
-0x8 == fourth disk, etc.  
-  
-  
-  
-Alternately, you can choose to modify the parity sectors, by using  
-the --suggest-fix-parity flag. This will recompute the  
-parity from the other sectors.  
-  
-  
-  
-  
-  
-The flags --suggest-failed-disk-mask and  
---suggest-fix-parity  
-can be safely used for verification. No changes are made if the  
---fix flag is not specified. Thus, you can experiment with  
-different possible repair schemes.  
-  
-  
-  
-  
-  
-***#  
-  
-***#__Q__:  
-My RAID-1 device, /dev/md0 consists of two hard drive  
-partitions: /dev/hda3 and /dev/hdc3.  
-Recently, the disk with /dev/hdc3 failed,  
-and was replaced with a new disk. My best friend,  
-who doesn't understand RAID, said that the correct thing to do now  
-is to ''dd if=/dev/hda3 of=/dev/hdc3''.  
-I tried this, but things still don't work.  
-  
-__A__:  
-You should keep your best friend away from your computer.  
-Fortunately, no serious damage has been done.  
-You can recover from this by running:  
-  
-  
-mkraid raid1.conf -f --only-superblock  
-  
-  
-By using dd, two identical copies of the partition  
-were created. This is almost correct, except that the RAID-1  
-kernel extension expects the RAID superblocks to be different.  
-Thus, when you try to reactivate RAID, the software will notice  
-the problem, and deactivate one of the two partitions.  
-By re-creating the superblock, you should have a fully usable  
-system.  
-  
-  
-***#  
-  
-***#__Q__:  
-My version of mkraid doesn't have a  
---only-superblock flag. What do I do?  
-  
-__A__:  
-The newer tools drop support for this flag, replacing it with  
-the --force-resync flag. It has been reported  
-that the following sequence appears to work with the latest tools  
-and software:  
-  
-  
-umount /web (or wherever /dev/md0 was mounted)  
-raidstop /dev/md0  
-mkraid /dev/md0 --force-resync --really-force  
-raidstart /dev/md0  
-  
-  
-After doing this, a cat /proc/mdstat should report  
-resync in progress, and one should be able to  
-mount /dev/md0 at this point.  
-  
-  
-***#  
-  
-***#__Q__:  
-My RAID-1 device, /dev/md0 consists of two hard drive  
-partitions: /dev/hda3 and /dev/hdc3.  
-My best (girl?)friend, who doesn't understand RAID,  
-ran fsck on /dev/hda3 while I wasn't looking,  
-and now the RAID won't work. What should I do?  
-  
-__A__:  
-You should re-examine your concept of ``best friend''.  
-In general, fsck should never be run on the individual  
-partitions that compose a RAID array.  
-Assuming that neither of the partitions are/were heavily damaged,  
-no data loss has occurred, and the RAID-1 device can be recovered  
-as follows:  
-  
-  
-***##make a backup of the file system on /dev/hda3  
-***##  
-  
-***##dd if=/dev/hda3 of=/dev/hdc3  
-***##  
-  
-***##mkraid raid1.conf -f --only-superblock  
-***##  
-  
-This should leave you with a working disk mirror.  
-  
-  
-***#  
-  
-***#__Q__:  
-Why does the above work as a recovery procedure?  
-  
-__A__:  
-Because each of the component partitions in a RAID-1 mirror  
-is a perfectly valid copy of the file system. In a pinch,  
-mirroring can be disabled, and one of the partitions  
-can be mounted and safely run as an ordinary, non-RAID  
-file system. When you are ready to restart using RAID-1,  
-then unmount the partition, and follow the above  
-instructions to restore the mirror. Note that the above  
-works ONLY for RAID-1, and not for any of the other levels.  
-  
-  
-It may make you feel more comfortable to reverse the direction  
-of the copy above: copy __from__ the disk that was untouched  
-__to__ the one that was. Just be sure to fsck the final md.  
-  
-  
-***#  
-  
-***#__Q__:  
-I am confused by the above questions, but am not yet bailing out.  
-Is it safe to run fsck /dev/md0 ?  
-  
-__A__:  
-Yes, it is safe to run fsck on the md devices.  
-In fact, this is the __only__ safe place to run fsck.  
-  
-  
-***#  
-  
-***#__Q__:  
-If a disk is slowly failing, will it be obvious which one it is?  
-I am concerned that it won't be, and this confusion could lead to  
-some dangerous decisions by a sysadmin.  
-  
-__A__:  
-Once a disk fails, an error code will be returned from  
-the low level driver to the RAID driver.  
-The RAID driver will mark it as ``bad'' in the RAID superblocks  
-of the ``good'' disks (so we will later know which mirrors are  
-good and which aren't), and continue RAID operation  
-on the remaining operational mirrors.  
-  
-  
-This, of course, assumes that the disk and the low level driver  
-can detect a read/write error, and will not silently corrupt data,  
-for example. This is true of current drives  
-(error detection schemes are being used internally),  
-and is the basis of RAID operation.  
-  
-  
-***#  
-  
-***#__Q__:  
-What about hot-repair?  
-  
-__A__:  
-Work is underway to complete ``hot reconstruction''.  
-With this feature, one can add several ``spare'' disks to  
-the RAID set (be it level 1 or 4/5), and once a disk fails,  
-it will be reconstructed on one of the spare disks in run time,  
-without ever needing to shut down the array.  
-  
-  
-However, to use this feature, the spare disk must have  
-been declared at boot time, or it must be hot-added,  
-which requires the use of special cabinets and connectors  
-that allow a disk to be added while the electrical power is  
-on.  
-  
-  
-  
-  
-  
-As of October 97, there is a beta version of MD that  
-allows:  
-  
-  
-***#*RAID 1 and 5 reconstruction on spare drives  
-***#*  
-  
-***#*RAID-5 parity reconstruction after an unclean  
-shutdown  
-***#*  
-  
-***#*spare disk to be hot-added to an already running  
-RAID 1 or 4/5 array  
-***#*  
-  
-Automatic reconstruction is currently (Dec 97)  
-disabled by default, due to the preliminary nature of this  
-work. It can be enabled by changing the value of  
-SUPPORT_RECONSTRUCTION in  
-include/linux/md.h.  
-  
-  
-  
-  
-  
-If spare drives were configured into the array when it  
-was created and kernel-based reconstruction is enabled,  
-the spare drive will already contain a RAID superblock  
-(written by mkraid), and the kernel will  
-reconstruct its contents automatically (without needing  
-the usual mdstop, replace drive, ckraid,  
-mdrun steps).  
-  
-  
-  
-  
-  
-If you are not running automatic reconstruction, and have  
-not configured a hot-spare disk, the procedure described by  
-Gadi Oxman <gadio@netvision.net.il>  
-is recommended:  
-  
-  
-***#*Currently, once the first disk is removed, the RAID set will be  
-running in degraded mode. To restore full operation mode,  
-you need to:  
-  
-  
-***#**stop the array (mdstop /dev/md0)  
-***#**  
-  
-***#**replace the failed drive  
-***#**  
-  
-***#**run ckraid raid.conf to reconstruct its contents  
-***#**  
-  
-***#**run the array again (mdadd, mdrun).  
-***#**  
-  
-At this point, the array will be running with all the drives,  
-and again protects against a failure of a single drive.  
-***#*  
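-  
-Spelled out as commands, that sequence might look like this (a sketch only, assuming a three-disk RAID-5 array described by /etc/raid5.conf and mounted on /home; adjust devices and paths to your own setup):  
-  
-umount /home  
-mdstop /dev/md0  
-# physically replace the failed drive and partition it as before, then:  
-ckraid --fix /etc/raid5.conf  
-mdadd /dev/md0 /dev/hda3 /dev/hdc3 /dev/hdd3  
-mdrun -p5 /dev/md0  
-fsck /dev/md0 && mount /dev/md0 /home  
-  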
-  
-  
-  
-Currently, it is not possible to assign a single hot-spare disk  
-to several arrays. Each array requires its own hot-spare.  
-  
-  
-***#  
-  
-***#__Q__:  
-I would like to have an audible alarm for  
-``you schmuck, one disk in the mirror is down'',  
-so that the novice sysadmin knows that there is a problem.  
-  
-__A__:  
-The kernel is logging the event with a  
-``KERN_ALERT'' priority in syslog.  
-There are several software packages that will monitor the  
-syslog files, and beep the PC speaker, call a pager, send e-mail,  
-etc. automatically.  
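-  
-If nothing fancier is at hand, even a crude shell loop can serve as an alarm (a sketch only; the log file name and the strings to match depend on your syslog configuration and kernel version):  
-  
-#!/bin/sh  
-# watch the system log and beep the console whenever an md/RAID message appears  
-tail -f /var/log/messages | while read line ; do  
-    case "$line" in  
-        *raid*|*md0*) printf '\a' > /dev/console ;;  
-    esac  
-done  
-  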
-  
-  
-***#  
-  
-***#__Q__:  
-How do I run RAID-5 in degraded mode  
-(with one disk failed, and not yet replaced)?  
-  
-__A__:  
-Gadi Oxman <gadio@netvision.net.il>  
-writes:  
-Normally, to run a RAID-5 set of n drives you have to:  
-  
-  
-mdadd /dev/md0 /dev/disk1 ... /dev/disk(n)  
-mdrun -p5 /dev/md0  
-  
-  
-Even if one of the disks has failed,  
-you still have to mdadd it as you would in a normal setup.  
-(?? try using /dev/null in place of the failed disk ???  
-watch out)  
-Then the array will be active in degraded mode with (n - 1) drives.  
-If ``mdrun'' fails, the kernel has noticed an error  
-(for example, several faulty drives, or an unclean shutdown).  
-Use ``dmesg'' to display the kernel error messages from  
-``mdrun''.  
-If the raid-5 set is corrupted due to a power loss,  
-rather than a disk crash, one can try to recover by  
-creating a new RAID superblock:  
-  
-  
-mkraid -f --only-superblock raid5.conf  
-  
-  
-A RAID array doesn't provide protection against a power failure or  
-a kernel crash, and can't guarantee correct recovery.  
-Rebuilding the superblock will simply cause the system to ignore  
-the condition by marking all the drives as ``OK'',  
-as if nothing happened.  
-  
-  
-***#  
-  
-***#__Q__:  
-How does RAID-5 work when a disk fails?  
-  
-__A__:  
-The typical operating scenario is as follows:  
-  
-  
-***#*A RAID-5 array is active.  
-  
-***#*  
-  
-***#*One drive fails while the array is active.  
-  
-***#*  
-  
-***#*The drive firmware and the low-level Linux disk/controller  
-drivers detect the failure and report an error code to the  
-MD driver.  
-  
-***#*  
-  
-***#*The MD driver continues to provide an error-free  
-/dev/md0  
-device to the higher levels of the kernel (with a performance  
-degradation) by using the remaining operational drives.  
-  
-***#*  
-  
-***#*The sysadmin can umount /dev/md0 and  
-mdstop /dev/md0 as usual.  
-  
-***#*  
-  
-***#*If the failed drive is not replaced, the sysadmin can still  
-start the array in degraded mode as usual, by running  
-mdadd and mdrun.  
-***#*  
-  
-  
-  
-***#  
-  
-***#__Q__:  
-  
-__A__:  
-  
-  
-***#  
-  
-***#__Q__:  
-Why is there no question 13?  
-  
-__A__:  
-If you are concerned about RAID, High Availability, and UPS,  
-then it's probably a good idea to be superstitious as well.  
-It can't hurt, can it?  
-  
-  
-***#  
-  
-***#__Q__:  
-I just replaced a failed disk in a RAID-5 array. After  
-rebuilding the array, fsck is reporting many, many  
-errors. Is this normal?  
-  
-__A__:  
-No. And, unless you ran fsck in "verify only; do not update"  
-mode, it's quite possible that you have corrupted your data.  
-Unfortunately, a not-uncommon scenario is one of  
-accidentally changing the disk order in a RAID-5 array,  
-after replacing a hard drive. Although the RAID superblock  
-stores the proper order, not all tools use this information.  
-In particular, the current version of ckraid  
-will use the information specified with the -f  
-flag (typically, the file /etc/raid5.conf)  
-instead of the data in the superblock. If the specified  
-order is incorrect, then the replaced disk will be  
-reconstructed incorrectly. The symptom of this  
-kind of mistake seems to be heavy & numerous fsck  
-errors.  
-  
-  
-And, in case you are wondering, __yes__, someone lost  
-__all__ of their data by making this mistake. Making  
-a tape backup of __all__ data before reconfiguring a  
-RAID array is __strongly recommended__.  
-  
-  
-***#  
-  
-***#__Q__:  
-The !QuickStart says that mdstop is just to make sure that the  
-disks are sync'ed. Is this REALLY necessary? Isn't unmounting the  
-file systems enough?  
-  
-__A__:  
-The command mdstop /dev/md0 will do the following (a short usage example appears after the list):  
-  
-  
-***#*mark it ''clean''. This allows us to detect unclean shutdowns, for  
-example due to a power failure or a kernel crash.  
-  
-***#*  
-  
-***#*sync the array. This is less important after unmounting a  
-filesystem, but is important if the /dev/md0 is  
-accessed directly rather than through a filesystem (for  
-example, by e2fsck).  
-***#*  
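-  
-A minimal sketch of the clean shutdown sequence described above:  
-  
-umount /dev/md0  
-mdstop /dev/md0  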
-  
-  
-  
-***#  
-  
-----  
-  
-!!5. Troubleshooting Install Problems  
-  
-  
-  
-  
-  
-***#__Q__:  
-What is the current best known-stable patch for RAID in the  
-2.0.x series kernels?  
-  
-__A__:  
-As of 18 Sept 1997, it is  
-"2..30 + pre-9 2..31 + Werner Fink's swapping patch  
-+ the alpha RAID patch". As of November 1997, it is  
-2..31 + ... !?  
-  
-  
-***#  
-  
-***#__Q__:  
-The RAID patches will not install cleanly for me. What's wrong?  
-  
-__A__:  
-Make sure that /usr/include/linux is a symbolic link to  
-/usr/src/linux/include/linux.  
-Make sure that the new files raid5.c, etc.  
-have been copied to their correct locations. Sometimes  
-the patch command will not create new files. Try the  
--f flag on patch.  
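-  
-A sketch of the checks described above (the patch file name and the  
-patch level -p1 are placeholders; use whatever matches your patch):  
-  
-ls -l /usr/include/linux     # should be a symlink into the kernel source tree  
-rm /usr/include/linux        # safe only if it is a stale symlink  
-ln -s /usr/src/linux/include/linux /usr/include/linux  
-cd /usr/src/linux  
-patch -p1 -f < ../raid-patch # -f, as suggested above, when new files are not created  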
-  
-  
-***#  
-  
-***#__Q__:  
-While compiling raidtools 0.42, compilation stops trying to  
-include <pthread.h> but it doesn't exist in my system.  
-How do I fix this?  
-  
-__A__:  
-raidtools-0.42 requires linuxthreads-0.6 from:  
-ftp://ftp.inria.fr/INRIA/Projects/cristal/Xavier.Leroy  
-Alternately, use glibc v2.0.  
-  
-  
-***#  
-  
-***#__Q__:  
-I get the message: mdrun -a /dev/md0: Invalid argument  
-  
-__A__:  
-Use mkraid to initialize the RAID set prior to the first use.  
-mkraid ensures that the RAID array is initially in a  
-consistent state by erasing the RAID partitions. In addition,  
-mkraid will create the RAID superblocks.  
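-  
-For example, for a RAID-5 set described by /etc/raid5.conf (adjust  
-the config file name for your setup):  
-  
-mkraid /etc/raid5.conf      # writes the superblocks; destroys existing data  
-mdrun -p5 /dev/md0  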
-  
-  
-***#  
-  
-***#__Q__:  
-I get the message: mdrun -a /dev/md0: Invalid argument  
-The setup was:  
-  
-  
-***#*raid build as a kernel module  
-***#*  
-  
-***#*normal install procedure followed ... mdcreate, mdadd, etc.  
-***#*  
-  
-***#*cat /proc/mdstat shows  
-  
-Personalities :  
-read_ahead not set  
-md0 : inactive sda1 sdb1 6313482 blocks  
-md1 : inactive  
-md2 : inactive  
-md3 : inactive  
-  
-  
-***#*  
-  
-***#*mdrun -a generates the error message  
-/dev/md0: Invalid argument  
-***#*  
-  
-  
-__A__:  
-Try lsmod (or, alternately, cat  
-/proc/modules) to see if the raid modules are loaded.  
-If they are not, you can load them explicitly with  
-the modprobe raid1 or modprobe raid5  
-command. Alternately, if you are using the autoloader,  
-and expected kerneld to load them but it didn't,  
-this is probably because your loader is missing the info to  
-load the modules. Edit /etc/conf.modules and add  
-the following lines:  
-  
-alias md-personality-3 raid1  
-alias md-personality-4 raid5  
-  
-  
-  
-***#  
-  
-***#__Q__:  
-While doing mdadd -a I get the error:  
-/dev/md0: No such file or directory. Indeed, there  
-seems to be no /dev/md0 anywhere. Now what do I do?  
-  
-__A__:  
-The raid-tools package will create these devices when  
-you run make install as root. Alternately,  
-you can do the following:  
-  
-cd /dev  
-./MAKEDEV md  
-  
-  
-  
-***#  
-  
-***#__Q__:  
-After creating a raid array on /dev/md0,  
-I try to mount it and get the following error:  
- mount: wrong fs type, bad option, bad superblock on /dev/md0,  
-or too many mounted file systems. What's wrong?  
-  
-__A__:  
-You need to create a file system on /dev/md0  
-before you can mount it. Use mke2fs.  
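-  
-For example (the mount point is illustrative):  
-  
-mke2fs /dev/md0  
-mount /dev/md0 /mnt  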
-  
-  
-***#  
-  
-***#__Q__:  
-Truxton Fulton wrote:  
-  
-On my Linux 2.0.30 system, while doing a mkraid for a  
-RAID-1 device,  
-during the clearing of the two individual partitions, I got  
-"Cannot allocate free page" errors appearing on the console,  
-and "Unable to handle kernel paging request at virtual address ..."  
-errors in the system log. At this time, the system became quite  
-unusable, but it appears to recover after a while. The operation  
-appears to have completed with no other errors, and I am  
-successfully using my RAID-1 device. The errors are disconcerting  
-though. Any ideas?  
-  
-  
-__A__:  
-This was a well-known bug in the 2.0.30 kernels. It is fixed in  
-the 2.0.31 kernel; alternately, fall back to 2.0.29.  
-  
-  
-***#  
-  
-***#__Q__:  
-I'm not able to mdrun a RAID-1, RAID-4 or RAID-5 device.  
-If I try to mdrun a mdadd'ed device I get  
-the message ''invalid raid superblock magic''.  
-  
-__A__:  
-Make sure that you've run the mkraid part of the install  
-procedure.  
-  
-  
-***#  
-  
-***#__Q__:  
-When I access /dev/md0, the kernel spits out a  
-lot of errors like md0: device not running, giving up !  
-and I/O error.... I've successfully added my devices to  
-the virtual device.  
-  
-__A__:  
-To be usable, the device must be running. Use  
-mdrun -px /dev/md0 where x is l for linear, 0 for  
-RAID-0 or 1 for RAID-1, etc.  
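-  
-For example, for a two-disk mirror (the device names are illustrative):  
-  
-mdadd /dev/md0 /dev/sda1 /dev/sdb1  
-mdrun -p1 /dev/md0  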
-  
-  
-***#  
-  
-***#__Q__:  
-I've created a linear md-dev with 2 devices.  
-cat /proc/mdstat shows  
-the total size of the device, but df only shows the size of the first  
-physical device.  
-  
-__A__:  
-You must mkfs your new md-dev before using it  
-the first time, so that the filesystem will cover the whole device.  
-  
-  
-***#  
-  
-***#__Q__:  
-I've set up /etc/mdtab using mdcreate, I've  
-mdadd'ed, mdrun and fsck'ed  
-my two /dev/mdX partitions. Everything looks  
-okay before a reboot. As soon as I reboot, I get an  
-fsck error on both partitions: fsck.ext2: Attempt to read block from filesystem resulted in short  
-read while trying to open /dev/md0. Why?! How do  
-I fix it?!  
-  
-__A__:  
-During the boot process, the RAID partitions must be started  
-before they can be fsck'ed. This must be done  
-in one of the boot scripts. For some distributions,  
-fsck is called from /etc/rc.d/rc.S, for others,  
-it is called from /etc/rc.d/rc.sysinit. Change this  
-file so that mdadd -ar is run *before* fsck -A  
-is executed. Better yet, it is suggested that  
-ckraid be run if mdadd returns with an  
-error. How to do this is discussed in greater detail in  
-question 14 of the section ''Error Recovery''.  
-  
-  
-***#  
-  
-***#__Q__:  
-I get the message invalid raid superblock magic while  
-trying to run an array which consists of partitions which are  
-bigger than 4GB.  
-  
-__A__:  
-This bug is now fixed. (September 97) Make sure you have the latest  
-raid code.  
-  
-  
-***#  
-  
-***#__Q__:  
-I get the message Warning: could not write 8 blocks in inode table starting at 2097175 while trying to run mke2fs on  
-a partition which is larger than 2GB.  
-  
-__A__:  
-This seems to be a problem with mke2fs  
-(November 97). A temporary work-around is to get the mke2fs  
-code, and add #undef HAVE_LLSEEK to  
-e2fsprogs-1.10/lib/ext2fs/llseek.c just before the  
-first #ifdef HAVE_LLSEEK and recompile mke2fs.  
-  
-  
-***#  
-  
-***#__Q__:  
-ckraid currently isn't able to read /etc/mdtab  
-  
-__A__:  
-The RAID0/linear configuration file format used in  
-/etc/mdtab is obsolete, although it will be supported  
-for a while more. The current, up-to-date config files  
-are currently named /etc/raid1.conf, etc.  
-  
-  
-***#  
-  
-***#__Q__:  
-The personality modules (raid1.o) are not loaded automatically;  
-they have to be manually modprobe'd before mdrun. How can this  
-be fixed?  
-  
-__A__:  
-To autoload the modules, we can add the following to  
-/etc/conf.modules:  
-  
-alias md-personality-3 raid1  
-alias md-personality-4 raid5  
-  
-  
-  
-***#  
-  
-***#__Q__:  
-I've mdadd'ed 13 devices, and now I'm trying to  
-mdrun -p5 /dev/md0 and get the message:  
-/dev/md0: Invalid argument  
-  
-__A__:  
-The default configuration for software RAID is 8 real  
-devices. Edit linux/md.h, change  
-#define MAX_REAL 8 to a larger number, and  
-rebuild the kernel.  
-  
-  
-***#  
-  
-***#__Q__:  
-I can't make md work with partitions on our  
-latest SPARCstation 5. I suspect that this has something  
-to do with disk-labels.  
-  
-__A__:  
-Sun disk-labels sit in the first 1K of a partition.  
-For RAID-1, the Sun disk-label is not an issue since  
-ext2fs will skip the label on every mirror.  
-For other raid levels (0, linear and 4/5), this  
-appears to be a problem; it has not yet (Dec 97) been  
-addressed.  
-  
-  
-***#  
-  
-----  
-  
-!!6. Supported Hardware & Software  
-  
-  
-  
-  
-  
-***#__Q__:  
-I have SCSI adapter brand XYZ (with or without several channels),  
-and disk brand(s) PQR and LMN, will these work with md to create  
-a linear/striped/mirrored personality?  
-  
-__A__:  
-Yes! Software RAID will work with any disk controller (IDE  
-or SCSI) and any disks. The disks do not have to be identical,  
-nor do the controllers. For example, a RAID mirror can be  
-created with one half the mirror being a SCSI disk, and the  
-other an IDE disk. The disks do not even have to be the same  
-size. There are no restrictions on the mixing & matching of  
-disks and controllers.  
-  
-  
-This is because Software RAID works with disk partitions, not  
-with the raw disks themselves. The only recommendation is that  
-for RAID levels 1 and 5, the disk partitions that are used as part  
-of the same set be the same size. If the partitions used to make  
-up the RAID 1 or 5 array are not the same size, then the excess  
-space in the larger partitions is wasted (not used).  
-  
-  
-***#  
-  
-***#__Q__:  
-I have a twin channel BT-952, and the box states that it supports  
-hardware RAID 0, 1 and 0+1. I have made a RAID set with two  
-drives, the card apparently recognizes them when it's doing its  
-BIOS startup routine. I've been reading in the driver source code,  
-but found no reference to the hardware RAID support. Anybody out  
-there working on that?  
-  
-__A__:  
-The Mylex/!BusLogic !FlashPoint boards with RAIDPlus are  
-actually software RAID, not hardware RAID at all. RAIDPlus  
-is only supported on Windows 95 and Windows NT, not on  
-Netware or any of the Unix platforms. Aside from booting and  
-configuration, the RAID support is actually in the OS drivers.  
-  
-  
-While in theory Linux support for RAIDPlus is possible, the  
-implementation of RAID-0/1/4/5 in the Linux kernel is much  
-more flexible and should have superior performance, so  
-there's little reason to support RAIDPlus directly.  
-  
-  
-***#  
-  
-***#__Q__:  
-I want to run RAID with an SMP box. Is RAID SMP-safe?  
-  
-__A__:  
-"I think so" is the best answer available at the time I write  
-this (April 98). A number of users report that they have been  
-using RAID with SMP for nearly a year, without problems.  
-However, as of April 98 (circa kernel 2.1.9x), the following  
-problems have been noted on the mailing list:  
-  
-  
-***#*Adaptec AIC7xxx SCSI drivers are not SMP safe  
-(General note: Adaptec adapters have a long  
-& lengthly history  
-of problems & flakiness in general. Although  
-they seem to be the most easily available, widespread  
-and cheapest SCSI adapters, they should be avoided.  
-After factoring for time lost, frustration, and  
-corrupted data, Adaptecs will prove to be the  
-costliest mistake you'll ever make. That said,  
-if you have SMP problems with 2.1.88, try the patch  
-ftp://ftp.bero-online.ml.org/pub/linux/aic7xxx-5..7-linux21.tar.gz  
-I am not sure if this patch has been pulled into later  
-2.1.x kernels.  
-For further info, take a look at the mail archives for  
-March 98 at  
-http://www.linuxhq.com/lnxlists/linux-raid/lr_9803_01/  
-As usual, due to the rapidly changing nature of the  
-latest experimental 2.1.x kernels, the problems  
-described in these mailing lists may or may not have  
-been fixed by the time you read this. Caveat Emptor.  
-)  
-  
-***#*  
-  
-***#*IO-APIC with RAID-0 on SMP has been reported  
-to crash in 2.1.90  
-***#*  
-  
-  
-  
-***#  
-  
-----  
-  
-!!7. Modifying an Existing Installation  
-  
-  
-  
-  
-  
-***#__Q__:  
-Are linear MD's expandable?  
-Can a new hard-drive/partition be added,  
-and the size of the existing file system expanded?  
-  
-__A__:  
-Miguel de Icaza  
-<  
-miguel@luthien.nuclecu.unam.mx>  
-writes:  
-  
-I changed the ext2fs code to be aware of multiple-devices  
-instead of the regular one device per file system assumption.  
-  
-  
-So, when you want to extend a file system,  
-you run a utility program that makes the appropriate changes  
-on the new device (your extra partition) and then you just tell  
-the system to extend the fs using the specified device.  
-  
-  
-  
-  
-  
-You can extend a file system with new devices at system operation  
-time, no need to bring the system down  
-(and whenever I get some extra time, you will be able to remove  
-devices from the ext2 volume set, again without even having  
-to go to single-user mode or any hack like that).  
-  
-  
-  
-  
-  
-You can get the patch for 2.1.x kernel from my web page:  
-  
-http://www.nuclecu.unam.mx/~miguel/ext2-volume  
-  
-  
-  
-***#  
-  
-***#__Q__:  
-Can I add disks to a RAID-5 array?  
-  
-__A__:  
-Currently, (September 1997) no, not without erasing all  
-data. A conversion utility to allow this does not yet exist.  
-The problem is that the actual structure and layout  
-of a RAID-5 array depends on the number of disks in the array.  
-Of course, one can add drives by backing up the array to tape,  
-deleting all data, creating a new array, and restoring from  
-tape.  
-  
-  
-***#  
-  
-***#__Q__:  
-What would happen to my RAID1/RAID0 sets if I shift one  
-of the drives from being /dev/hdb to /dev/hdc?  
-Because of cabling/case size/stupidity issues, I had to  
-make my RAID sets on the same IDE controller (/dev/hda  
-and /dev/hdb). Now that I've fixed some stuff, I want  
-to move /dev/hdb to /dev/hdc.  
-What would happen if I just change the /etc/mdtab and  
-/etc/raid1.conf files to reflect the new location?  
-  
-__A__:  
-For RAID-0/linear, one must be careful to specify the  
-drives in exactly the same order. Thus, in the above  
-example, if the original config is  
-  
-  
-mdadd /dev/md0 /dev/hda /dev/hdb  
-  
-  
-Then the new config *must* be  
-  
-  
-mdadd /dev/md0 /dev/hda /dev/hdc  
-  
-  
-For RAID-1/4/5, the drive's ''RAID number'' is stored in  
-its RAID superblock, and therefore the order in which the  
-disks are specified is not important.  
-RAID-0/linear does not have a superblock due to its older  
-design, and the desire to maintain backwards compatibility  
-with this older design.  
-  
-  
-***#  
-  
-***#__Q__:  
-Can I convert a two-disk RAID-1 mirror to a three-disk RAID-5 array?  
-  
-__A__:  
-Yes. Michael at !BizSystems has come up with a clever,  
-sneaky way of doing this. However, like virtually all  
-manipulations of RAID arrays once they have data on  
-them, it is dangerous and prone to human error.  
-__Make a backup before you start__.  
-  
-I will make the following assumptions:  
----------------------------------------------  
-disks  
-original: hda - hdc  
-raid1 partitions hda3 - hdc3  
-array name /dev/md0  
-new hda - hdc - hdd  
-raid5 partitions hda3 - hdc3 - hdd3  
-array name: /dev/md1  
-You must substitute the appropriate disk and partition numbers for  
-your system configuration. This will hold true for all config file  
-examples.  
---------------------------------------------  
-DO A BACKUP BEFORE YOU DO ANYTHING  
-1) recompile kernel to include both raid1 and raid5  
-2) install new kernel and verify that raid personalities are present  
-3) disable the redundant partition on the raid 1 array. If this is a  
-root mounted partition (mine was) you must be more careful.  
-Reboot the kernel without starting raid devices or boot from rescue  
-system ( raid tools must be available )  
-start non-redundant raid1  
-mdadd -r -p1 /dev/md0 /dev/hda3  
-4) configure raid5 but with a 'funny' config file; note that there is  
-no hda3 entry and hdc3 is repeated. This trick is needed because the  
-raid tools would not otherwise let you do this.  
--------------------------------  
-# raid-5 configuration  
-raiddev /dev/md1  
-raid-level 5  
-nr-raid-disks 3  
-chunk-size 32  
-# Parity placement algorithm  
-parity-algorithm left-symmetric  
-# Spare disks for hot reconstruction  
-nr-spare-disks 0  
-device /dev/hdc3  
-raid-disk 0  
-device /dev/hdc3  
-raid-disk 1  
-device /dev/hdd3  
-raid-disk 2  
----------------------------------------  
-mkraid /etc/raid5.conf  
-5) activate the raid5 array in non-redundant mode  
-mdadd -r -p5 -c32k /dev/md1 /dev/hdc3 /dev/hdd3  
-6) make a file system on the array  
-mke2fs -b {blocksize} /dev/md1  
-The blocksize recommended by some is 4096 rather than the default 1024.  
-This improves the memory utilization for the kernel raid routines and  
-matches the blocksize to the page size. I compromised and used 2048  
-since I have a relatively high number of small files on my system.  
-7) mount the two raid devices somewhere  
-mount -t ext2 /dev/md0 mnt0  
-mount -t ext2 /dev/md1 mnt1  
-8) move the data  
-cp -a mnt0 mnt1  
-9) verify that the data sets are identical  
-10) stop both arrays  
-11) correct the information for the raid5.conf file  
-change /dev/md1 to /dev/md0  
-change the first disk to read /dev/hda3  
-12) upgrade the new array to full redundant status  
-(THIS DESTROYS REMAINING raid1 INFORMATION)  
-ckraid --fix /etc/raid5.conf  
-  
-  
-  
-***#  
-  
-----  
-  
-!!8. Performance, Tools & General Bone-headed Questions  
-  
-  
-  
-  
-  
-***#__Q__:  
-I've created a RAID-0 device on /dev/sda2 and  
-/dev/sda3. The device is a lot slower than a  
-single partition. Isn't md a pile of junk?  
-  
-__A__:  
-To have a RAID-0 device running at full speed, you must  
-have partitions from different disks. Besides, putting  
-the two halves of the mirror on the same disk fails to  
-give you any protection whatsoever against disk failure.  
-  
-  
-***#  
-  
-***#__Q__:  
-What's the use of having RAID-linear when RAID-0 will do the  
-same thing, but provide higher performance?  
-  
-__A__:  
-It's not obvious that RAID-0 will always provide better  
-performance; in fact, in some cases, it could make things  
-worse.  
-The ext2fs file system scatters files all over a partition,  
-while it tries to keep all of the blocks of each file  
-contiguous, basically in an attempt to prevent fragmentation.  
-Thus, ext2fs behaves "as if" there were a (variable-sized)  
-stripe per file. If there are several disks concatenated  
-into a single RAID-linear, this will result in files being  
-statistically distributed on each of the disks. Thus,  
-at least for ext2fs, RAID-linear will behave a lot like  
-RAID-0 with large stripe sizes. Conversely, RAID-0  
-with small stripe sizes can cause excessive disk activity  
-leading to severely degraded performance if several large files  
-are accessed simultaneously.  
-  
-  
-In many cases, RAID-0 can be an obvious win. For example,  
-imagine a large database file. Since ext2fs attempts to  
-cluster together all of the blocks of a file, chances  
-are good that it will end up on only one drive if RAID-linear  
-is used, but will get chopped into lots of stripes if RAID-0 is  
-used. Now imagine a number of (kernel) threads all trying  
-to randomly access this database. Under RAID-linear, all  
-accesses would go to one disk, which would not be as efficient  
-as the parallel accesses that RAID-0 entails.  
-  
-  
-***#  
-  
-***#__Q__:  
-How does RAID-0 handle a situation where the different stripe  
-partitions are different sizes? Are the stripes uniformly  
-distributed?  
-  
-__A__:  
-To understand this, let's look at an example with three  
-partitions; one that is 50MB, one 90MB and one 125MB.  
-Let's call D0 the 50MB disk, D1 the 90MB disk and D2 the 125MB  
-disk. When you start the device, the driver calculates 'strip  
-zones'. In this case, it finds 3 zones, defined like this:  
-  
-Z0 : (D0/D1/D2) 3 x 50 = 150MB total in this zone  
-Z1 : (D1/D2) 2 x 40 = 80MB total in this zone  
-Z2 : (D2) 125-50-40 = 35MB total in this zone.  
-  
-You can see that the total size of the zones is the size of the  
-virtual device, but, depending on the zone, the striping is  
-different. Z2 is rather inefficient, since there's only one  
-disk.  
-Since ext2fs and most other Unix  
-file systems distribute files all over the disk, you  
-have a 35/265 = 13% chance that a file will end up  
-on Z2, and not get any of the benefits of striping.  
-(DOS tries to fill a disk from beginning to end, and thus,  
-the oldest files would end up on Z0. However, this  
-strategy leads to severe filesystem fragmentation,  
-which is why no one besides DOS does it this way.)  
-  
-  
-***#  
-  
-***#__Q__:  
-I have some Brand X hard disks and a Brand Y controller,  
-and am considering using md.  
-Does it significantly increase the throughput?  
-Is the performance really noticeable?  
-  
-__A__:  
-The answer depends on the configuration that you use.  
-  
-  
-  
-  
-; __Linux MD RAID-0 and RAID-linear performance:__:  
-  
-If the system is heavily loaded with lots of I/O,  
-statistically, some of it will go to one disk, and  
-some to the others. Thus, performance will improve  
-over a single large disk. The actual improvement  
-depends a lot on the actual data, stripe sizes, and  
-other factors. In a system with low I/O usage,  
-the performance is equal to that of a single disk.  
-  
-  
-  
-  
-  
-  
-; __Linux MD RAID-1 (mirroring) read performance:__:  
-  
-MD implements read balancing. That is, the RAID-1  
-code will alternate between each of the (two or more)  
-disks in the mirror, making alternate reads to each.  
-In a low-I/O situation, this won't change performance  
-at all: you will have to wait for one disk to complete  
-the read.  
-But, with two disks in a high-I/O environment,  
-this could as much as double the read performance,  
-since reads can be issued to each of the disks in parallel.  
-For N disks in the mirror, this could improve performance  
-N-fold.  
-  
-  
-  
-; __Linux MD RAID-1 (mirroring) write performance:__:  
-  
-A write must wait until it has completed on all of the disks  
-in the mirror. This is because a copy of the data  
-must be written to each of the disks in the mirror.  
-Thus, performance will be roughly equal to the write  
-performance to a single disk.  
-  
-  
-  
-; __Linux MD RAID-4/5 read performance:__:  
-  
-Statistically, a given block can be on any one of a number  
-of disk drives, and thus RAID-4/5 read performance is  
-a lot like that for RAID-0. It will depend on the data, the  
-stripe size, and the application. It will not be as good  
-as the read performance of a mirrored array.  
-  
-  
-  
-; __Linux MD RAID-4/5 write performance:__:  
-  
-This will in general be considerably slower than that for  
-a single disk. This is because the parity must be written  
-out to one drive as well as the data to another. However,  
-in order to compute the new parity, the old parity and  
-the old data must be read first. The old data, new data and  
-old parity must all be XOR'ed together to determine the new  
-parity: this requires considerable CPU cycles in addition  
-to the numerous disk accesses.  
-  
-  
-  
-***#  
-  
-***#__Q__:  
-What RAID configuration should I use for optimal performance?  
-  
-__A__:  
-Is the goal to maximize throughput, or to minimize latency?  
-There is no easy answer, as there are many factors that  
-affect performance:  
-  
-  
-***#*operating system - will one process/thread, or many  
-be performing disk access?  
-***#*  
-  
-***#*application - is it accessing data in a  
-sequential fashion, or random access?  
-***#*  
-  
-***#*file system - clusters files or spreads them out  
-(the ext2fs clusters together the blocks of a file,  
-and spreads out files)  
-***#*  
-  
-***#*disk driver - number of blocks to read ahead  
-(this is a tunable parameter)  
-***#*  
-  
-***#*CEC hardware - one drive controller, or many?  
-***#*  
-  
-***#*hd controller - able to queue multiple requests or not?  
-Does it provide a cache?  
-***#*  
-  
-***#*hard drive - buffer cache memory size -- is it big  
-enough to handle the write sizes and rate you want?  
-***#*  
-  
-***#*physical platters - blocks per cylinder -- accessing  
-blocks on different cylinders will lead to seeks.  
-***#*  
-  
-  
-  
-***#  
-  
-***#__Q__:  
-What is the optimal RAID-5 configuration for performance?  
-  
-__A__:  
-Since RAID-5 experiences an I/O load that is equally  
-distributed  
-across several drives, the best performance will be  
-obtained when the RAID set is balanced by using  
-identical drives, identical controllers, and the  
-same (low) number of drives on each controller.  
-Note, however, that using identical components will  
-raise the probability of multiple simultaneous failures,  
-for example due to a sudden jolt or drop, overheating,  
-or a power surge during an electrical storm. Mixing  
-brands and models helps reduce this risk.  
-  
-  
-***#  
-  
-***#__Q__:  
-What is the optimal block size for a RAID-4/5 array?  
-  
-__A__:  
-When using the current (November 1997) RAID-4/5  
-implementation, it is strongly recommended that  
-the file system be created with mke2fs -b 4096  
-instead of the default 1024 byte filesystem block size.  
-  
-  
-This is because the current RAID-5 implementation  
-allocates one 4K memory page per disk block;  
-if a disk block were just 1K in size, then  
-75% of the memory which RAID-5 is allocating for  
-pending I/O would not be used. If the disk block  
-size matches the memory page size, then the  
-driver can (potentially) use all of the page.  
-Thus, for a filesystem with a 4096 block size as  
-opposed to a 1024 byte block size, the RAID driver  
-will potentially queue 4 times as much  
-pending I/O to the low level drivers without  
-allocating additional memory.  
-  
-  
-  
-  
-  
-__Note__: the above remarks do NOT apply to Software  
-RAID-0/1/linear driver.  
-  
-  
-  
-  
-  
-__Note:__ the statements about 4K memory page size apply to the  
-Intel x86 architecture. The page size on Alpha, Sparc, and other  
-CPUs is different; I believe it's 8K on Alpha/Sparc (????).  
-Adjust the above figures accordingly.  
-  
-  
-  
-  
-  
-__Note:__ if your file system has a lot of small  
-files (files less than 10KBytes in size), a considerable  
-fraction of the disk space might be wasted. This is  
-because the file system allocates disk space in multiples  
-of the block size. Allocating large blocks for small files  
-clearly results in a waste of disk space: thus, you may  
-want to stick to small block sizes, get a larger effective  
-storage capacity, and not worry about the "wasted" memory  
-due to the block-size/page-size mismatch.  
-  
-  
-  
-  
-  
-__Note:__ most ''typical'' systems do not have that many  
-small files. That is, although there might be thousands  
-of small files, this would lead to only some 10 to 100MB  
-wasted space, which is probably an acceptable tradeoff for  
-performance on a multi-gigabyte disk.  
-  
-  
-However, for news servers, there might be tens or hundreds  
-of thousands of small files. In such cases, the smaller  
-block size, and thus the improved storage capacity,  
-may be more important than the more efficient I/O  
-scheduling.  
-  
-  
-  
-  
-  
-__Note:__ there exists an experimental file system for Linux  
-which packs small files and file chunks onto a single block.  
-It apparently has some very positive performance  
-implications when the average file size is much smaller than  
-the block size.  
-  
-  
-  
-  
-  
-Note: Future versions may implement schemes that obsolete  
-the above discussion. However, this is difficult to  
-implement, since dynamic run-time allocation can lead to  
-dead-locks; the current implementation performs a static  
-pre-allocation.  
-  
-  
-***#  
-  
-***#__Q__:  
-How does the chunk size (stripe size) influence the speed of  
-my RAID-0, RAID-4 or RAID-5 device?  
-  
-__A__:  
-The chunk size is the amount of data contiguous on the  
-virtual device that is also contiguous on the physical  
-device. In this HOWTO, "chunk" and "stripe" refer to  
-the same thing: what is commonly called the "stripe"  
-in other RAID documentation is called the "chunk"  
-in the MD man pages. Stripes or chunks apply only to  
-RAID-0, 4 and 5, since stripes are not used in  
-mirroring (RAID-1) and simple concatenation (RAID-linear).  
-The stripe size affects both read and write latency (delay),  
-throughput (bandwidth), and contention between independent  
-operations (ability to simultaneously service overlapping I/O  
-requests).  
-  
-  
-Assuming the use of the ext2fs file system, and the current  
-kernel policies about read-ahead, large stripe sizes are almost  
-always better than small stripe sizes, and stripe sizes  
-from about a fourth to a full disk cylinder in size  
-may be best. To understand this claim, let us consider the  
-effects of large stripes on small files, and small stripes  
-on large files. The stripe size does  
-not affect the read performance of small files: For an  
-array of N drives, the file has a 1/N probability of  
-being entirely within one stripe on any one of the drives.  
-Thus, both the read latency and bandwidth will be comparable  
-to that of a single drive. Assuming that the small files  
-are statistically well distributed around the filesystem,  
-(and, with the ext2fs file system, they should be), roughly  
-N times more overlapping, concurrent reads should be possible  
-without significant collision between them. Conversely, if  
-very small stripes are used, and a large file is read sequentially,  
-then a read will be issued to all of the disks in the array.  
-For the read of a single large file, the latency will almost  
-double, as the probability of a block being 3/4'ths of a  
-revolution or farther away will increase. Note, however,  
-the trade-off: the bandwidth could improve almost N-fold  
-for reading a single, large file, as N drives can be reading  
-simultaneously (that is, if read-ahead is used so that all  
-of the disks are kept active). But there is another,  
-counter-acting trade-off: if all of the drives are already busy  
-reading one file, then attempting to read a second or third  
-file at the same time will cause significant contention,  
-ruining performance as the disk ladder algorithms lead to  
-seeks all over the platter. Thus, large stripes will almost  
-always lead to the best performance. The sole exception is  
-the case where one is streaming a single, large file at a  
-time, and one requires the top possible bandwidth, and one  
-is also using a good read-ahead algorithm, in which case small  
-stripes are desired.  
-  
-  
-  
-  
-  
-Note that this HOWTO previously recommended small stripe  
-sizes for news spools or other systems with lots of small  
-files. This was bad advice, and here's why: news spools  
-contain not only many small files, but also large summary  
-files, as well as large directories. If the summary file  
-is larger than the stripe size, reading it will cause  
-many disks to be accessed, slowing things down as each  
-disk performs a seek. Similarly, the current ext2fs  
-file system searches directories in a linear, sequential  
-fashion. Thus, to find a given file or inode, on average  
-half of the directory will be read. If this directory is  
-spread across several stripes (several disks), the  
-directory read (e.g. due to the ls command) could get  
-very slow. Thanks to Steven A. Reisman  
-<  
-sar@pressenter.com> for this correction.  
-Steve also adds:  
-  
-I found that using a 256k stripe gives much better performance.  
-I suspect that the optimum size would be the size of a disk  
-cylinder (or maybe the size of the disk drive's sector cache).  
-However, disks nowadays have recording zones with different  
-sector counts (and sector caches vary among different disk  
-models). There's no way to guarantee stripes won't cross a  
-cylinder boundary.  
-  
-  
-  
-  
-  
-  
-  
-  
-  
-The tools accept the stripe size specified in KBytes.  
-You'll want to specify a multiple of the page size  
-for your CPU (4KB on the x86).  
-  
-  
-  
-  
-  
-***#  
-  
-***#__Q__:  
-What is the correct stride factor to use when creating the  
-ext2fs file system on the RAID partition? By stride, I mean  
-the -R flag on the mke2fs command:  
-  
-mke2fs -b 4096 -R stride=nnn ...  
-  
-What should the value of nnn be?  
-  
-__A__:  
-The -R stride flag is used to tell the file system  
-about the size of the RAID stripes. Since only RAID-0, 4 and 5  
-use stripes, and RAID-1 (mirroring) and RAID-linear do not,  
-this flag is applicable only for RAID-0, 4 and 5.  
-Knowledge of the size of a stripe allows mke2fs  
-to allocate the block and inode bitmaps so that they don't  
-all end up on the same physical drive. An unknown contributor  
-wrote:  
-  
-I noticed last spring that one drive in a pair always had a  
-larger I/O count, and tracked it down to the these meta-data  
-blocks. Ted added the -R stride= option in response  
-to my explanation and request for a workaround.  
-  
-For a 4KB block file system, with stripe size 256KB, one would  
-use -R stride=64.  
-  
-  
-If you don't trust the -R flag, you can get a similar  
-effect in a different way. Steven A. Reisman  
-<  
-sar@pressenter.com> writes:  
-  
-Another consideration is the filesystem used on the RAID-0 device.  
-The ext2 filesystem allocates 8192 blocks per group. Each group  
-has its own set of inodes. If there are 2, 4 or 8 drives, these  
-inodes cluster on the first disk. I've distributed the inodes  
-across all drives by telling mke2fs to allocate only 7932 blocks  
-per group.  
-  
-Some mke2fs pages do not describe the [[-g blocks-per-group]  
-flag used in this operation.  
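-  
-A sketch combining the two suggestions above (the 256KB stripe size  
-and the device name are illustrative):  
-  
-mke2fs -b 4096 -R stride=64 /dev/md0    # 256KB stripe / 4KB blocks = 64  
-mke2fs -b 4096 -g 7932 /dev/md0         # alternative: 7932 blocks per group  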
-  
-  
-***#  
-  
-***#__Q__:  
-Where can I put the md commands in the startup scripts,  
-so that everything will start automatically at boot time?  
-  
-__A__:  
-Rod Wilkens  
-<  
-rwilkens@border.net>  
-writes:  
-  
-What I did is put ``mdadd -ar'' in  
-the ``/etc/rc.d/rc.sysinit'' right after the kernel  
-loads the modules, and before the ``fsck'' disk check.  
-This way, you can put the ``/dev/md?'' device in the  
-``/etc/fstab''. Then I put the ``mdstop -a''  
-right after the ``umount -a'' unmounting the disks,  
-in the ``/etc/rc.d/init.d/halt'' file.  
-  
-For raid-5, you will want to look at the return code  
-for mdadd, and if it failed, do a  
-  
-  
-ckraid --fix /etc/raid5.conf  
-  
-  
-to repair any damage.  
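-  
-A sketch of the fragments described above (file locations vary by  
-distribution; the fsck line is the one already in the script, shown  
-only to make the ordering clear):  
-  
-# in /etc/rc.d/rc.sysinit, after the modules are loaded:  
-mdadd -ar  
-if [ $? -gt 0 ] ; then  
-ckraid --fix /etc/raid5.conf  
-fi  
-fsck -A  
-  
-# in /etc/rc.d/init.d/halt, right after the ``umount -a'':  
-mdstop -a  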
-  
-  
-***#  
-  
-***#__Q__:  
-I was wondering if it's possible to set up striping with more  
-than 2 devices in md0? This is for a news server,  
-and I have 9 drives... Needless to say I need much more than two.  
-Is this possible?  
-  
-__A__:  
-Yes. (describe how to do this)  
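-  
-A rough sketch, assuming nine partitions /dev/sda1 through /dev/sdi1  
-(the names are illustrative). Remember that the default limit of 8  
-real devices per MD device (MAX_REAL) must be raised first, as  
-described in the ''Troubleshooting'' section above:  
-  
-mdadd /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1  
-mdrun -p0 /dev/md0  
-mke2fs /dev/md0  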
-  
-  
-***#  
-  
-***#__Q__:  
-When is Software RAID superior to Hardware RAID?  
-  
-__A__:  
-Normally, Hardware RAID is considered superior to Software  
-RAID, because hardware controllers often have a large cache,  
-and can do a better job of scheduling operations in parallel.  
-However, integrated Software RAID can (and does) gain certain  
-advantages from being close to the operating system.  
-  
-  
-For example, ... ummm. Opaque description of caching of  
-reconstructed blocks in buffer cache elided ...  
-  
-  
-  
-  
-  
-On a dual PPro SMP system, it has been reported that  
-Software-RAID performance exceeds the performance of a  
-well-known hardware-RAID board vendor by a factor of  
-2 to 5.  
-  
-  
-  
-  
-  
-Software RAID is also a very interesting option for  
-high-availability redundant server systems. In such  
-a configuration, two CPUs are attached to one set  
-of SCSI disks. If one server crashes or fails to  
-respond, then the other server can mdadd,  
-mdrun and mount the software RAID  
-array, and take over operations. This sort of dual-ended  
-operation is not always possible with many hardware  
-RAID controllers, because of the state configuration that  
-the hardware controllers maintain.  
-  
-  
-***#  
-  
-***#__Q__:  
-If I upgrade my version of raidtools, will it have trouble  
-manipulating older raid arrays? In short, should I recreate my  
-RAID arrays when upgrading the raid utilities?  
-  
-__A__:  
-No, not unless the major version number changes.  
-An MD version x.y.z consists of three sub-versions:  
-  
-x: Major version.  
-y: Minor version.  
-z: Patchlevel version.  
-  
-Version x1.y1.z1 of the RAID driver supports a RAID array with  
-version x2.y2.z2 in case (x1 == x2) and (y1 >= y2).  
-Different patchlevel (z) versions for the same (x.y) version are  
-designed to be mostly compatible.  
-  
-  
-The minor version number is increased whenever the RAID array layout  
-is changed in a way which is incompatible with older versions of the  
-driver. New versions of the driver will maintain compatibility with  
-older RAID arrays.  
-  
-  
-The major version number will be increased if it will no longer make  
-sense to support old RAID arrays in the new kernel code.  
-  
-  
-  
-  
-  
-For RAID-1, it's not likely that either the disk layout or the  
-superblock structure will change anytime soon.  
-Optimizations and new features (reconstruction, multithreaded  
-tools, hot-plug, etc.) do not affect the physical layout.  
-  
-  
-***#  
-  
-***#__Q__:  
-The command mdstop /dev/md0 says that the device is busy.  
-  
-__A__:  
-There's a process that has a file open on /dev/md0, or  
-/dev/md0 is still mounted. Terminate the process or  
-umount /dev/md0.  
-  
-  
-***#  
-  
-***#__Q__:  
-Are there performance tools?  
-  
-__A__:  
-There is a new utility called iotrace in the  
-linux/iotrace  
-directory. It reads /proc/io-trace and analyses/plots its  
-output. If you feel your system's block IO performance is too  
-low, just look at the iotrace output.  
-  
-  
-***#  
-  
-***#__Q__:  
-I was reading the RAID source, and saw the value  
-SPEED_LIMIT defined as 1024K/sec. What does this mean?  
-Does this limit performance?  
-  
-__A__:  
-SPEED_LIMIT is used to limit RAID reconstruction  
-speed during automatic reconstruction. Basically, automatic  
-reconstruction allows you to e2fsck and  
-mount immediately after an unclean shutdown,  
-without first running ckraid. Automatic  
-reconstruction is also used after a failed hard drive  
-has been replaced.  
-  
-  
-In order to avoid overwhelming the system while  
-reconstruction is occurring, the reconstruction thread  
-monitors the reconstruction speed and slows it down if  
-it's too fast. The 1M/sec limit was arbitrarily chosen  
-as a reasonable rate which allows the reconstruction to  
-finish reasonably rapidly, while creating only a light load  
-on the system so that other processes are not interfered with.  
-  
-  
-***#  
-  
-***#__Q__:  
-What about ''spindle synchronization'' or ''disk  
-synchronization''?  
-  
-__A__:  
-Spindle synchronization is used to keep multiple hard drives  
-spinning at exactly the same speed, so that their disk  
-platters are always perfectly aligned. This is used by some  
-hardware controllers to better organize disk writes.  
-However, for software RAID, this information is not used,  
-and spindle synchronization might even hurt performance.  
-  
-  
-***#  
-  
-***#__Q__:  
-How can I set up swap spaces using raid?  
-Wouldn't striped swap areas over 4+ drives be really fast?  
-  
-__A__:  
-Leonard N. Zubkoff replies:  
-It is really fast, but you don't need to use MD to get striped  
-swap. The kernel automatically stripes across equal priority  
-swap spaces. For example, the following entries from  
-/etc/fstab stripe swap space across five drives in  
-three groups:  
-  
-/dev/sdg1 swap swap pri=3  
-/dev/sdk1 swap swap pri=3  
-/dev/sdd1 swap swap pri=3  
-/dev/sdh1 swap swap pri=3  
-/dev/sdl1 swap swap pri=3  
-/dev/sdg2 swap swap pri=2  
-/dev/sdk2 swap swap pri=2  
-/dev/sdd2 swap swap pri=2  
-/dev/sdh2 swap swap pri=2  
-/dev/sdl2 swap swap pri=2  
-/dev/sdg3 swap swap pri=1  
-/dev/sdk3 swap swap pri=1  
-/dev/sdd3 swap swap pri=1  
-/dev/sdh3 swap swap pri=1  
-/dev/sdl3 swap swap pri=1  
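-  
-After adding entries like these, a sketch for activating and checking  
-them (assuming your util-linux provides these options):  
-  
-swapon -a           # activate every swap entry in /etc/fstab  
-cat /proc/swaps     # verify the priorities, on kernels that provide it  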
-  
-  
-  
-***#  
-  
-***#__Q__:  
-I want to maximize performance. Should I use multiple  
-controllers?  
-  
-__A__:  
-In many cases, the answer is yes. Using several  
-controllers to perform disk access in parallel will  
-improve performance. However, the actual improvement  
-depends on your actual configuration. For example,  
-it has been reported (Vaughan Pratt, January 98) that  
-a single 4.3GB Cheetah attached to an Adaptec 2940UW  
-can achieve a rate of 14MB/sec (without using RAID).  
-Installing two disks on one controller, and using  
-a RAID-0 configuration results in a measured performance  
-of 27 MB/sec.  
-  
-  
-Note that the 2940UW controller is an "Ultra-Wide"  
-SCSI controller, capable of a theoretical burst rate  
-of 40MB/sec, and so the above measurements are not  
-surprising. However, a slower controller attached  
-to two fast disks would be the bottleneck. Note also,  
-that most out-board SCSI enclosures (e.g. the kind  
-with hot-pluggable trays) cannot be run at the 40MB/sec  
-rate, due to cabling and electrical noise problems.  
-  
-  
-  
-  
-  
-If you are designing a multiple controller system,  
-remember that most disks and controllers typically  
-run at 70-85% of their rated max speeds.  
-  
-  
-  
-  
-  
-Note also that using one controller per disk  
-can reduce the likelihood of system outage  
-due to a controller or cable failure (In theory --  
-only if the device driver for the controller can  
-gracefully handle a broken controller. Not all  
-SCSI device drivers seem to be able to handle such  
-a situation without panicking or otherwise locking up).  
-  
-  
-***#  
-  
-----  
-  
-!!9. High Availability RAID  
-  
-  
-  
-  
-  
-***#__Q__:  
-RAID can help protect me against data loss. But how can I also  
-ensure that the system is up as long as possible, and not prone  
-to breakdown? Ideally, I want a system that is up 24 hours a  
-day, 7 days a week, 365 days a year.  
-  
-__A__:  
-High-Availability is difficult and expensive. The harder  
-you try to make a system be fault tolerant, the harder  
-and more expensive it gets. The following hints, tips,  
-ideas and unsubstantiated rumors may help you with this  
-quest.  
-  
-  
-***#*IDE disks can fail in such a way that the failed disk  
-on an IDE ribbon can also prevent the good disk on the  
-same ribbon from responding, thus making it look as  
-if two disks have failed. Since RAID does not  
-protect against two-disk failures, one should either  
-put only one disk on an IDE cable, or if there are two  
-disks, they should belong to different RAID sets.  
-***#*  
-  
-***#*SCSI disks can fail in such a way that the failed disk  
-on a SCSI chain can prevent any device on the chain  
-from being accessed. The failure mode involves a  
-short of the common (shared) device ready pin;  
-since this pin is shared, no arbitration can occur  
-until the short is removed. Thus, no two disks on the  
-same SCSI chain should belong to the same RAID array.  
-***#*  
-  
-***#*Similar remarks apply to the disk controllers.  
-Don't load up the channels on one controller; use  
-multiple controllers.  
-***#*  
-  
-***#*Don't use the same brand or model number for all of  
-the disks. It is not uncommon for severe electrical  
-storms to take out two or more disks. (Yes, we  
-all use surge suppressors, but these are not perfect  
-either). Heat & poor ventilation of the disk  
-enclosure are other disk killers. Cheap disks  
-often run hot.  
-Using different brands of disk & controller  
-decreases the likelihood that whatever took out one disk  
-(heat, physical shock, vibration, electrical surge)  
-will also damage the others on the same date.  
-***#*  
-  
-***#*To guard against controller or CPU failure,  
-it should be possible to build a SCSI disk enclosure  
-that is "twin-tailed": i.e. is connected to two  
-computers. One computer will mount the file-systems  
-read-write, while the second computer will mount them  
-read-only, and act as a hot spare. When the hot-spare  
-is able to determine that the master has failed (e.g.  
-through a watchdog), it will cut the power to the  
-master (to make sure that it's really off), and then  
-fsck & remount read-write. If anyone gets  
-this working, let me know.  
-***#*  
-  
-***#*Always use a UPS, and perform clean shutdowns.  
-Although an unclean shutdown may not damage the disks,  
-running ckraid on even small-ish arrays is painfully  
-slow. You want to avoid running ckraid as much as  
-possible. Or you can hack on the kernel and get the  
-hot-reconstruction code debugged ...  
-***#*  
-  
-***#*SCSI cables are well-known to be very temperamental  
-creatures, and prone to cause all sorts of problems.  
-Use the highest quality cabling that you can find for  
-sale. Use e.g. bubble-wrap to make sure that ribbon  
-cables do not get too close to one another and  
-cross-talk. Rigorously observe cable-length  
-restrictions.  
-***#*  
-  
-***#*Take a look at SSA (Serial Storage Architecture).  
-Although it is rather expensive, it is rumored  
-to be less prone to the failure modes that SCSI  
-exhibits.  
-***#*  
-  
-***#*Enjoy yourself, it's later than you think.  
-***#*  
-  
-  
-  
-***#  
-  
-----  
-  
-!!10. Questions Waiting for Answers  
-  
-  
-  
-  
-  
-***#__Q__:  
-If, for cost reasons, I try to mirror a slow disk with a fast disk,  
-is the S/W smart enough to balance the reads accordingly or will it  
-all slow down to the speed of the slowest?  
-  
-  
-  
-  
-***#  
-  
-***#__Q__:  
-For testing the raw disk throughput...  
-is there a character device for raw read/raw writes instead of  
-/dev/sdaxx that we can use to measure performance  
-on the raid drives??  
-is there a GUI-based tool to use to watch the disk throughput??  
-  
-  
-  
-  
-***#  
-  
-----  
-  
-!!11. Wish List of Enhancements to MD and Related Software  
-  
-  
-Bradley Ward Allen  
-<  
-ulmo@Q.Net>  
-wrote:  
-  
-Ideas include:  
-  
-  
-****Boot-up parameters to tell the kernel which devices are  
-to be MD devices (no more ``mdadd'')  
-****  
-  
-****Making MD transparent to ``mount''/``umount''  
-such that there is no ``mdrun'' and ``mdstop''  
-****  
-  
-****Integrating ``ckraid'' entirely into the kernel,  
-and letting it run as needed  
-****  
-  
-(So far, all I've done is suggest getting rid of the tools and putting  
-them into the kernel; that's how I feel about it,  
-this is a filesystem, not a toy.)  
-  
-  
-****Deal with arrays that can easily survive N disks going out  
-simultaneously or at separate moments,  
-where N is a whole number > 0 settable by the administrator  
-****  
-  
-****Handle kernel freezes, power outages,  
-and other abrupt shutdowns better  
-****  
-  
-****Don't disable a whole disk if only parts of it have failed,  
-e.g., if the sector errors are confined to less than 50% of  
-access over the attempts of 20 dissimilar requests,  
-then it continues just ignoring those sectors of that particular  
-disk.  
-****  
-  
-****Bad sectors:  
-  
-  
-*****A mechanism for saving which sectors are bad,  
-someplace onto the disk.  
-*****  
-  
-*****If there is a generalized mechanism for marking degraded  
-bad blocks that upper filesystem levels can recognize,  
-use that. Program it if not.  
-*****  
-  
-*****Perhaps alternatively a mechanism for telling the upper  
-layer that the size of the disk got smaller,  
-even arranging for the upper layer to move out stuff from  
-the areas being eliminated.  
-This would help with degraded blocks as well.  
-*****  
-  
-*****Failing the above ideas, keeping a small (admin settable)  
-amount of space aside for bad blocks (distributed evenly  
-across disk?), and using them (nearby if possible)  
-instead of the bad blocks when it does happen.  
-Of course, this is inefficient.  
-Furthermore, the kernel ought to log every time the RAID  
-array starts each bad sector and what is being done about  
-it with a ``crit'' level warning, just to get  
-the administrator to realize that his disk has a piece of  
-dust burrowing into it (or a head with platter sickness).  
-*****  
-  
-  
-****  
-  
-****Software-switchable disks:  
-  
-; __``disable this disk''__:  
-  
-would block until kernel has completed making sure  
-there is no data on the disk being shut down  
-that is needed (e.g., to complete an XOR/ECC/other error  
-correction), then release the disk from use  
-(so it could be removed, etc.);  
-; __``enable this disk''__:  
-  
-would mkraid a new disk if appropriate  
-and then start using it for ECC/whatever operations,  
-enlarging the RAID5 array as it goes;  
-; __``resize array''__:  
-  
-would respecify the total number of disks  
-and the number of redundant disks, and the result  
-would often be to resize the size of the array;  
-where no data loss would result,  
-doing this as needed would be nice,  
-but I have a hard time figuring out how it would do that;  
-in any case, a mode where it would block  
-(for possibly hours (kernel ought to log something every  
-ten seconds if so)) would be necessary;  
-; __``enable this disk while saving data''__:  
-  
-which would save the data on a disk as-is and move it  
-to the RAID5 system as needed, so that a horrific save  
-and restore would not have to happen every time someone  
-brings up a RAID5 system (instead, it may be simpler to  
-only save one partition instead of two,  
-it might fit onto the first as a gzip'd file even);  
-finally,  
-; __``re-enable disk''__:  
-  
-would be an operator's hint to the OS to try out  
-a previously failed disk (it would simply call disable  
-then enable, I suppose).  
-  
-  
-****  
-  
-  
-  
-  
-Other ideas off the net:  
-  
-  
-  
-****finalrd analog to initrd, to simplify root raid.  
-****  
-  
-****a read-only raid mode, to simplify the above  
-****  
-  
-****Mark the RAID set as clean whenever there are no  
-"half writes" done. -- That is, whenever there are no write  
-transactions that were committed on one disk but still  
-unfinished on another disk.  
-Add a "write inactivity" timeout (to avoid frequent seeks  
-to the RAID superblock when the RAID set is relatively  
-busy).  
-  
-****  
-  
-  
-  
-  
-  
-----  
+Describe [HowToSoftwareRAID0 .4xHOWTO ] here.