A Case for Redundant Arrays of Inexpensive Disks (RAID)


David A. Patterson, Garth Gibson, and Randy H. Katz
Computer Science Division
Department of Electrical Engineering and Computer Sciences
571 Evans Hall
University of California
Berkeley, CA 94720
(pattrsn@ginger.berkeley.edu)

Abstract

Increasing performance of CPUs and memories will be squandered if not matched by a similar performance increase in I/O. While the capacity of Single Large Expensive Disks (SLED) has grown rapidly, the performance improvement of SLED has been modest. Redundant Arrays of Inexpensive Disks (RAID), based on the magnetic disk technology developed for personal computers, offers an attractive alternative to SLED, promising improvements of an order of magnitude in performance, reliability, power consumption, and scalability. This paper introduces five levels of RAIDs, giving their relative cost/performance, and compares RAID to an IBM 3380 and a Fujitsu Super Eagle.

1. Background: Rising CPU and Memory Performance

The users of computers are currently enjoying unprecedented growth in the speed of computers. Gordon Bell said that between 1974 and 1984, single chip computers improved in performance by 40% per year, about twice the rate of minicomputers [Bell 84]. In the following year Bill Joy predicted an even faster growth [Joy 85]. Mainframe and supercomputer manufacturers, having difficulty keeping pace with the rapid growth predicted by “Joy’s Law,” cope by offering multiprocessors as their top-of-the-line product.

But a fast CPU does not a fast system make. Gene Amdahl related CPU speed to main memory size using this rule [Siewiorek 82]:

Each CPU instruction per second requires one byte of main memory.

If computer system costs are not to be dominated by the cost of memory, then Amdahl’s constant suggests that memory chip capacity should grow at the same rate. Gordon Moore predicted that growth rate over 20 years ago:

$$\text{transistors/chip} = 2^{\text{Year} - 1964}$$

As predicted by Moore’s Law, RAMs have quadrupled in capacity every two [Moore 75] to three years [Myers 86].

Recently the ratio of megabytes of main memory to MIPS has been defined as alpha [Garcia 84], with Amdahl’s constant meaning alpha = 1. In part because of the rapid drop of memory prices, main memory sizes have grown faster than CPU speeds, and many machines are shipped today with alphas of 3 or higher.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. © 1988 ACM 0-89791-268-3/88/0006/0109 $1.50

To maintain the balance of costs in computer systems, secondary storage must match the advances in other parts of the system. A key measure of magnetic disk technology is the growth in the maximum number of bits that can be stored per square inch, or the bits per inch in a track times the number of tracks per inch. Called M.A.D., for maximal areal density, the “First Law in Disk Density” predicts [Frank 87]:

$$\text{MAD} = 10^{(\text{Year} - 1971)/10}$$

Magnetic disk technology has doubled capacity and halved price every three years, in line with the growth rate of semiconductor memory, and in practice between 1967 and 1979 the disk capacity of the average IBM data processing system more than kept up with its main memory [Stevens 81].

Capacity is not the only memory characteristic that must grow rapidly to maintain system
balance, since the speed with which instructions and data are delivered to a CPU also determines its ultimate performance. The speed of main memory has kept pace for two reasons:
(1) the invention of caches, showing that a small buffer can be managed automatically to contain a substantial fraction of memory references;
(2) and the SRAM technology, used to build caches, whose speed has improved at the rate of 40% to 100% per year.

In contrast to primary memory technologies, the performance of single large expensive magnetic disks (SLED) has improved at a modest rate. These mechanical devices are dominated by the seek and the rotation delays: from 1971 to 1981, the raw seek time for a high-end IBM disk improved by only a factor of two while the rotation time did not change [Harker 81]. Greater density means a higher transfer rate when the information is found, and extra heads can reduce the average seek time, but the raw seek time only improved at a rate of 7% per year. There is no reason to expect a faster rate in the near future.

To maintain balance, computer systems have been using even larger main memories or solid state disks to buffer some of the I/O activity. This may be a fine solution for applications whose I/O activity has locality of reference and for which volatility is not an issue, but applications dominated by a high rate of random requests for small pieces of data (such as transaction-processing) or by a low number of requests for massive amounts of data (such as large simulations running on supercomputers) are facing a serious performance limitation.

2. The Pending I/O Crisis

What is the impact of improving the performance of some pieces of a problem while leaving others the same?
Amdahl’s answer is now known as Amdahl’s Law [Amdahl 67]:

$$S = \frac{1}{(1 - f) + f/k}$$

where
S = the effective speedup,
f = fraction of work in faster mode, and
k = speedup while in faster mode.

Suppose that some current applications spend 10% of their time in I/O. Then when computers are 10X faster--according to Bill Joy in just over three years--then Amdahl’s Law predicts effective speedup will be only 5X. When we have computers 100X faster--via evolution of uniprocessors or by multiprocessors--this application will be less than 10X faster, wasting 90% of the potential speedup.

While we can imagine improvements in software file systems via buffering for near term I/O demands, we need innovation to avoid an I/O crisis [Boral 83].

3. A Solution: Arrays of Inexpensive Disks

Rapid improvements in capacity of large disks have not been the only target of disk designers, since personal computers have created a market for inexpensive magnetic disks. These lower cost disks have lower performance as well as less capacity. Table I below compares the top-of-the-line IBM 3380 model AK4 mainframe disk, Fujitsu M2361A “Super Eagle” minicomputer disk, and the Conner Peripherals CP 3100 personal computer disk.

Characteristics                  IBM 3380   Fujitsu M2361A   Conner CP3100   3380 v. 3100   2361 v. 3100
                                                                             (>1 means 3100 is better)
Disk diameter (inches)           14         10.5             3.5             4              3
Formatted Data Capacity (MB)     7,500      600              100             .01            .2
Price/MB (controller incl.)      $18-$10    $20-$17          $10-$7          1-2.5          1.7-3
MTTF Rated (hours)               30,000     20,000           30,000          1              1.5
MTTF in practice (hours)         100,000    ?                ?               ?              ?
No. Actuators                    4          1                1               .2             1
Maximum I/O's/second/Actuator    50         40               30              .6             .8
Typical I/O's/second/Actuator    30         24               20              .7             .8
Maximum I/O's/second/box         200        40               30              .2             .8
Typical I/O's/second/box         120        24               20              .2             .8
Transfer Rate (MB/sec)           3          2.5              1               .3             .4
Power/box (W)                    6,600      640              10              660            64
Volume (cu. ft.)                 24         3.4              .03             800            110

Table I. Comparison of IBM 3380 disk model AK4 for mainframe computers, the Fujitsu M2361A “Super Eagle” disk for minicomputers, and the Conner Peripherals CP 3100 disk for personal computers. By “Maximum I/O's/second” we mean the maximum number of average seeks and average rotates for a single sector access. Cost and reliability information on the 3380 comes from widespread experience [IBM 87] [Gawlick 87] and the information on the Fujitsu from the manual [Fujitsu 87], while some numbers on the new CP3100 are based on speculation. The price per megabyte is given as a range to allow for different prices for volume discount and different mark-up practices of the vendors. (The 8 watt maximum power of the CP3100 was increased to 10 watts to allow for the inefficiency of an external power supply, since the other drives contain their own power supplies.)

One surprising fact is that the number of I/Os per second per actuator in an inexpensive disk is within a factor of two of the large disks. In several of the remaining metrics, including price per megabyte, the inexpensive disk is superior or equal to the large disks.

The small size and low power are even more impressive since disks such as the CP3100 contain full track buffers and most functions of the traditional mainframe controller. Small disk manufacturers can provide such functions in high volume disks because of the efforts of standards committees in defining higher level peripheral interfaces,
such as the ANSI X3.131-1986 Small Computer System Interface (SCSI). Such standards have encouraged companies like Adaptec to offer SCSI interfaces as single chips, in turn allowing disk companies to embed mainframe controller functions at low cost. Figure 1 compares the traditional mainframe disk approach and the small computer disk approach. The same SCSI interface chip embedded as a controller in every disk can also be used as the direct memory access (DMA) device at the other end of the SCSI bus.

Such characteristics lead to our proposal for building I/O systems as arrays of inexpensive disks, either interleaved for the large transfers of supercomputers [Kim 86] [Livny 87] [Salem 86] or independent for the many small transfers of transaction processing. Using the information in Table I, 75 inexpensive disks potentially have 12 times the I/O bandwidth of the IBM 3380 and the same capacity, with lower power consumption and cost.

4. Caveats

We cannot explore all issues associated with such arrays in the space available for this paper, so we concentrate on fundamental estimates of price-performance and reliability. Our reasoning is that if there are no advantages in price-performance or terrible disadvantages in reliability, then there is no need to explore further. We characterize a transaction-processing workload to evaluate performance of a collection of inexpensive disks, but remember that such a collection is just one hardware component of a complete transaction-processing system. While designing a complete TPS based on these ideas is enticing, we will resist that temptation in this paper. Cabling and packaging, certainly an issue in the cost and reliability of an array of many inexpensive disks, is also beyond this paper’s scope.

Figure 1. Comparison of organizations for typical mainframe and small computer disk interfaces (panels: Mainframe; Small Computer). Single chip SCSI interfaces such as the Adaptec AIC-6250 allow the small computer to use a single chip to be the DMA interface as well as provide an embedded controller for each disk [Adaptec 87]. (The price per megabyte in Table I includes everything in the shaded boxes above.)

5. And Now The Bad News: Reliability

The unreliability of disks forces computer systems managers to make backup versions of information quite frequently in case of failure. What would be the impact on reliability of having a hundredfold increase in disks? Assuming a constant failure rate--that is, an exponentially distributed time to failure--and that failures are independent--both assumptions made by disk manufacturers when calculating the Mean Time To Failure (MTTF)--the reliability of an array of disks is:

$$\text{MTTF of a Disk Array} = \frac{\text{MTTF of a Single Disk}}{\text{Number of Disks in the Array}}$$

Using the information in Table I, the MTTF of 100 CP 3100 disks is 30,000/100 = 300 hours, or less than 2 weeks. Compared to the 30,000 hour (> 3 years) MTTF of the IBM 3380, this is dismal. If we consider scaling the array to 1000 disks, then the MTTF is 30 hours or about one day, requiring an adjective worse than dismal.

Without fault tolerance, large arrays of inexpensive disks are too unreliable to be useful.

6.
A Better Solution: RAID

To overcome the reliability challenge, we must make use of extra disks containing redundant information to recover the original information when a disk fails. Our acronym for these Redundant Arrays of Inexpensive Disks is RAID. To simplify the explanation of our final proposal and to avoid confusion with previous work, we give a taxonomy of five different organizations of disk arrays, beginning with mirrored disks and progressing through a variety of alternatives with differing performance and reliability. We refer to each organization as a RAID level.

The reader should be forewarned that we describe all levels as if implemented in hardware solely to simplify the presentation, for RAID ideas are applicable to software implementations as well as hardware.

Reliability. Our basic approach will be to break the arrays into reliability groups, with each group having extra “check” disks containing redundant information. When a disk fails we assume that within a short time the failed disk will be replaced and the information will be reconstructed on to the new disk using the redundant information. This time is called the mean time to repair (MTTR). The MTTR can be reduced if the system includes extra disks to act as “hot” standby spares; when a disk fails, a replacement disk is switched in electronically. Periodically a human operator replaces all failed disks. Here are other terms that we use:

D = total number of disks with data (not including extra check disks),
G = number of data disks in a group (not including extra check disks),
C = number of check disks in a group,
n_G = D/G = number of groups.

As mentioned above we make the same assumptions that disk manufacturers make--that failures are exponential and independent. (An earthquake or power surge is a situation where an array of disks might not fail independently.) Since these reliability predictions will be very high, we want to emphasize that the reliability is only of the disk-head assemblies with this failure model, and not the whole software and electronic system. In addition, in our view the pace of technology means extremely high MTTF are “overkill”--for, independent of expected lifetime, users will replace obsolete disks. After all, how many people are still using 20 year old disks?

The general MTTF calculation for single-error repairing RAID is given in two steps. First, the group MTTF is:

$$MTTF_{Group} = \frac{MTTF_{Disk}}{G + C} \times \frac{1}{\text{Probability of another failure in a group before repairing the dead disk}}$$

As more formally derived in the appendix, the probability of a second failure before the first has been repaired is:

$$\text{Probability of Another Failure} = \frac{MTTR}{MTTF_{Disk} / (\text{No. Disks} - 1)} = \frac{MTTR}{MTTF_{Disk} / (G + C - 1)}$$

The intuition behind the formal calculation in the appendix comes from trying to calculate the average number of second disk failures during the repair time for X single disk failures. Since we assume that disk failures occur at a uniform rate, this average number of second failures during the repair time for X first failures is:

$$\frac{X \times MTTR}{\text{MTTF of remaining disks in the group}}$$

The average number of second failures for a single disk is then:

$$\frac{MTTR}{MTTF_{Disk} / \text{No. of remaining disks in the group}}$$

The MTTF of the remaining disks is just the MTTF of a single disk divided by the number of good disks in the group, giving the result above.

The second step is the reliability of the whole system, which is approximately (since $MTTF_{Group}$ is not quite distributed exponentially):

$$MTTF_{RAID} = \frac{MTTF_{Group}}{n_G}$$

Plugging it all together, we get:

$$MTTF_{RAID} = \frac{MTTF_{Disk}}{G + C} \times \frac{MTTF_{Disk}}{(G + C - 1) \times MTTR} \times \frac{1}{n_G} = \frac{(MTTF_{Disk})^2}{(G + C) \times n_G \times (G + C - 1) \times MTTR}$$

Since the formula is the same for each level, we make the abstract numbers concrete using these parameters as appropriate: D = 100 total data disks, G = 10 data disks per group, $MTTF_{Disk}$ = 30,000 hours, MTTR = 1 hour, with the check disks per group C determined by the RAID level.

Reliability Overhead Cost. This is simply the extra check disks, expressed as a percentage of the number of data disks D. As we shall see below, the cost varies with RAID level from 100% down to 4%.

Useable Storage Capacity Percentage. Another way to express this reliability overhead is in terms of the percentage of the total capacity of data disks and check disks that can be used to store data. Depending on the organization, this varies from a low of 50% to a high of 96%.

Performance. Since supercomputer applications and transaction-processing systems have different access patterns and rates, we need different metrics to evaluate both. For supercomputers we count the number of reads and writes per second for large blocks of data, with large defined as getting at least one sector from each data disk in a group. During large transfers all the disks in a group act as a single unit, each reading or writing a portion of the large data block in parallel.

A better measure for transaction-processing systems is the number of individual reads or writes per second. Since transaction-processing systems (e.g., debits/credits) use a read-modify-write sequence of disk accesses, we include that metric as well. Ideally during small transfers each disk in a group can act independently,
either reading or writing independent information. In summary, supercomputer applications need a high data rate while transaction-processing needs a high I/O rate.

For both the large and small transfer calculations we assume the minimum user request is a sector, that a sector is small relative to a track, and that there is enough work to keep every device busy. Thus sector size affects both disk storage efficiency and transfer size. Figure 2 shows the ideal operation of large and small disk accesses in a RAID.

(a) Single Large or “Grouped” Read (1 read spread over G disks)
(b) Several Small or Individual Reads and Writes (G reads and/or writes spread over G disks)

Figure 2. Large transfer vs. small transfers in a group of G disks.

The six performance metrics are then the number of reads, writes, and read-modify-writes per second for both large (grouped) or small (individual) transfers. Rather than give absolute numbers for each metric, we calculate efficiency: the number of events per second for a RAID relative to the corresponding events per second for a single disk. (This is Boral’s I/O bandwidth per gigabyte [Boral 83] scaled to gigabytes per disk.) In this paper we are after fundamental differences so we use simple, deterministic throughput measures for our performance metric rather than latency.

Effective Performance Per Disk. The cost of disks can be a large portion of the cost of a database system, so the I/O performance per disk--factoring in the overhead of the check disks--suggests the cost/performance of a system. This is the bottom line for a RAID.

7. First Level RAID: Mirrored Disks

Mirrored disks are a traditional approach for improving reliability of magnetic disks. This is the most expensive option we consider since all disks are duplicated (G=1 and C=1),
and every write to a data disk is also a write to a check disk. Tandem doubles the number of controllers for fault tolerance, allowing an optimized version of mirrored disks that lets reads occur in parallel. Table II shows the metrics for a Level 1 RAID assuming this optimization.

MTTF                        Exceeds Useful Product Lifetime (4,500,000 hrs or > 500 years)
Total Number of Disks       2D
Overhead Cost               100%
Useable Storage Capacity    50%

Events/Sec vs. Single Disk      Full RAID    Efficiency Per Disk
Large (or Grouped) Reads        2D/S         1.00/S
Large (or Grouped) Writes       D/S          .50/S
Large (or Grouped) R-M-W        4D/3S        .67/S
Small (or Individual) Reads     2D           1.00
Small (or Individual) Writes    D            .50
Small (or Individual) R-M-W     4D/3         .67

Table II. Characteristics of Level 1 RAID. Here we assume that writes are not slowed by waiting for the second write to complete because the slowdown for writing 2 disks is minor compared to the slowdown S for writing a whole group of 10 to 25 disks. Unlike a “pure” mirrored scheme with extra disks that are invisible to the software, we assume an optimized scheme with twice as many controllers allowing parallel reads to all disks, giving full disk bandwidth for large reads and allowing the reads of read-modify-writes to occur in parallel.

When individual accesses are distributed across multiple disks, average queueing, seek, and rotate delays may differ from the single disk case. Although bandwidth may be unchanged, it is distributed more evenly, reducing variance in queueing delay and, if the disk load is not too high, also reducing the expected queueing delay through parallelism [Livny 87]. When many arms seek to the same track then rotate to the described sect
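
The MTTF entry in Table II can be checked against the Section 6 formula. The Python sketch below is an illustration, not code from the paper: the function names (`raid_mttf`, `overhead_cost`, `usable_capacity`, `unprotected_array_mttf`) are hypothetical, and the 8,760-hour year is an assumption used only to convert hours to years. It plugs in the paper's stated parameters (D = 100 data disks, disk MTTF of 30,000 hours, MTTR of 1 hour) with G = 1 and C = 1 for mirrored disks.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: non-leap year, for unit conversion only


def unprotected_array_mttf(mttf_disk, num_disks):
    """Section 5: MTTF of an array with no redundancy."""
    return mttf_disk / num_disks


def raid_mttf(mttf_disk, mttr, d, g, c):
    """Section 6: MTTF_Disk^2 / ((G+C) * n_G * (G+C-1) * MTTR)."""
    n_groups = d / g  # n_G = D / G
    return mttf_disk ** 2 / ((g + c) * n_groups * (g + c - 1) * mttr)


def overhead_cost(g, c):
    """Check disks as a fraction of data disks."""
    return c / g


def usable_capacity(g, c):
    """Fraction of total disk capacity that holds user data."""
    return g / (g + c)


# Level 1 RAID (mirroring): one check disk per data disk.
level1_hours = raid_mttf(mttf_disk=30_000, mttr=1, d=100, g=1, c=1)
print(level1_hours)                        # 4500000.0 hours
print(level1_hours / HOURS_PER_YEAR)       # a little over 500 years
print(overhead_cost(1, 1))                 # 1.0, i.e. 100%
print(usable_capacity(1, 1))               # 0.5, i.e. 50%
print(unprotected_array_mttf(30_000, 100)) # 300.0 hours, the Section 5 figure
```

With G = 1 and C = 1 the sketch gives 4,500,000 hours, 100% overhead, and 50% usable capacity, matching Table II. Because the paper states that the formula is the same for each level, evaluating another organization under these assumptions would change only the `g` and `c` arguments.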