A Case for Redundant Arrays of Inexpensive Disks (RAID)
David A. Patterson, Garth Gibson, and Randy H. Katz
Computer Science Division
Department of Electrical Engineering and Computer Sciences
571 Evans Hall
University of California
Berkeley, CA 94720
(pattrsn@ginger.berkeley.edu)
Abstract. Increasing performance of CPUs and memories will be
squandered if not matched by a similar performance increase in I/O. While
the capacity of Single Large Expensive Disks (SLED) has grown rapidly,
the performance improvement of SLED has been modest. Redundant
Arrays of Inexpensive Disks (RAID), based on the magnetic disk
technology developed for personal computers, offers an attractive
alternative to SLED, promising improvements of an order of magnitude in
performance, reliability, power consumption, and scalability. This paper
introduces five levels of RAIDs, giving their relative cost/performance, and
compares RAID to an IBM 3380 and a Fujitsu Super Eagle.
1. Background: Rising CPU and Memory Performance
The users of computers are currently enjoying unprecedented growth
in the speed of computers. Gordon Bell said that between 1974 and 1984,
single chip computers improved in performance by 40% per year, about
twice the rate of minicomputers [Bell 84]. In the following year Bill Joy
predicted an even faster growth [Joy 85].

Mainframe and supercomputer manufacturers, having difficulty keeping
pace with the rapid growth predicted by “Joy’s Law,” cope by offering
multiprocessors as their top-of-the-line product.
But a fast CPU does not a fast system make. Gene Amdahl related
CPU speed to main memory size using this rule [Siewiorek 82]:

Each CPU instruction per second requires one byte of main memory.

If computer system costs are not to be dominated by the cost of memory,
then Amdahl’s constant suggests that memory chip capacity should grow
at the same rate. Gordon Moore predicted that growth rate over 20 years
ago:

$$\text{transistors/chip} = 2^{\text{Year} - 1964}$$

As predicted by Moore’s Law, RAMs have quadrupled in capacity every
two [Moore 75] to three years [Myers 86].
Recently the ratio of megabytes of main memory to MIPS has been
defined as alpha [Garcia 84], with Amdahl’s constant meaning alpha = 1. In
part because of the rapid drop of memory prices, main memory sizes have
grown faster than CPU speeds and many machines are shipped today with
alphas of 3 or higher.
Permission to copy without fee all or part of this material is granted provided that the copies
are not made or distributed for direct commercial advantage, the ACM copyright notice
and the title of the publication and its date appear, and notice is given that copying is by
permission of the Association for Computing Machinery. To copy otherwise, or to
republish, requires a fee and/or specific permission.
© 1988 ACM 0-89791-268-3/88/0006/0109 $1.50

To maintain the balance of costs in computer systems, secondary
storage must match the advances in other parts of the system. A key
measure of magnetic disk technology is the growth in the maximum number of
bits that can be stored per square inch, or the bits per inch in a track
times the number of tracks per inch. Called M.A.D., for maximal areal
density, the “First Law in Disk Density” predicts [Frank 87]:

$$\text{MAD} = 10^{(\text{Year} - 1971)/10}$$

Magnetic disk technology has doubled capacity and halved price every three
years, in line with the growth rate of semiconductor memory, and in
practice between 1967 and 1979 the disk capacity of the average IBM data
processing system more than kept up with its main memory [Stevens 81].
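
As a quick check on the claim that capacity doubles every three years, the sketch below evaluates the density law MAD = 10^((Year - 1971)/10) in Python. The function names are illustrative only and are not taken from the paper.

```python
def max_areal_density(year: float) -> float:
    """Evaluate MAD = 10 ** ((year - 1971) / 10) from the First Law in Disk Density."""
    return 10 ** ((year - 1971) / 10)


def growth_factor(years: float) -> float:
    """Factor by which MAD grows over a span of `years` under the same law."""
    return 10 ** (years / 10)


if __name__ == "__main__":
    # A three-year span should give a factor close to 2 (a doubling).
    print(f"growth over 3 years: {growth_factor(3):.2f}x")
    print(f"MAD(1981) / MAD(1971): {max_areal_density(1981) / max_areal_density(1971):.1f}")
```

Under this law a three-year span gives a factor of 10^0.3, which is very close to 2 and so agrees with the doubling described above.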
Capacity is not the only memory characteristic that must grow
rapidly to maintain system balance, since the speed with which
instructions and data are delivered to a CPU also determines its ultimate
performance. The speed of main memory has kept pace for two reasons:
(1) the invention of caches, showing that a small buffer can be managed
automatically to contain a substantial fraction of memory references;
(2) and the SRAM technology, used to build caches, whose speed has
improved at the rate of 40% to 100% per year.
In contrast to primary memory technologies, the performance of
single large expensive magnetic disks (SLED) has improved at a modest
rate. These mechanical devices are dominated by the seek and the rotation
delays: from 1971 to 1981, the raw seek time for a high-end IBM disk
improved by only a factor of two while the rotation time did not
change [Harker 81]. Greater density means a higher transfer rate when the
information is found, and extra heads can reduce the average seek time, but
the raw seek time only improved at a rate of 7% per year. There is no
reason to expect a faster rate in the near future.
To maintain balance, computer systems have been using even larger
main memories or solid state disks to buffer some of the I/O activity.
This may be a fine solution for applications whose I/O activity has
locality of reference and for which volatility is not an issue, but
applications dominated by a high rate of random requests for small pieces
of data (such as transaction-processing) or by a low number of requests for
massive amounts of data (such as large simulations running on
supercomputers) are facing a serious performance limitation.
2. The Pending I/O Crisis
What is the impact of improving the performance of some pieces of a
problem while leaving others the same? Amdahl’s answer is now known
as Amdahl’s Law [Amdahl 67]:

$$S = \frac{1}{(1 - f) + f/k}$$

where
S = the effective speedup,
f = fraction of work in faster mode, and
k = speedup while in faster mode.
Suppose that some current applications spend 10% of their time in
I/O. Then when computers are 10X faster--according to Bill Joy in just
over three years--then Amdahl’s Law predicts effective speedup will be only
5X. When we have computers 100X faster--via evolution of uniprocessors
or by multiprocessors--this application will be less than 10X faster,
wasting 90% of the potential speedup.
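
The two figures quoted above follow from Amdahl's Law with f = 0.9 (the 90% of time not spent in I/O) and k = 10 or 100. A minimal Python sketch of that arithmetic follows; the function name is chosen here for illustration.

```python
def amdahl_speedup(f: float, k: float) -> float:
    """Effective speedup S = 1 / ((1 - f) + f / k).

    f: fraction of work that runs in the faster mode (0 <= f <= 1)
    k: speedup while in the faster mode (k > 0)
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if k <= 0:
        raise ValueError("k must be positive")
    return 1.0 / ((1.0 - f) + f / k)


if __name__ == "__main__":
    # 10% of the time is I/O, so only 90% of the work benefits from a faster CPU.
    for k in (10, 100):
        print(f"CPU {k}X faster -> effective speedup {amdahl_speedup(0.9, k):.2f}X")
```

With these inputs the speedup is about 5.3X for a 10X faster CPU and about 9.2X for a 100X faster CPU, consistent with "only 5X" and "less than 10X" in the text.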
While we can imagine improvements in software file systems via
buffering for near term I/O demands, we need innovation to avoid an I/O
crisis [Boral 83].
3. A Solution: Arrays of Inexpensive Disks
Rapid improvements in capacity of large disks have not been the only
target of disk designers, since personal computers have created a market for
inexpensive magnetic disks. These lower cost disks have lower performance
as well as less capacity. Table I below compares the top-of-the-line
IBM 3380 model AK4 mainframe disk, Fujitsu M2361A “Super Eagle”
minicomputer disk, and the Conner Peripherals CP 3100 personal
computer disk.

Characteristics                  IBM      Fujitsu  Conners  3380 v.  2361 v.
                                 3380     M2361A   CP3100   3100     3100
                                                            (>1 means
                                                            3100 is better)
Disk diameter (inches)           14       10.5     3.5      4        3
Formatted Data Capacity (MB)     7,500    600      100      .01      .2
Price/MB (controller incl.)      $18-$10  $20-$17  $10-$7   1-2.5    1.7-3
MTTF Rated (hours)               30,000   20,000   30,000   1        1.5
MTTF in practice (hours)         100,000  ?        ?        ?        ?
No. Actuators                    4        1        1        .2       1
Maximum I/O's/second/Actuator    50       40       30       .6       .8
Typical I/O's/second/Actuator    30       24       20       .7       .8
Maximum I/O's/second/box         200      40       30       .2       .8
Typical I/O's/second/box         120      24       20       .2       .8
Transfer Rate (MB/sec)           3        2.5      1        .3       .4
Power/box (W)                    6,600    640      10       660      64
Volume (cu. ft.)                 24       3.4      .03      800      110

Table I. Comparison of IBM 3380 disk model AK4 for mainframe
computers, the Fujitsu M2361A “Super Eagle” disk for minicomputers,
and the Conners Peripherals CP 3100 disk for personal computers. By
“Maximum I/O’s/second” we mean the maximum number of average seeks
and average rotates for a single sector access. Cost and reliability
information on the 3380 comes from widespread experience [IBM 87]
[Gawlick 87] and the information on the Fujitsu from the manual [Fujitsu
87], while some numbers on the new CP3100 are based on speculation.
The price per megabyte is given as a range to allow for different prices for
volume discount and different mark-up practices of the vendors. (The 8
watt maximum power of the CP3100 was increased to 10 watts to allow
for the inefficiency of an external power supply, since the other drives
contain their own power supplies.)
One surprising fact is that the number of I/Os per second per actuator in an
inexpensive disk is within a factor of two of the large disks. In several of
the remaining metrics, including price per megabyte, the inexpensive disk
is superior or equal to the large disks.
The small size and low power are even more impressive since disks
such as the CP3100 contain full track buffers and most functions of the
traditional mainframe controller. Small disk manufacturers can provide
such functions in high volume disks because of the efforts of standards
committees in defining higher level peripheral interfaces, such as the ANSI
X3.131-1986 Small Computer System Interface (SCSI). Such standards
have encouraged companies like Adaptec to offer SCSI interfaces as single
chips, in turn allowing disk companies to embed mainframe controller
functions at low cost. Figure 1 compares the traditional mainframe disk
approach and the small computer disk approach. The same SCSI interface
chip embedded as a controller in every disk can also be used as the direct
memory access (DMA) device at the other end of the SCSI bus.
Such characteristics lead to our proposal for building I/O systems as
arrays of inexpensive disks, either interleaved for the large transfers of
supercomputers [Kim 86] [Livny 87] [Salem 86] or independent for the many
small transfers of transaction processing. Using the information in Table
I, 75 inexpensive disks potentially have 12 times the I/O bandwidth of the
IBM 3380 and the same capacity, with lower power consumption and cost.
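
The comparison of 75 inexpensive disks with one IBM 3380 can be reproduced from three Table I rows. The sketch below is one reading of that arithmetic: it assumes the "12 times" figure comes from the typical I/Os per second per box, and it simply multiplies per-disk values by the disk count, ignoring controller and cabling limits. The `Disk` class and its field names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Disk:
    """A few Table I rows for one disk model (values are per box)."""
    name: str
    capacity_mb: float
    typical_ios_per_sec: float
    power_watts: float


# Values copied from Table I.
IBM_3380 = Disk("IBM 3380", capacity_mb=7500, typical_ios_per_sec=120, power_watts=6600)
CP3100 = Disk("Conners CP3100", capacity_mb=100, typical_ios_per_sec=20, power_watts=10)


def array_totals(disk: Disk, count: int) -> Disk:
    """Aggregate `count` independent disks by simple multiplication (an idealization)."""
    return Disk(
        name=f"{count} x {disk.name}",
        capacity_mb=disk.capacity_mb * count,
        typical_ios_per_sec=disk.typical_ios_per_sec * count,
        power_watts=disk.power_watts * count,
    )


if __name__ == "__main__":
    array = array_totals(CP3100, 75)
    print("capacity ratio:", array.capacity_mb / IBM_3380.capacity_mb)
    print("I/O rate ratio:", array.typical_ios_per_sec / IBM_3380.typical_ios_per_sec)
    print("power ratio:", round(array.power_watts / IBM_3380.power_watts, 2))
```

Under these assumptions the array matches the 3380's 7,500 MB, delivers 12.5 times its typical I/O rate, and draws 750 W against 6,600 W.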
4. Caveats
We cannot explore all issues associated with such arrays in the space
available for this paper, so we concentrate on fundamental estimates of
price-performance and reliability. Our reasoning is that if there are no
advantages in price-performance or terrible disadvantages in reliability, then
there is no need to explore further. We characterize a transaction-processing
workload to evaluate performance of a collection of inexpensive disks, but
remember that such a collection is just one hardware component of a
complete transaction-processing system. While designing a complete TPS
based on these ideas is enticing, we will resist that temptation in this
paper. Cabling and packaging, certainly an issue in the cost and reliability
of an array of many inexpensive disks, is also beyond this paper’s scope.
[Figure 1: block diagrams of the Mainframe and Small Computer disk
interface organizations, showing CPU, Memory, and Channel]
Figure 1. Comparison of organizations for typical mainframe and small
computer disk interfaces. Single chip SCSI interfaces such as the Adaptec
AIC-6250 allow the small computer to use a single chip to be the DMA
interface as well as provide an embedded controller for each disk [Adeptec
87]. (The price per megabyte in Table I includes everything in the shaded
boxes above.)
5. And Now The Bad News: Reliability
The unreliability of disks forces computer systems managers to make
backup versions of information quite frequently in case of failure. What
would be the impact on reliability of having a hundredfold increase in
disks? Assuming a constant failure rate--that is, an exponentially
distributed time to failure--and that failures are independent--both
assumptions made by disk manufacturers when calculating the Mean Time
To Failure (MTTF)--the reliability of an array of disks is:

$$\text{MTTF of a Disk Array} = \frac{\text{MTTF of a Single Disk}}{\text{Number of Disks in the Array}}$$

Using the information in Table I, the MTTF of 100 CP 3100 disks is
30,000/100 = 300 hours, or less than 2 weeks. Compared to the 30,000
hour (> 3 years) MTTF of the IBM 3380, this is dismal. If we consider
scaling the array to 1000 disks, then the MTTF is 30 hours or about one
day, requiring an adjective worse than dismal.
Without fault tolerance, large arrays of inexpensive disks are too
unreliable to be useful.
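
The array MTTF figures above come from dividing the single-disk MTTF by the number of disks. A small Python sketch of that calculation, under the same exponential and independent failure assumptions, follows; the function name is illustrative.

```python
def array_mttf_hours(disk_mttf_hours: float, num_disks: int) -> float:
    """MTTF of an array without redundancy: single-disk MTTF / number of disks.

    Assumes a constant failure rate and independent failures, as in the text.
    """
    if num_disks < 1:
        raise ValueError("num_disks must be at least 1")
    return disk_mttf_hours / num_disks


if __name__ == "__main__":
    CP3100_MTTF_HOURS = 30_000  # rated MTTF from Table I
    for n in (100, 1000):
        hours = array_mttf_hours(CP3100_MTTF_HOURS, n)
        print(f"{n} disks: {hours:.0f} hours ({hours / 24:.2f} days)")
```

For 100 disks this gives 300 hours (12.5 days), and for 1000 disks 30 hours (1.25 days).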
6. A Better Solution: RAID
To overcome the reliability challenge, we must make use of extra
disks containing redundant information to recover the original information
when a disk fails. Our acronym for these Redundant Arrays of Inexpensive
Disks is RAID. To simplify the explanation of our final proposal and to
avoid confusion with previous work, we give a taxonomy of five different
organizations of disk arrays, beginning with mirrored disks and progressing
through a variety of alternatives with differing performance and reliability.
We refer to each organization as a RAID level.

The reader should be forewarned that we describe all levels as if
implemented in hardware solely to simplify the presentation, for RAID
ideas are applicable to software implementations as well as hardware.
Reliability. Our basic approach will be to break the arrays into
reliability groups, with each group having extra “check” disks containing
redundant information. When a disk fails we assume that within a short
time the failed disk will be replaced and the information will be
reconstructed on to the new disk using the redundant information. This
time is called the mean time to repair (MTTR). The MTTR can be reduced
if the system includes extra disks to act as “hot” standby spares; when a
disk fails, a replacement disk is switched in electronically. Periodically a
human operator replaces all failed disks. Here are other terms that we use:
D = total number of disks with data (not including extra check disks),
G = number of data disks in a group (not including extra check disks),
C = number of check disks in a group,
nG = D/G = number of groups.
As mentioned above we make the same assumptions that disk
manufacturers make--that failures are exponential and independent. (An
earthquake or power surge is a situation where an array of disks might not
fail independently.) Since these reliability predictions will be very high,
we want to emphasize that the reliability is only of the disk-head
assemblies with this failure model, and not the whole software and
electronic system. In addition, in our view the pace of technology means
extremely high MTTF are “overkill”--for, independent of expected lifetime,
users will replace obsolete disks. After all, how many people are still
using 20 year old disks?
The general MTTF calculation for single-error repairing RAID is
given in two steps. First, the group MTTF is:

$$MTTF_{Group} = \frac{MTTF_{Disk}}{G+C} \times \frac{1}{\text{Probability of another failure in a group before repairing the dead disk}}$$

As more formally derived in the appendix, the probability of a second
failure before the first has been repaired is:

$$\text{Probability of Another Failure} = \frac{MTTR}{MTTF_{Disk}/(\text{No. Disks} - 1)} = \frac{MTTR}{MTTF_{Disk}/(G+C-1)}$$

The intuition behind the formal calculation in the appendix comes
from trying to calculate the average number of second disk failures during
the repair time for X single disk failures. Since we assume that disk failures
occur at a uniform rate, this average number of second failures during the
repair time for X first failures is:

$$\frac{X \times MTTR}{\text{MTTF of remaining disks in the group}}$$

The average number of second failures for a single disk is then:

$$\frac{MTTR}{MTTF_{Disk} / \text{No. of remaining disks in the group}}$$

The MTTF of the remaining disks is just the MTTF of a single disk
divided by the number of good disks in the group, giving the result above.
The second step is the reliability of the whole system, which is
approximately (since $MTTF_{Group}$ is not quite distributed exponentially):

$$MTTF_{RAID} = \frac{MTTF_{Group}}{n_G}$$

Plugging it all together, we get:

$$MTTF_{RAID} = \frac{MTTF_{Disk}}{G+C} \times \frac{MTTF_{Disk}}{(G+C-1) \times MTTR} \times \frac{1}{n_G} = \frac{(MTTF_{Disk})^2}{(G+C) \times n_G \times (G+C-1) \times MTTR}$$

Since the formula is the same for each level, we make the abstract
numbers concrete using these parameters as appropriate: D=100 total data
disks, G=10 data disks per group, MTTF_Disk = 30,000 hours, MTTR = 1
hour, with the check disks per group C determined by the RAID level.
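
To make the two-step calculation concrete, the Python sketch below evaluates the group MTTF and then the whole-array MTTF with the parameters just listed. The function names are illustrative, and the value C = 1 in the usage example is an assumption made only for the example, since C depends on the RAID level.

```python
def group_mttf_hours(mttf_disk: float, g: int, c: int, mttr: float) -> float:
    """Step 1: MTTF of one reliability group of G data disks and C check disks."""
    if g < 1 or c < 1:
        raise ValueError("need at least one data disk and one check disk per group")
    # Probability of a second failure among the G+C-1 remaining disks during repair.
    p_another_failure = mttr / (mttf_disk / (g + c - 1))
    return (mttf_disk / (g + c)) / p_another_failure


def raid_mttf_hours(mttf_disk: float, d: int, g: int, c: int, mttr: float) -> float:
    """Step 2: MTTF of the whole array, approximately group MTTF / number of groups."""
    n_groups = d / g
    return group_mttf_hours(mttf_disk, g, c, mttr) / n_groups


if __name__ == "__main__":
    # D, G, MTTF_Disk, and MTTR are from the text; C = 1 is an illustrative assumption.
    hours = raid_mttf_hours(mttf_disk=30_000, d=100, g=10, c=1, mttr=1)
    print(f"MTTF_RAID = {hours:,.0f} hours (about {hours / 8760:.0f} years)")
```

This agrees with the closed form (MTTF_Disk)^2 / ((G+C) * nG * (G+C-1) * MTTR), which for these inputs is 30,000^2 / (11 * 10 * 10 * 1), or roughly 818,000 hours.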
Reliability Overhead Cost. This is simply the extra check
disks, expressed as a percentage of the number of data disks D. As we shall
see below, the cost varies with RAID level from 100% down to 4%.
Useable Storage Capacity Percentage. Another way to
express this reliability overhead is in terms of the percentage of the total
capacity of data disks and check disks that can be used to store data.
Depending on the organization, this varies from a low of 50% to a high of
96%.
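
Both percentages depend only on G and C. A minimal Python sketch follows, with illustrative function names. The pairing G = 1, C = 1 corresponds to fully duplicated disks; the pairing G = 25, C = 1 is an assumption chosen here because it reproduces the 4% and 96% endpoints.

```python
def overhead_cost_pct(g: int, c: int) -> float:
    """Check disks as a percentage of data disks: (nG * C) / D = C / G."""
    return 100.0 * c / g


def usable_capacity_pct(g: int, c: int) -> float:
    """Share of total (data + check) capacity that stores data: G / (G + C)."""
    return 100.0 * g / (g + c)


if __name__ == "__main__":
    # (G, C) pairs: duplicated disks, and an assumed group of 25 with one check disk.
    for g, c in ((1, 1), (25, 1)):
        print(f"G={g}, C={c}: overhead {overhead_cost_pct(g, c):.0f}%, "
              f"usable {usable_capacity_pct(g, c):.0f}%")
```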
Performance. Since supercomputer applications and
transaction-processing systems have different access patterns and rates, we
need different metrics to evaluate both. For supercomputers we count the
number of reads and writes per second for large blocks of data, with large
defined as getting at least one sector from each data disk in a group. During
large transfers all the disks in a group act as a single unit, each reading or
writing a portion of the large data block in parallel.

A better measure for transaction-processing systems is the number of
individual reads or writes per second. Since transaction-processing
systems (e.g., debits/credits) use a read-modify-write sequence of disk
accesses, we include that metric as well. Ideally during small transfers each
disk in a group can act independently, either reading or writing independent
information. In summary supercomputer applications need a high data rate
while transaction-processing need a high I/O rate.
For both the large and small transfer calculations we assume the
minimum user request is a sector, that a sector is small relative to a track,
and that there is enough work to keep every device busy. Thus sector size
affects both disk storage efficiency and transfer size. Figure 2 shows the
ideal operation of large and small disk accesses in a RAID.
[Figure 2, two panels:
(a) Single Large or “Grouped” Read (1 read spread over G disks)
(b) Several Small or Individual Reads and Writes (G reads and/or writes
spread over G disks)]
Figure 2. Large transfer vs. small transfers in a group of G disks.
The six performance metrics are then the number of reads, writes, and
read-modify-writes per second for both large (grouped) or small (individual)
transfers. Rather than give absolute numbers for each metric, we calculate
efficiency: the number of events per second for a RAID relative to the
corresponding events per second for a single disk. (This is Boral’s I/O
bandwidth per gigabyte [Boral 83] scaled to gigabytes per disk.) In this
paper we are after fundamental differences so we use simple, deterministic
throughput measures for our performance metric rather than latency.
Effective Performance Per Disk. The cost of disks can be a
large portion of the cost of a database system, so the I/O performance per
disk--factoring in the overhead of the check disks--suggests the
cost/performance of a system. This is the bottom line for a RAID.
7. First Level RAID: Mirrored Disks
Mirrored disks are a traditional approach for improving reliability of
magnetic disks. This is the most expensive option we consider since all
disks are duplicated (G=1 and C=1), and every write to a data disk is also a
write to a check disk. Tandem doubles the number of controllers for fault
tolerance, allowing an optimized version of mirrored disks that lets reads
occur in parallel. Table II shows the metrics for a Level 1 RAID assuming
this optimization.

MTTF                           Exceeds Useful Product Lifetime
                               (4,500,000 hrs or > 500 years)
Total Number of Disks          2D
Overhead Cost                  100%
Useable Storage Capacity       50%

Events/Sec vs. Single Disk     Full RAID    Efficiency Per Disk
Large (or Grouped) Reads       2D/S         1.00/S
Large (or Grouped) Writes      D/S          .50/S
Large (or Grouped) R-M-W       4D/3S        .67/S
Small (or Individual) Reads    2D           1.00
Small (or Individual) Writes   D            .50
Small (or Individual) R-M-W    4D/3         .67

Table II. Characteristics of Level 1 RAID. Here we assume that writes
are not slowed by waiting for the second write to complete because the
slowdown for writing 2 disks is minor compared to the slowdown S for
writing a whole group of 10 to 25 disks. Unlike a “pure” mirrored scheme
with extra disks that are invisible to the software, we assume an optimized
scheme with twice as many controllers allowing parallel reads to all disks,
giving full disk bandwidth for large reads and allowing the reads of
read-modify-writes to occur in parallel.
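
The "Efficiency Per Disk" column of Table II can be read as the "Full RAID" rate divided by the total number of disks, which is 2D for Level 1. The Python sketch below checks that reading with exact fractions, along with the Level 1 MTTF from the earlier formula using G = 1, C = 1, D = 100, MTTF_Disk = 30,000 hours, and MTTR = 1 hour. The names are illustrative, and the slowdown S is left out because it appears in both columns of the large-transfer rows.

```python
from fractions import Fraction

# Full-RAID event rates from Table II, expressed as multiples of D.
FULL_RAID_RATE_PER_D = {
    "reads": Fraction(2),
    "writes": Fraction(1),
    "read-modify-writes": Fraction(4, 3),
}
TOTAL_DISKS_PER_D = 2  # Level 1 uses 2D disks in total


def efficiency_per_disk(rate_per_d: Fraction) -> Fraction:
    """Full-RAID rate divided by the total number of disks."""
    return rate_per_d / TOTAL_DISKS_PER_D


def level1_mttf_hours(mttf_disk: float, d: int, mttr: float) -> float:
    """MTTF_RAID = MTTF_Disk^2 / ((G+C) * nG * (G+C-1) * MTTR) with G = C = 1."""
    g, c = 1, 1
    n_groups = d / g
    return mttf_disk ** 2 / ((g + c) * n_groups * (g + c - 1) * mttr)


if __name__ == "__main__":
    for name, rate in FULL_RAID_RATE_PER_D.items():
        print(f"{name}: {float(efficiency_per_disk(rate)):.2f}")
    print(f"MTTF: {level1_mttf_hours(30_000, 100, 1):,.0f} hours")
```

The fractions come out to 1, 1/2, and 2/3, matching the 1.00, .50, and .67 entries, and the MTTF comes out to 4,500,000 hours.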
When individual accesses are distributed across multiple disks, average
queueing, seek, and rotate delays may differ from the single disk case.
Although bandwidth may be unchanged, it is distributed more evenly,
reducing variance in queueing delay and, if the disk load is not too high,
also reducing the expected queueing delay through parallelism [Livny 87].
When many arms seek to the same track then rotate to the described sect