Chapter 1
A Case for Redundant Arrays of Inexpensive
Disks (RAID)
DAVID A. PATTERSON, GARTH GIBSON,
AND RANDY H. KATZ
Computer Science Division
Department of Electrical Engineering and Computer Sciences
571 Evans Hall
University of California
Berkeley, CA 94720
Reprinted from Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 109-116, June 1988.

Abstract
Increasing performance of CPUs and memories will be
squandered if not matched by a similar performance
increase in I/O. While the capacity of Single Large
Expensive Disk (SLED) has grown rapidly, the
performance improvement of SLED has been modest.
Redundant Arrays of Inexpensive Disks (RAID), based on
the magnetic disk technology developed for personal
computers, offers an attractive alternative to SLED,
promising improvements of an order of magnitude in
performance, reliability, power consumption, and
scalability. This paper introduces five levels of RAIDs,
giving their relative cost/performance, and compares
RAIDs to an IBM 3380 and a Fujitsu Super Eagle.
1.1 Background: Rising CPU and
Memory Performance
The users of computers are currently enjoying
unprecedented growth in the speed of computers. Gordon
Bell said that between 1974 and 1984, single chip
computers improved in performance by 40% per year,
about twice the rate of minicomputers [Bell 84]. In the
following year Bill Joy predicted an even faster growth
[Joy 85]
MIPS = 2^(Year-1984)
Mainframe and supercomputer manufacturers, having difficulty keeping pace with this rapid growth predicted by "Joy's Law", cope by offering multiprocessors as their top-of-the-line product.
But a fast CPU does not a fast system make. Gene
Amdahl related CPU speed to main memory size using
this rule [Siewiorek 82]
Each CPU instruction per second requires one byte of
main memory.
If computer system costs are not to be dominated by the
cost of memory, then Amdahl's constant suggests that
memory chip capacity should grow at the same rate.
Gordon Moore predicted that growth rate over 20 years
ago
transistors/chip = 2^(Year-1964)
As predicted by Moore’s Law, RAMs have quadrupled in
capacity every two [Moore 75] to three years [Moore 86].
Recently this ratio of megabytes of main memory to
MIPS has been defined as alpha [Garcia 84], with
Amdahl's constant meaning alpha = 1. In part because of
the rapid drop of memory prices, main memory sizes have
grown faster than CPU speeds and many machines are
shipped today with alphas of 3 or higher.
To maintain the balance of costs in computer systems,
secondary storage must match the advances in other parts
of the system. A key measure of disk technology is the
growth in the maximum number of bits that can be stored
per square inch, or the bits per inch in a track times the
number of tracks per inch. Called MAD, for maximal areal density, the "First Law in Disk Density" predicts [Frank87]:

MAD = 10^((Year-1971)/10)
Magnetic disk technology has doubled capacity and
halved price every three years, in line with the growth rate
of semiconductor memory, and in practice between 1967
and 1979 the disk capacity of the average IBM data
processing system more than kept up with its main
memory [Stevens81].
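As a rough illustration (ours, not the paper's), the three growth rules quoted above can be evaluated directly; the short Python sketch below simply plugs a year into Joy's Law, Moore's Law, and the MAD rule.

```python
# Illustrative evaluation of the growth rules quoted in the text (not from the paper).
def joys_law_mips(year):
    return 2 ** (year - 1984)           # MIPS = 2^(Year-1984)

def moores_law_transistors(year):
    return 2 ** (year - 1964)           # transistors/chip = 2^(Year-1964)

def mad(year):
    return 10 ** ((year - 1971) / 10)   # maximal areal density

print(joys_law_mips(1988))              # 16 MIPS predicted for 1988
print(mad(1988) / mad(1985))            # ~2x: density doubles about every three years
```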
Capacity is not the only memory characteristic that
must grow rapidly to maintain system balance, since the
speed with which instructions and data are delivered to a
CPU also determines its ultimate performance. The speed
of main memory has kept pace for two reasons:
(1) the invention of caches, showing that a small buffer
can be managed automatically to contain a substantial
fraction of memory references;
(2) the SRAM technology used to build caches, whose speed has improved at a rate of 40% to 100% per
year.
In contrast to primary memory technologies, the
performance of single large expensive magnetic disks
(SLED) has improved at a modest rate. These mechanical
devices are dominated by the seek and the rotation delays:
from 1971 to 1981, the raw seek time for a high-end IBM
disk improved by only a factor of two while the rotation
time did not change [Harker81]. Greater density means a
higher transfer rate when the information is found, and
extra heads can reduce the average seek time, but the raw
seek time only improved at a rate of 7% per year. There is
no reason to expect a faster rate in the near future.
To maintain balance, computer systems have been
using even larger main memories or solid state disks to
buffer some of the I/O activity. This may be a fine solution
for applications whose I/O activity has locality of
reference and for which volatility is not an issue, but
applications dominated by a high rate of random requests
for small pieces of data (such as transaction-processing) or
by a low number of requests for massive amounts of data
(such as large simulations running on supercomputers) are
facing a serious performance limitation.
1.2 The Pending I/O Crisis
What is the impact of improving the performance of some
pieces of a problem while leaving others the same?
Amdahl's answer is now known as Amdahl's Law
[Amdahl67]
S = 1 / ((1 - f) + f/k)
where:
S = the effective speedup,
f = fraction of work in faster mode, and
k = speedup while in faster mode.
Suppose that some current applications spend 10% of their time in I/O. Then when computers are 10X faster--according to Bill Joy, in just over three years--Amdahl's Law predicts that the effective speedup will be only 5X.
When we have computers 100X faster - via evolution of
uniprocessors or by multiprocessors - this application will
be less than 10X faster, wasting 90% of the potential
speedup.
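To make the arithmetic concrete, here is a minimal Python sketch (ours, not part of the paper) that evaluates Amdahl's Law for the example above, where the 90% of the work that is not I/O is the fraction sped up (f = 0.9):

```python
def amdahl_speedup(f, k):
    """Amdahl's Law: effective speedup when a fraction f of the work
    runs k times faster and the remaining (1 - f) is unchanged."""
    return 1.0 / ((1.0 - f) + f / k)

# The text's example: 10% of time is I/O, so f = 0.9 of the work speeds up.
print(amdahl_speedup(0.9, 10))    # ~5.3X, the "only 5X" of the text
print(amdahl_speedup(0.9, 100))   # ~9.2X, less than 10X despite a 100X faster CPU
```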
While we can imagine improvements in software file
systems via buffering for near term I/O demands, we need
innovation to avoid an I/O crisis [Boral 83].
1.3 A Solution: Arrays of Inexpensive
Disks
Rapid improvements in capacity of large disks have not
been the only target of disk designers, since personal
computers have created a market for inexpensive magnetic
disks. These lower cost disks have lower performance as
well as less capacity. Table 1.1 below compares the top-
of-the-line IBM 3380 model AK4 mainframe disk, Fujitsu
M2361A "Super Eagle" minicomputer disk, and the
Conner Peripherals CP 3100 personal computer disk.
Table 1.1 Comparison of IBM 3380 disk model AK4 for
mainframe computers, the Fujitsu M2361A "Super Eagle"
disk for minicomputers, and the Conner Peripherals CP
3100 disk for personal computers. By "Maximum
I/O's/second" we mean the maximum number of average
seeks and average rotates for a single sector access. Cost
and reliability information on the 3380 comes from
widespread experience [IBM 87] [Gawlick87] and the
information on the Fujitsu from the manual [Fujitsu 87],
while some numbers on the new CP3100 are based on
speculation. The price per megabyte is given as a range to
allow for different prices for volume discount and
different mark-up practices of the vendors. (The 8 watt maximum power of the CP3100 was increased to 10 watts to allow for the inefficiency of an external power supply, since the other drives contain their own power supplies.)
One surprising fact is that the number of I/Os per
second per actuator in an inexpensive disk is within a
factor of two of the large disks. In several of the remaining
metrics, including price per megabyte, the inexpensive
disk is superior or equal to the large disks.
The small size and low power are even more
impressive since disks such as the CP3100 contain full
track buffers and most functions of the traditional
mainframe controller. Small disk manufacturers can
provide such functions in high volume disks because of
the efforts of standards committees in defining higher
level peripheral interfaces, such as the ANSI X3.131-1986
Small Computer System Interface (SCSI). Such
standards have encouraged companies like Adaptec to
offer SCSI interfaces as single chips, in turn allowing disk
companies to embed mainframe controller functions at
low cost. Figure 1.1 compares the traditional mainframe
disk approach and the small computer disk approach. The
same SCSI interface chip embedded as a controller in
every disk can also be used as the direct memory access
(DMA) device at the other end of the SCSI bus.
Figure 1.1 Comparison of organizations for typical
mainframe and small computer disk interface. Single chip
SCSI interfaces such as the Adaptec AIC-6250 allow the
small computer to use a single chip to be the DMA
interface as well as provide an embedded controller for
each disk [Adaptec87]. (The price per megabyte in Table 1.1 includes everything in the shaded boxes above.)
Such characteristics lead to the proposal of building I/O
systems as arrays of inexpensive disks, either interleaved
for the large transfers of supercomputers [Kim 86][Livny
87][Salem 86] or independent for the many small transfers
of transaction processing. Using the information in Table 1.1,
75 inexpensive disks potentially have 12 times the I/O
bandwidth of the IBM 3380 and the same capacity, with
lower power consumption and cost.
1.4 Caveats
We cannot explore all issues associated with such arrays
in the space available for this paper, so we concentrate on
the price-performance and reliability. Our reasoning is that if there are no advantages in price-performance or terrible disadvantages in reliability, then there is no need to explore further. We characterize the transaction-
processing workload to evaluate performance of a
collection of inexpensive disks, but remember that such a
collection is just one hardware component of a complete
transaction-processing system. While designing a
complete TPS based on these ideas is enticing, we will
resist that temptation in this paper. Cabling and packaging,
certainly an issue in the cost and reliability of an array of
many inexpensive disks, is also beyond this paper's scope.
1.5 And Now The Bad News: Reliability
The unreliability of disks forces computer systems
managers to make backup versions of information quite
frequently in case of failure. What would be the impact on
reliability of having a hundredfold increase in disks?
Assuming a constant failure rate--that is, an exponentially
distributed time to failure - and that failures are
independent--both assumptions made by disk
manufacturers when calculating the Mean Time To Failure
(MTTF) - the reliability of an array of disks is:
MTTF of a Disk Array = MTTF of a Single Disk / Number of Disks in the Array
Using the information in Table 1.1, the MTTF of 100
CP 3100 disks is 30,000/100 = 300 hours, or less than 2
weeks. Compared to the 30,000 hour (> 3 years) MTTF of
the IBM 3380, this is dismal. If we consider scaling the
array to 1000 disks, then the MTTF is 30 hours or about
one day, requiring an adjective worse than dismal.
Without fault tolerance, large arrays of inexpensive
disks are too unreliable to be useful.
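The scaling argument above is simple division; a brief Python sketch (ours, not the paper's) reproduces the numbers, under the same independence and constant-failure-rate assumptions:

```python
def array_mttf(mttf_single_disk, n_disks):
    """MTTF of an array with no redundancy, assuming independent,
    exponentially distributed disk failures (as stated in the text)."""
    return mttf_single_disk / n_disks

print(array_mttf(30_000, 100))    # 300 hours: less than two weeks
print(array_mttf(30_000, 1000))   # 30 hours: about one day
```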
1.6 A Better Solution: RAID
To overcome the reliability challenge, we must make use
of extra disks containing redundant information to recover
the original information when a disk fails. Our acronym
for these Redundant Arrays of Inexpensive Disks is RAID.
To simplify the explanation of our final proposal and to
avoid confusion with previous work, we give the
taxonomy of five different organizations of disk arrays,
beginning with mirrored disks and progressing through a
variety of alternatives with differing performance and
reliability. We refer to each organization as a RAID level.
The reader should be forewarned that we describe all
levels as if implemented in hardware solely to simplify the
presentation, for RAID ideas are applicable to software
implementations as well as hardware.
Reliability Our basic approach will be to break the
arrays into reliability groups, with each group having extra
"check" disks containing the redundant information. When
a disk fails we assume that within a short time the failed
disk will be replaced and the information will be
reconstructed on to the new disk using the redundant
information. This time is called the mean time to repair
(MTTR). The MTTR can be reduced if the system
includes extra disks to act as "hot" standby spares; when a
disk fails, a replacement disk is switched in electronically.
Periodically the human operator replaces all failed disks.
Here are some other terms that we use:
D = total number of disks with data (not including the
extra check disks);
G = number of data disks in a group (not including the
extra check disks);
C = number of check disks in a group;
n_G = D/G = number of groups.
As mentioned above we make the same assumptions
that the disk manufacturers make--that the failures are
exponential and independent. (An earthquake or power
surge is a situation where an array of disks might not fail
independently.) Since these reliability predictions will be
very high, we want to emphasize that the reliability is only
of the disk-head assemblies with this failure model, and not the whole software and electronic system. In addition, in our view the pace of technology means extremely high MTTFs are "overkill" - for, independent of expected lifetime, users will replace obsolete disks. After all, how many people are still using 20-year-old disks?
The general MTTF calculation for single-error repairing RAID is given in two steps. First, the group MTTF is

MTTF_Group = (MTTF_Disk / (G + C)) * (1 / Probability of another failure in a group before repairing the dead disk)

As more formally derived in the appendix, the probability of a second failure before the first has been repaired is

Probability of Another Failure = MTTR / (MTTF_Disk / (No. Disks - 1)) = MTTR / (MTTF_Disk / (G + C - 1))
The intuition behind the formal calculation in the appendix comes from trying to calculate the average number of second disk failures during the repair time for X single-disk failures. Since we assume that disk failures occur at a uniform rate, this average number of second failures during the repair time for X first failures is

X * MTTR / (MTTF of the remaining disks in the group)

The average number of second failures for a single disk is then

MTTR / (MTTF_Disk / No. of remaining disks in the group)

The MTTF of the remaining disks is just the MTTF of a single disk divided by the number of good disks in the group, giving the result above.
The second step is the reliability of the whole system, which is approximately (since MTTF_Group is not quite distributed exponentially)

MTTF_RAID = MTTF_Group / n_G
Plugging it all together, we get:

MTTF_RAID = (MTTF_Disk / (G + C)) * (1 / ((G + C - 1) * MTTR / MTTF_Disk)) * (1 / n_G)

         = MTTF_Disk^2 / ((G + C) * n_G * (G + C - 1) * MTTR)

         = MTTF_Disk^2 / ((D + C * n_G) * (G + C - 1) * MTTR)
Since the formula is the same for each level, we make
the abstract numbers concrete using these parameters as
appropriate: D=100 total data disks, G=10 data disks per
group, MTTF_Disk = 30,000 hours, MTTR = 1 hour, with the
check disks per group C determined by the RAID level.
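As a concrete check of the formula (ours, not the paper's), the sketch below evaluates MTTF_RAID with the parameters just given; the specific choices C = 1 for the parity case and G = 1, C = 1 for mirroring are illustrative assumptions, since C is determined by the RAID level:

```python
def raid_mttf(D, G, C, mttf_disk, mttr):
    """Approximate MTTF (in hours) of a single-error-repairing RAID,
    using the formula derived above.
    D: total data disks, G: data disks per group, C: check disks per group."""
    n_g = D / G                                            # number of groups
    return mttf_disk ** 2 / ((G + C) * n_g * (G + C - 1) * mttr)

# Parameters from the text: D = 100, MTTF_Disk = 30,000 hours, MTTR = 1 hour.
print(raid_mttf(100, 1, 1, 30_000, 1))    # mirroring: ~4.5 million hours
print(raid_mttf(100, 10, 1, 30_000, 1))   # G = 10 with one check disk: ~820,000 hours
```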
Reliability Overhead Cost This is simply the extra
check disks, expressed as a percentage of the number of
data disks D. As we shall see below, the cost varies with
RAID level from 100% down to 4%.
Useable Storage Capacity Percentage Another way to
express this reliability overhead is in terms of the
percentage of the total capacity of data disks and check
disks that can be used to store data. Depending on the
organization, this varies from a low of 50% to a high of
96%.
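A quick Python sketch (ours) shows where these extremes come from; the group sizes used are assumptions chosen only to reproduce the quoted endpoints (mirroring with G = 1, C = 1, and a large parity group such as G = 25, C = 1):

```python
def overhead_pct(G, C):
    return 100.0 * C / G        # check disks as a percentage of data disks

def usable_pct(G, C):
    return 100.0 * G / (G + C)  # share of total capacity holding user data

print(overhead_pct(1, 1), usable_pct(1, 1))     # 100% overhead, 50% usable
print(overhead_pct(25, 1), usable_pct(25, 1))   # 4% overhead, ~96% usable
```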
Performance Since supercomputer applications and
transaction-processing systems have different access
patterns and rates, we need different metrics to evaluate
both. For supercomputers we count the number of reads
and writes per second for large blocks of data, with large
defined as getting at least one sector from each data disk
in a group. During large transfers all the disks in a group
act as a single unit, each reading or writing a portion of the
large data block in parallel.
A better measure for transaction-processing systems is
the number of individual reads or writes per second. Since
transaction-processing systems (e.g., debits/credits) use a
read-modify-write sequence of disk accesses, we include
that metric as well. Ideally during small transfers each disk
in a group can act independently, either reading or writing
independent information. In summary, supercomputer applications need a high data rate while transaction processing needs a high I/O rate.
For both the large and small transfer calculations we
assume the minimum user request is a sector, that a sector
is small relative to a track, and that there is enough work
to keep every device busy. Figure 1.2 shows the ideal
operation of large and small disk accesses in a RAID.
Figure 1.2 Large transfer vs. small transfer
in a group of G disks
The six performance metrics are then the number of
reads, writes, and read-modify-writes per second for both
large (grouped) or small (individual) transfers. Rather than
give absolute numbers for each metric, we calculate efficiency: the number of events per second for a RAID relative to the corresponding events per second for a single disk. (This is Boral's I/O bandwidth per gigabyte [Boral
83] scaled to gigabytes per disk). In this paper we are after
fundamental differences so we use simple, deterministic
throughput measures for our performance metric rather
than latency.
Effective Performance Per Disk The cost of disks can
be a large portion of the cost of a database system, so the
I/O performance per disk--factoring in the overhead of the
check disks--suggests the cost/performance of a system.
This is the bottom line for a RAID.
1.7 First Level RAID: Mirrored Disks
Mirrored disks are a traditional approach for improving
reliability of magnetic disks. This is the most expensive
option since all disks are duplicated (G=1 and C=1), and every write to a data disk is also a write to a check disk. Tandem doubles the number of controllers for fault tolerance, allowing an optimized version of mirrored disks that lets reads occur in parallel. Table 1.2 shows the
metrics for a Level 1 RAID assuming this optimization.
Table 1.2 Characteristics of Level 1 RAID. Here we
assume that writes are not slowed by waiting for the
second write to complete because the slowdown for
writing 2 disks is minor compared to the slowdown S for
writing a whole group of 10 to 25 disks. Unlike a "pure"
mirrored scheme with extra disks that are invisible to the software, we assume an optimized scheme with twice as many controllers allowing parallel reads to all disks, giving full disk bandwidth for large reads and allowing the reads of read-modify-writes to occur in parallel.
When individual accesses are distributed across
multiple disks, average queueing, seek, and rotate delays
may differ from the single disk case. Although bandwidth
may be unchanged, it is distributed more evenly, reducing
variance in queueing delay and, if the disk load is not too
high, also reducing the expected queueing delay through
parallelism [Livny 87]. When many arms seek to the same
track then rotate to the described sector, the average seek
and rotate time will be larger than the average for a single
disk, tending toward the worst case times. This effect should generally not more than double the average access
time to a single sector while still getting many sectors in
parallel. In the special case of mirrored disks with
sufficient controllers, the choice between arms that can
read any data sector will reduce the time for average read
seek by up to 45% [Bitton 88].
To allow for these factors but to retain our fundamental
emphasis we apply a slowdown factor, S, when there are more than two disks in a group.