Chapter 1: A Case for Redundant Arrays of Inexpensive Disks (RAID)

DAVID A. PATTERSON, GARTH GIBSON, AND RANDY H. KATZ
Computer Science Division, Department of Electrical Engineering and Computer Sciences, 571 Evans Hall, University of California, Berkeley, CA 94720

Reprinted from Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 109-116, June 1988.

Abstract

Increasing performance of CPUs and memories will be squandered if not matched by a similar performance increase in I/O. While the capacity of the Single Large Expensive Disk (SLED) has grown rapidly, the performance improvement of SLED has been modest. Redundant Arrays of Inexpensive Disks (RAID), based on the magnetic disk technology developed for personal computers, offers an attractive alternative to SLED, promising improvements of an order of magnitude in performance, reliability, power consumption, and scalability. This paper introduces five levels of RAIDs, giving their relative cost/performance, and compares RAIDs to an IBM 3380 and a Fujitsu Super Eagle.

1.1 Background: Rising CPU and Memory Performance

The users of computers are currently enjoying unprecedented growth in the speed of computers. Gordon Bell said that between 1974 and 1984, single chip computers improved in performance by 40% per year, about twice the rate of minicomputers [Bell 84]. In the following year Bill Joy predicted an even faster growth [Joy 85]:

    MIPS = 2^(Year - 1984)

Mainframe and supercomputer manufacturers, having difficulty keeping pace with the rapid growth predicted by "Joy's Law", cope by offering multiprocessors as their top-of-the-line product.

But a fast CPU does not a fast system make. Gene Amdahl related CPU speed to main memory size using this rule [Siewiorek 82]:

    Each CPU instruction per second requires one byte of main memory.

If computer system costs are not to be dominated by the cost of memory, then Amdahl's constant suggests that memory chip capacity should grow at the same rate. Gordon Moore predicted that growth rate over 20 years ago:

    transistors/chip = 2^(Year - 1964)

As predicted by Moore's Law, RAMs have quadrupled in capacity every two [Moore 75] to three years [Moore 86]. Recently this ratio of megabytes of main memory to MIPS has been defined as alpha [Garcia 84], with Amdahl's constant meaning alpha = 1. In part because of the rapid drop of memory prices, main memory sizes have grown faster than CPU speeds and many machines are shipped today with alphas of 3 or higher.

To maintain the balance of costs in computer systems, secondary storage must match the advances in other parts of the system. A key measure of disk technology is the growth in the maximum number of bits that can be stored per square inch, that is, the bits per inch in a track times the number of tracks per inch. Called MAD, for maximal areal density, the "First Law in Disk Density" predicts [Frank 87]:

    MAD = 10^((Year - 1971) / 10)

Magnetic disk technology has doubled capacity and halved price every three years, in line with the growth rate of semiconductor memory, and in practice between 1967 and 1979 the disk capacity of the average IBM data processing system more than kept up with its main memory [Stevens 81].
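As a quick aside (not from the original paper), the three growth laws quoted above can be evaluated directly; the years below are arbitrary examples, and the units are those implied by each formula.

```python
def joy_mips(year):
    """Joy's Law: predicted single-chip performance in MIPS."""
    return 2 ** (year - 1984)

def moore_transistors(year):
    """Moore's prediction for transistors per chip."""
    return 2 ** (year - 1964)

def mad(year):
    """The "First Law in Disk Density": maximal areal density (MAD)."""
    return 10 ** ((year - 1971) / 10)

for year in (1984, 1988):
    print(year, joy_mips(year), moore_transistors(year), round(mad(year), 1))
```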
Capacity is not the only memory characteristic that must grow rapidly to maintain system balance, since the speed with which instructions and data are delivered to a CPU also determines its ultimate performance. The speed of main memory has kept pace for two reasons: (1) the invention of caches, showing that a small buffer can be managed automatically to contain a substantial fraction of memory references; and (2) the SRAM technology used to build caches, whose speed has improved at the rate of 40% to 100% per year.

In contrast to primary memory technologies, the performance of single large expensive magnetic disks (SLED) has improved at a modest rate. These mechanical devices are dominated by the seek and the rotation delays: from 1971 to 1981, the raw seek time for a high-end IBM disk improved by only a factor of two while the rotation time did not change [Harker 81]. Greater density means a higher transfer rate when the information is found, and extra heads can reduce the average seek time, but the raw seek time only improved at a rate of 7% per year. There is no reason to expect a faster rate in the near future.

To maintain balance, computer systems have been using even larger main memories or solid state disks to buffer some of the I/O activity. This may be a fine solution for applications whose I/O activity has locality of reference and for which volatility is not an issue, but applications dominated by a high rate of random requests for small pieces of data (such as transaction processing) or by a low number of requests for massive amounts of data (such as large simulations running on supercomputers) are facing a serious performance limitation.

1.2 The Pending I/O Crisis

What is the impact of improving the performance of some pieces of a problem while leaving others the same? Amdahl's answer is now known as Amdahl's Law [Amdahl 67]:

    S = 1 / ((1 - f) + f/k)

where S = the effective speedup, f = fraction of work in faster mode, and k = speedup while in faster mode.

Suppose that some current applications spend 10% of their time in I/O. Then when computers are 10X faster--according to Bill Joy, in just over three years--Amdahl's Law predicts that the effective speedup will be only 5X. When we have computers 100X faster--via evolution of uniprocessors or by multiprocessors--this application will be less than 10X faster, wasting 90% of the potential speedup. While we can imagine improvements in software file systems via buffering for near term I/O demands, we need innovation to avoid an I/O crisis [Boral 83].
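To make the arithmetic concrete, here is a minimal sketch (not from the paper) of Amdahl's Law applied to the example above, where 90% of the work can be sped up and 10% remains I/O-bound.

```python
def effective_speedup(f, k):
    """Amdahl's Law: f = fraction of work in the faster mode,
    k = speedup while in the faster mode."""
    return 1.0 / ((1.0 - f) + f / k)

# An application spending 10% of its time in I/O can speed up only the other 90%:
print(round(effective_speedup(0.9, 10), 1))   # ~5.3, i.e. "only 5X"
print(round(effective_speedup(0.9, 100), 1))  # ~9.2, i.e. "less than 10X faster"
```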
By "Maximum I/O's/second" we mean the maximum number of average seeks and average rotates for a single sector access. Cost and reliability information on the 3380 comes from widespread experience [IBM 87] [Gawlick87] and the information on the Fujitsu from the manual [Fujitsu 87], while some numbers on the new CP3100 are based on speculation. The price per megabyte is given as a range to allow for different prices for volume discount and different mark-up practices of the vendors. (The 8 watt maximum power of the CP3100 was increased to 10 watts to allow for the inefficiency of an external power supply (since the other drives contain their own power supplies). One surprising fact is that the number of I/Os per second per actuator in an inexpensive disk is within a factor of two of the large disks. In several of the remaining metrics, including price per megabyte, the inexpensive disk is superior or equal to the large disks. The small size and low power are even more impressive since disks such as the CP3100 contain full track buffers and most functions of the traditional mainframe controller. Small disk manufacturers can Administrator 高亮 Chapter 1 A Case for Redundant Arrays of Inexpensive Disks (RAID) 5 provide such functions in high volume disks because of the efforts of standards committees in defining higher level peripheral interfaces, such as the ANSI X3.131-1986 Small Computer Synchronous Interface (SCSI). Such standards have encouraged companies like Adaptec to offer SCSI interfaces as single chips, in turn allowing disk companies to embed mainframe controller functions at low cost. Figure 1.1 compares the traditional mainframe disk approach and the small computer disk approach. The same SCSI interface chip embedded as a controller in every disk can also be used as the direct memory access (DMA) device at the other end of the SCSI bus. Figure 1.1 Comparison of organizations for typical mainframe and small computer disk interface. Single chip SCSI interfaces such as the Adaptec AIC-6250 allow the small computer to use a single chip to be the DMA interface as well as provide an embedded controller for each disk [Adaptec87] (The price per megabyte in Table 1.1 includes everything in the shaded boxes above). Such characteristics lead to the proposal of building I/O systems as arrays of inexpensive disks, either interleaved for the large transfers of supercomputers [Kim 86][Livny 87][Salem 86] or independent for the many small transfers of transaction processing. Using the information in Table I, 75 inexpensive disks potentially have 12 times the I/O bandwidth of the IBM 3380 and the same capacity, with lower power consumption and cost. 1.4 Caveats We cannot explore all issues associated with such arrays in the space available for this paper, so we concentrate on the price-performance and reliability. Our reasoning is that there are no advantages in price-performance or terrible disadvantages in reliability, then there is no need to explore further. We characterize the transaction- processing workload to evaluate performance of a collection of inexpensive disks, but remember that such a collection is just one hardware component of a complete transaction-processing system. While designing a complete TPS based on these ideas is enticing, we will resist that temptation in this paper. Cabling and packaging, certainly an issue in the cost and reliability of an array of many inexpensive disks, is also beyond this paper's scope. 
1.4 Caveats

We cannot explore all issues associated with such arrays in the space available for this paper, so we concentrate on price-performance and reliability. Our reasoning is that if there are no advantages in price-performance, or if there are terrible disadvantages in reliability, then there is no need to explore further. We characterize the transaction-processing workload to evaluate the performance of a collection of inexpensive disks, but remember that such a collection is just one hardware component of a complete transaction-processing system. While designing a complete TPS based on these ideas is enticing, we will resist that temptation in this paper. Cabling and packaging, certainly an issue in the cost and reliability of an array of many inexpensive disks, is also beyond this paper's scope.

1.5 And Now The Bad News: Reliability

The unreliability of disks forces computer systems managers to make backup versions of information quite frequently in case of failure. What would be the impact on reliability of having a hundredfold increase in disks? Assuming a constant failure rate--that is, an exponentially distributed time to failure--and that failures are independent--both assumptions made by disk manufacturers when calculating the Mean Time To Failure (MTTF)--the reliability of an array of disks is:

    MTTF of a Disk Array = MTTF of a Single Disk / Number of Disks in the Array

Using the information in Table 1.1, the MTTF of 100 CP 3100 disks is 30,000/100 = 300 hours, or less than 2 weeks. Compared to the 30,000-hour (> 3 years) MTTF of the IBM 3380, this is dismal. If we consider scaling the array to 1000 disks, then the MTTF is 30 hours or about one day, requiring an adjective worse than dismal. Without fault tolerance, large arrays of inexpensive disks are too unreliable to be useful.

1.6 A Better Solution: RAID

To overcome the reliability challenge, we must make use of extra disks containing redundant information to recover the original information when a disk fails. Our acronym for these Redundant Arrays of Inexpensive Disks is RAID. To simplify the explanation of our final proposal and to avoid confusion with previous work, we give a taxonomy of five different organizations of disk arrays, beginning with mirrored disks and progressing through a variety of alternatives with differing performance and reliability. We refer to each organization as a RAID level. The reader should be forewarned that we describe all levels as if implemented in hardware solely to simplify the presentation, for RAID ideas are applicable to software implementations as well as hardware.

Reliability

Our basic approach will be to break the arrays into reliability groups, with each group having extra "check" disks containing the redundant information. When a disk fails, we assume that within a short time the failed disk will be replaced and the information will be reconstructed onto the new disk using the redundant information. This time is called the mean time to repair (MTTR). The MTTR can be reduced if the system includes extra disks to act as "hot" standby spares; when a disk fails, a replacement disk is switched in electronically. Periodically the human operator replaces all failed disks. Here are some other terms that we use:

D = total number of disks with data (not including the extra check disks);
G = number of data disks in a group (not including the extra check disks);
C = number of check disks in a group;
nG = D/G = number of groups.

As mentioned above, we make the same assumptions that the disk manufacturers make--that the failures are exponential and independent. (An earthquake or power surge is a situation where an array of disks might not fail independently.) Since these reliability predictions will be very high, we want to emphasize that the reliability is only of the disk-head assemblies with this failure model, and not the whole software and electronic system. In addition, in our view the pace of technology means extremely high MTTFs are "overkill"--for, independent of expected lifetime, users will replace obsolete disks. After all, how many people are still using 20-year-old disks?
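Before working through the RAID formula, here is a minimal sketch (not from the paper) of the non-redundant arithmetic from Section 1.5, under the same exponential, independent-failure assumptions:

```python
def array_mttf_no_redundancy(mttf_single_disk_hours, num_disks):
    """MTTF of a disk array with no redundant disks."""
    return mttf_single_disk_hours / num_disks

print(array_mttf_no_redundancy(30_000, 100))    # 300 hours, less than 2 weeks
print(array_mttf_no_redundancy(30_000, 1_000))  # 30 hours, about one day
```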
The general MTTF calculation for single-error repairing RAID is given in two steps. First, the group MTTF is:

    MTTF_Group = (MTTF_Disk / (G + C)) * (1 / Probability of another failure in a group before repairing the dead disk)

As more formally derived in the appendix, the probability of a second failure before the first has been repaired is:

    Probability of another failure = MTTR / (MTTF_Disk / (No. of remaining disks in the group))
                                   = MTTR / (MTTF_Disk / (G + C - 1))

The intuition behind the formal calculation in the appendix comes from trying to calculate the average number of second disk failures during the repair time for X single disk failures. Since we assume that disk failures occur at a uniform rate, this average number of second failures during the repair time for X first failures is:

    X * MTTR / (MTTF of the remaining disks in the group)

The average number of second failures for a single disk is then:

    MTTR / (MTTF_Disk / No. of remaining disks in the group)

The MTTF of the remaining disks is just the MTTF of a single disk divided by the number of good disks in the group, giving the result above.

The second step is the reliability of the whole system, which is approximately (since MTTF_Group is not quite distributed exponentially):

    MTTF_RAID = MTTF_Group / nG

Plugging it all together, we get:

    MTTF_RAID = (MTTF_Disk / (G + C)) * (MTTF_Disk / ((G + C - 1) * MTTR)) * (1 / nG)
              = (MTTF_Disk)^2 / ((G + C) * nG * (G + C - 1) * MTTR)
              = (MTTF_Disk)^2 / ((D + C * nG) * (G + C - 1) * MTTR)

Since the formula is the same for each level, we make the abstract numbers concrete using these parameters as appropriate: D = 100 total data disks, G = 10 data disks per group, MTTF_Disk = 30,000 hours, MTTR = 1 hour, with the check disks per group C determined by the RAID level.
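A minimal sketch of this calculation (not from the paper) follows; C = 1, one check disk per group, is used purely for illustration, since C depends on the RAID level.

```python
def raid_mttf(mttf_disk, mttr, D, G, C):
    """Single-error-repairing RAID MTTF from the formula above:
    MTTF_RAID = MTTF_Disk^2 / ((D + C*nG) * (G + C - 1) * MTTR)."""
    nG = D / G                       # number of groups
    return mttf_disk ** 2 / ((D + C * nG) * (G + C - 1) * mttr)

# Paper parameters: D=100, G=10, MTTF_Disk=30,000 hours, MTTR=1 hour; C=1 assumed.
print(round(raid_mttf(30_000, 1, D=100, G=10, C=1)))  # ~818,182 hours, roughly 93 years
```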
Reliability Overhead Cost

This is simply the extra check disks, expressed as a percentage of the number of data disks D. As we shall see below, the cost varies with RAID level from 100% down to 4%.

Useable Storage Capacity Percentage

Another way to express this reliability overhead is in terms of the percentage of the total capacity of data disks and check disks that can be used to store data. Depending on the organization, this varies from a low of 50% to a high of 96%.

Performance

Since supercomputer applications and transaction-processing systems have different access patterns and rates, we need different metrics to evaluate both. For supercomputers we count the number of reads and writes per second for large blocks of data, with large defined as getting at least one sector from each data disk in a group. During large transfers all the disks in a group act as a single unit, each reading or writing a portion of the large data block in parallel.

A better measure for transaction-processing systems is the number of individual reads or writes per second. Since transaction-processing systems (e.g., debits/credits) use a read-modify-write sequence of disk accesses, we include that metric as well. Ideally during small transfers each disk in a group can act independently, either reading or writing independent information. In summary, supercomputer applications need a high data rate while transaction processing needs a high I/O rate.

For both the large and small transfer calculations we assume the minimum user request is a sector, that a sector is small relative to a track, and that there is enough work to keep every device busy. Figure 1.2 shows the ideal operation of large and small disk accesses in a RAID.

Figure 1.2 Large transfer vs. small transfer in a group of G disks.

The six performance metrics are then the number of reads, writes, and read-modify-writes per second for both large (grouped) and small (individual) transfers. Rather than give absolute numbers for each metric, we calculate efficiency: the number of events per second relative to the corresponding events per second for a single disk. (This is Boral's I/O bandwidth per gigabyte [Boral 83] scaled to gigabytes per disk.) In this paper we are after fundamental differences, so we use simple, deterministic throughput measures for our performance metric rather than latency.

Effective Performance Per Disk

The cost of disks can be a large portion of the cost of a database system, so the I/O performance per disk--factoring in the overhead of the check disks--suggests the cost/performance of a system. This is the bottom line for a RAID.

1.7 First Level RAID: Mirrored Disks

Mirrored disks are a traditional approach for improving the reliability of magnetic disks. This is the most expensive option since all disks are duplicated (G = 1 and C = 1), and every write to a data disk is also a write to a check disk. Tandem doubles the number of controllers for fault tolerance, allowing an optimized version of mirrored disks that lets reads occur in parallel. Table 1.2 shows the metrics for a Level 1 RAID assuming this optimization.

Table 1.2 Characteristics of Level 1 RAID. Here we assume that writes are not slowed by waiting for the second write to complete, because the slowdown for writing 2 disks is minor compared to the slowdown S for writing a whole group of 10 to 25 disks. Unlike a "pure" mirrored scheme with extra disks that is invisible to the software, we assume an optimized scheme with twice as many controllers allowing parallel reads to all disks, giving full disk bandwidth for large reads and allowing the reads of read-modify-writes to occur in parallel.

When individual accesses are distributed across multiple disks, average queueing, seek, and rotate delays may differ from the single disk case. Although bandwidth may be unchanged, it is distributed more evenly, reducing variance in queueing delay and, if the disk load is not too high, also reducing the expected queueing delay through parallelism [Livny 87]. When many arms seek to the same track and then rotate to the described sector, the average seek and rotate time will be larger than the average for a single disk, tending toward the worst case times. This effect should generally not more than double the average access time to a single sector while still getting many sectors in parallel. In the special case of mirrored disks with sufficient controllers, the choice between arms that can read any data sector will reduce the time for the average read seek by up to 45% [Bitton 88]. To allow for these factors but to retain our fundamental emphasis, we apply a slowdown factor, S, when there are more than two disks in a group.
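The read-seek advantage of mirroring mentioned above can be illustrated with a toy Monte Carlo model (not from the paper): it assumes the arms sit at independent, uniformly random cylinders and measures seek distance rather than seek time, so it only shows the direction of the effect, not Bitton's 45% figure.

```python
import random

def mean_seek_distance(num_arms, trials=200_000, seed=0):
    """Expected seek distance (fraction of a full stroke) when the nearest of
    `num_arms` independently, uniformly positioned arms services the read."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        target = rng.random()
        total += min(abs(target - rng.random()) for _ in range(num_arms))
    return total / trials

print(round(mean_seek_distance(1), 2))  # ~0.33 of a full stroke (the classic 1/3)
print(round(mean_seek_distance(2), 2))  # ~0.21: the nearer arm shortens the seek
```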