The R+-tree

The R+-tree Carnegie Mellon University Research Showcase Computer Science Department School of Computer Science 9-1-1987 The R+-Tree: A Dynamic Index for Multi- Dimensional Objects Timos Sellis University of Maryland - College Park Nick Roussopoulos University of Mar...

Carnegie Mellon University Research Showcase Computer Science Department School of Computer Science 9-1-1987 The R+-Tree: A Dynamic Index for Multi- Dimensional Objects Timos Sellis University of Maryland - College Park Nick Roussopoulos University of Maryland - College Park Christos Faloutsos Carnegie Mellon University, christos@cs.cmu.edu Follow this and additional works at: http://repository.cmu.edu/compsci This Conference Proceeding is brought to you for free and open access by the School of Computer Science at Research Showcase. It has been accepted for inclusion in Computer Science Department by an authorized administrator of Research Showcase. For more information, please contact research- showcase@andrew.cmu.edu. Recommended Citation Sellis, Timos; Roussopoulos, Nick; and Faloutsos, Christos, "The R+-Tree: A Dynamic Index for Multi-Dimensional Objects" (1987). Computer Science Department. Paper 566. http://repository.cmu.edu/compsci/566 THE R+R+-TREE: A DYNAMIC INDEX FOR MULTI-DIMENSIONAL OBJECTS† Timos Sellis1,2, Nick Roussopoulos1,2 and Christos Faloutsos2 Department of Computer Science University of Maryland College Park, MD 20742 Abstract The problem of indexing multidimensional objects is considered. First, a classification of existing methods is given along with a discussion of the major issues involved in multidimensional data indexing. Second, a variation to Guttman’s R-trees (R+-trees) that avoids overlapping rectangles in intermediate nodes of the tree is introduced. Algorithms for searching, updating, ini- tial packing and reorganization of the structure are dis- cussed in detail. Finally, we provide analytical results indicating that R+-trees achieve up to 50% savings in disk accesses compared to an R-tree when searching files of thousands of rectangles. 1 Also with University of Maryland Systems Research Center. 2 Also with University of Maryland Institute for Advanced Computer Studies (UMIACS). † This research was sponsored partialy by the National Science Foun- dation under Grant CDR-85-00108. 1. Introduction It has been recognized in the past that existing Data- base Management Systems (DBMSs) do not handle efficiently multi-dimensional data such as boxes, polygons, or even points in a multi-dimensional space. Multi-dimensional data arise in many applications, to name the most important: (1) Cartography. Maps could be stored and searched electronically, answering efficiently geometric queries [3] [19]. (2) Computer-Aided Design (CAD). For example, VLSI design systems need to store many thousands of rectangles [15] [9], representing electronic gates and higher level elements. (3) Computer vision and robotics. (4) Rule indexing in expert database systems [22]. In this proposal rules are stored as geometric entities in some multi-dimensional space defined over the database. Then, the problem of searching for applicable rules is reduced to a geometric intersection problem. Since database management systems can be used to store one-dimensional data, like integer or real numbers and strings, considerable interest has been developed in using DBMSs to store multi-dimensional data as well. In that sense the DBMS can be the single means for storing and accessing any kind of information required by applications more complex than traditional business applications. However, the underlying structures, data models and query languages are not sufficient for the manipulation of more complex data. The problem of extending current data models and languages has been considered by various people in the past [2], [21] [9] [19]. In this paper we focus on the problem of deriving efficient access methods for multi-dimensional objects. The main operations that have been addressed in the past are: Point Queries: Given a point in the space, find all objects that contain it Region Queries: Given a region (query win- dow), find all objects that intersect it Of course these operations can be augmented with addi- tional constraints on simple one-dimensional (scalar) data. In addition, operations like insertions, deletions and modifications of objects should be supported in a dynamic environment. The purpose of this paper is to describe a new struc- ture, the R+-tree. Section 2 surveys existing indexing methods and classifies them according to some criteria. Then, in sections 3 and 4 we describe R+-trees and the algorithms for searching, updating and packing the structure. Section 5 presents some preliminary analyti- cal results on the searching performance of the R+-tree, especially as it compares to the corresponding perfor- mance of R-trees [10]. Finally, we conclude in Section 6 by summarizing our contributions and giving hints for future research in the area of multi-dimensional data indexing structures. 2. Survey In this section we classify and briefly discuss known methods for handling multi-dimensional objects. Our main concern is the storage and retrieval of rectangles in secondary store (disk). Handling more complex objects, such as circles, polygons etc., can be reduced to handling rectangles, by finding the minimum bounding rectangle (MBR) of the given object. In our discussion, we shall first examine methods for handling multi- dimensional points, because these suggest many useful ideas applicable to rectangles as well. 2.1. Methods for multi-dimensional points The most common case of multi-dimensional data that has been studied in the past is the case of points. The main idea is to divide the whole space into disjoint sub-regions, usually in such a way that each sub-region contains no more than C points. C is usually 1 if the data is stored in core, or it is the capacity of a disk page, that is the number of data records the page can hold. Insertions of new points may result in further parti- tioning of a region, known as a split. The split is per- formed by introducing one (or more) hyperplanes that partition a region further, into disjoint sub-regions. The following attributes of the split help to classify the known methods: Position The position of the splitting hyperplane is pre- determined, e.g., it cuts the region in half exactly, as the grid file does [13]. We shall call these methods fixed. The opposite is to let the data points deter- mine the position of the hyperplane, as, e.g., the k-d trees [1] or the K-D-B-trees [17] do. We shall call these methods adaptable. Nievergelt et al. [13] made the same distinction, using different terminol- ogy: what we call "fixed" methods are those methods that organize the embedding space, from which the data is drawn, while they call the "adaptable" methods as methods that organize the data to be stored. Dimensionality the split is done with only one hyperplane (1-d cut), as in the k-d trees. The opposite is to split in all k dimensions, with k hyperplanes (k-d cut), as the quad-trees [7] and oct-trees do. Locality The splitting hyperplane splits not only the affected region, but all the regions in this direction, as well, like the grid file does. We shall call these methods grid methods. The opposite is to restrict the split- ting hyperplane to extend solely inside the region to be split. These methods will be referred to as brickwall methods. The brickwall methods usually do a hierarchical decomposition of the space, requir- ing a tree structure. The grid methods use a multi- dimensional array. The usefulness of the above classification is twofold: For one, it creates a general framework that puts all the known methods "on the map". The second reason is that it allows the design of new methods, by choosing the position, dimensionality and locality of the split, which might be suitable for a given application. Table 2.1 illustrates some of the most well-known methods and their attributes according to the above classification. Notice that methods based on binary trees or quad- trees cannot be easily extended to work in secondary storage based systems. The reason is that, since a disk page can hold of the order of 50 pointers, trees with nodes of large fanout are more appropriate; trees with two- or four-way nodes usually result in many (expen- sive) page faults. 2.2. Methods for rectangles Here we present a classification and brief discussion of methods for handling rectangles. The main classes of methods are the following: (1) Methods that transform the rectangles into points in a space of higher dimensionality [11]. For example, a 2-d rectangle (with sides parallel to the axes) is characterized by four coordinates, and thus it can be considered as a point in a 4-d ulululululululululululululululululululululululululululululululululululululululululululululululul Method Position Dimensions Localityulululululululululululululululululululululululululululululululululululululululululululululululul ulululululululululululululululululululululululululululululululululululululululululululululululul point quad-tree adaptable k-d brickwall ulululululululululululululululululululululululululululululululululululululululululululululululul k-d tree adaptable 1-d brickwall ulululululululululululululululululululululululululululululululululululululululululululululululul grid file fixed 1-d grid ulululululululululululululululululululululululululululululululululululululululululululululululul K-D-B-tree adaptable 1-d brickwall ululululululululululululululululululululululululululululululululululululululululululululululululc c c c c c c c c c c c c c c c c c c c c c c c c c c c c c c c c c c c c c c c c c c c c c c c Table 2.1: Illustration of the classification. space. Therefore, one of the previously men- tioned methods for storing points can be chosen. Lauther [12] and Rosenberg [18] used k-d trees. Hinrichs and Nievergelt [11] suggested using the grid file, after a rotation of the axes. The rotation is necessary, in order to avoid non- uniform distribution of points, that would lead the grid file to poor performance. (2) Methods that use space filling curves, to map a k-d space onto a 1-d space. Such a method, suitable for a paged environment, has been sug- gested, among others, by Orenstein [14]. The idea is to transform k-dimensional objects to line segments, using the so-called z-transform. This transformation tries to preserve the dis- tance, that is, points that are close in the k-d space are likely to be close in the 1-d transformed space. Improved distance- preserving transformations have been proposed [4], which achieve better clustering of nearby points, by using Gray codes. The original z- transform induces an ordering of the k-d points, which is the very same one that a (k- dimensional) quad-tree uses to scan pixels in a k-dimensional space. The transformation of a rectangle is a set of line segments, each corresponding to a quadrant that the rectangle completely covers. (3) Methods that divide the original space into appropriate sub-regions (overlapping or dis- joint). If the regions are disjoint, any of the methods for points that we mentioned before, can be used to decompose the space. The only complication to be handled is that a rectangle may intersect a splitting hyperplane. One solu- tion is to cut the offending rectangle in two pieces and tag the pieces, to indicate that they belong to the same rectangle. Recently, Gun- ther [8] suggested a relevant scheme for gen- eral polygon data, either convex or concave. He suggests that the splitting hyperplanes can be of arbitrary orientation (not necessarily parallel to the axes). The first who proposed the use of overlapping sub-regions was Guttman with his R-Trees [10]. R-trees are an extension of B- trees for multi-dimensional objects that are either points or regions. Like B-trees, they are balanced (all leaf nodes appear on the same level, which is a desirable feature) and guaran- tee that the space utilization is at least 50%. However, if R-Trees are built using the dynamic insertion algorithms, the structure may provide excessive space overlap and "dead-space" in the nodes that result in bad performance. A pack- ing technique proposed in [19] alleviates this problem for relatively static databases of points. However, for update-intensive spatial databases, packing cannot be applied on every single inser- tion. In such an environment, the structure to be described in the next section (R+-trees) avoids the performance degradation caused by the overlapping regions. Space and time comparison of the above approaches is an interesting problem, which we are currently study- ing. As a first step, in section 5 we provide some analysis for the R- and R+- tree structures. 3. RR+-Trees In this section we introduce the R+-tree and discuss the algorithms for searching and updating the data struc- ture. 3.1. Description As mentioned above, R-trees are a direct extension of B-trees in k-dimensions. The data structure is a height-balanced tree which consists of intermediate and leaf nodes. Data objects are stored in leaf nodes and intermediate nodes are built by grouping rectangles at the lower level. Each intermediate node is associated with some rectangle which completely encloses all rec- tangles that correspond to lower level nodes. Figure 3.1 shows an example set of data rectangles and Figure 3.2 the corresponding R-tree built on these rectangles (assuming a branching factor of 4). Considering the performance of R-tree searching, the concepts of coverage and overlap [19] are impor- tant. Coverage of a level of an R-tree is defined as the total area of all the rectangles associated with the nodes A B C GF E D H I J K L M N Figure 3.1: Some rectangles organized into an R-tree NMLD CBA KJIHGFE Figure 3.2: R-tree for the rectangles of Figure 3.1 of that level. Overlap of a level of an R-tree is defined as the total area contained within two or more nodes. Obviously, efficient R-tree searching demands that both overlap and coverage be minimized. Minimal coverage reduces the amount of dead space (i.e. empty space) covered by the nodes. Minimal overlap is even more critical than minimal coverage. For a search window falling in the area of k overlapping nodes at level h −l, with h being the height of the tree, in the worst case, k paths to the leaf nodes have to be followed (i.e. one from each of the overlapping nodes), therefore slowing down the search from l to l k page accesses. For exam- ple, for the search window W shown in Figure 3.3, both subtrees rooted at nodes A and B must be searched although only the latter will return a qualifying rectan- gle. The cost of such an operation would be one page access for the root and two additional page accesses to check the rectangles stored in A and B. Clearly, since it is very hard to control the overlap during the dynamic A B C GF E D H I J K L M N W Figure 3.3: An example of a "bad" search window splits of R-trees, efficient search degrades and it may even degenerate the search from logarithmic to linear. It has been shown, that zero overlap and coverage is only achievable for data points that are known in advance and, that using a packing technique for R-trees, search is dramatically improved [19]. In the same paper it is shown that zero overlap is not attainable for region data objects. However, if we allow partitions to split rectangles then zero overlap among intermediate node entries can be achieved. This is the main idea behind the R+-tree structure. Figure 3.4 indicates a different grouping of the rectangles of Figure 3.1 and Figure 3.5 shows the corresponding R+-tree. Notice that rectangle G has been split into two sub- rectangles the first contained in node A and the second in P. That is, whenever a data rectangle at a lower level overlaps with another rectangle, we decompose it into a collection of non-overlapping sub-rectangles whose union makes up the original rectangle. The term "data A B C GF E D H I J K L M N P Figure 3.4: The rectangles of Figure 3.1 grouped to form an R+-tree NMLD CBA GFE P KJI HG Figure 3.5: The R+-tree built for Figure 3.4 rectangle" denotes a rectangle that is the minimum bounding rectangle of an object (as opposed to rectan- gles that correspond to intermediate nodes of the tree). Avoiding overlap is achieved at the expense of space which increases the height of the tree. However, because the space increase is logarithmically distributed over the tree, the indirect increment of the height is more than offset by the benefit of searching multiple shorter paths. For example, if we consider again the cost for a search operation based on the window W of Figure 3.3 we notice that only the root of the tree and node P need be accessed, thus saving us one out of three page accesses. R+-trees can be thought as an extension of K-D-B- trees to cover non-zero area objects (i.e. not only points but rectangles as well). An improvement over the K- D-B-trees is the reduced coverage; the nodes of a given level do not necessarily cover the whole initial space. Moreover, compared to R-trees, R+-trees exhibit very good searching performance, especially for point queries, at the expense of some extra space. See section 5 for analytical results supporting the above discussion. After this brief discussion to motivate the introduc- tion of R+-trees we move now to formally describe the structure. A leaf node is of the form (oid, RECT ) where oid is an object identifier and is used to refer to an object in the database. RECT is used to describe the bounds of data objects. For example, in a 2- dimensional space, an entry RECT will be of the form (xlow ,xhigh ,ylow ,yhigh) which represents the coordinates of the lower-left and upper-right corner of the rectangle. An intermediate node is of the form (p, RECT ) where p is a pointer to a lower level node of the tree and RECT is a representation of the rectangle that encloses. The R+-tree has the following properties: (1) For each entry (p, RECT ) in an intermediate node, the subtree rooted at the node pointed to by p contains a rectangle R if and only if R is covered by RECT. The only exception is when R is a rectangle at a leaf node; in that case R must just overlap with RECT. (2) For any two entries (p 1,RECT1) and (p 2,RECT2) of an intermediate node, the over- lap between RECT1 and RECT2 is zero. (3) The root has at least two children unless it is a leaf. (4) All leaves are at the same level. Let us assume that M is the maximum number of entries that can fit in a leaf or intermediate node. Notice that one property satisfied by an R-tree but not an R+-tree is that in the former every leaf node contains between M/2 and M entries and each intermediate node contains between M/2 and M nodes unless it is the root. K-D- B-trees do not satisfy this property either. However, Robinson showed with his experimental results that storage utilization in K-D-B-trees remains in acceptable levels (60%, which is only 10% below the average B- tree utilization). Although, R-trees achieve better space utilization at the expense of search performance we believe that 10% degradation is a minimal price to pay for the the search improvement obtained in R+-trees (see section 5). Another interesting comment here is due to the fact that populating the nodes as much as possible will result to a decrease in the height of the tree at the expense of more costly updates. Therefore another parameter of the problem should be the initial packing algorithm used to populate an R+-tree and its reorganization techniques. In the following we discuss the algorithms for searching and updating an R+-tree. Section 4 presents the packing algorithm. 3.2. Searching The searching algorithm is similar to the one used in R-trees. The idea is to first decompose the search space into disjoint sub-regions and for each of those descend the tree until the actual data objects are found in the leaves. Notice

                    本文档为【The R+-tree】，请使用软件OFFICE或WPS软件打开。作品中的文字与图均可以修改和编辑，
                    图片更改请在作品中右键图片并更换，文字修改请直接点击文字进行修改，也可以新增和删除文档中的内容。 
 该文档来自用户分享，如有侵权行为请发邮件ishare@vip.sina.com联系网站客服，我们会及时删除。

                    [版权声明] 本站所有资料为用户分享产生，若发现您的权利被侵害，请联系客服邮件isharekefu@iask.cn，我们尽快处理。

                    本作品所展示的图片、画像、字体、音乐的版权可能需版权方额外授权，请谨慎使用。

                    网站提供的党政主题相关内容(国旗、国徽、党徽..)目的在于配合国家政策宣传，仅限个人学习分享使用，禁止用于任何广告和商用目的。
                

下载需要：免费已有0 人下载

立即下载

The R+-tree

你可能还喜欢