首页 Data and Spatial Weights in spdep

Data and Spatial Weights in spdep

举报
开通vip

Data and Spatial Weights in spdep Data and Spatial Weights in spdep Notes and Illustrations Luc Anselin University of Illinois, Urbana-Champaign http://sal.agecon.uiuc.edu July 11, 2003 Purpose The purpose of this set of notes is to illustrate how to bring data and spatial weights informa...

Data and Spatial Weights in spdep
Data and Spatial Weights in spdep Notes and Illustrations Luc Anselin University of Illinois, Urbana-Champaign http://sal.agecon.uiuc.edu July 11, 2003 Purpose The purpose of this set of notes is to illustrate how to bring data and spatial weights information from GeoDa into R, for use by the spdep package. In addition, several of the weights construction and manipulation functions in spdep are reviewed. Both built-in spdep functions as well as a smaller number of customized R functions will be used and illustrated. For details on GeoDa functionality, see the GeoDa User’s Guide and Tutorials at http://sal.agecon.uiuc.edu/stuff. Further details on the operations of spdep can be found in the original spdep documentation and in the “Introduction to Spatial Regression Analysis in R” tutorial on the SAL web site. The various tasks will be illustrated with a sample data set. It is assumed that you will try to replicate this with your own data and/or with another one of the sample data sets. To get started, make sure the spdep package is loaded, with library(spdep) Converting GeoDa data to a data frame Data in R are most usefully stored in a data frame. The helper function read.geoda uses the built-in read.csv function with some preset options to make sure it matches the format of the data exported by GeoDa. For example, load the BALTIM.SHP point data set into GeoDa and use Tools > Data Export > ASCII. Specify the input shape file and give a name for the text file that will contain the exported data (such as baltim.txt). You will have exported a file that looks like Figure 1. Figure 1. Contents of baltim.txt, exported by GeoDa 2 Two things are non-standard as far as the usual R function read.table goes. One is the extra line in the front (with number of observations and number of variables), the other is the use of a comma as data separator. You can deal with these by explicitly setting the “skip” option to 1 and using read.csv instead of the generic read.table. For example: > balt <- read.csv("baltim.txt",header=TRUE,skip=1) > summary(balt) STATION PRICE NROOM DWELL Min. : 1.0 Min. : 3.50 Min. : 3.000 Min. :0.0000 1st Qu.: 53.5 1st Qu.: 30.95 1st Qu.: 5.000 1st Qu.:0.0000 Median :106.0 Median : 40.00 Median : 5.000 Median :1.0000 Mean :106.0 Mean : 44.31 Mean : 5.199 Mean :0.5355 3rd Qu.:158.5 3rd Qu.: 53.75 3rd Qu.: 6.000 3rd Qu.:1.0000 Max. :211.0 Max. :165.00 Max. :10.000 Max. :1.0000 etc… You can also explicitly specify a record identifier by means of the row.names option. For the BALTIM.TXT data, this would be STATION: >balt3 <- read.csv("baltim.txt",header=TRUE,skip=1,row.names="STATION") As a result of this specification, the STATION column is no longer part of the data frame as a separate variable, it is now used explicitly as the row indicator. > summary(balt3) PRICE NROOM DWELL NBATH Min. : 3.50 Min. : 3.000 Min. :0.0000 Min. :1.000 1st Qu.: 30.95 1st Qu.: 5.000 1st Qu.:0.0000 1st Qu.:1.000 Median : 40.00 Median : 5.000 Median :1.0000 Median :1.500 Mean : 44.31 Mean : 5.199 Mean :0.5355 Mean :1.573 3rd Qu.: 53.75 3rd Qu.: 6.000 3rd Qu.:1.0000 3rd Qu.:2.000 Max. :165.00 Max. :10.000 Max. :1.0000 Max. :5.000 etc… These options are incorporated in the read.geoda helper function. You “compile” this function with the R source command. Make sure you copy the file read.geoda.R to your working directory. Then “source” it as > source("read.geoda.R") The read.geoda function (drop the R file extension) is now added to your workspace and can be invoked directly. For example, to create a data frame without a row identifier: > balt00 <- read.geoda("baltim.txt") > summary(balt00) STATION PRICE NROOM DWELL Min. : 1.0 Min. : 3.50 Min. : 3.000 Min. :0.0000 1st Qu.: 53.5 1st Qu.: 30.95 1st Qu.: 5.000 1st Qu.:0.0000 Median :106.0 Median : 40.00 Median : 5.000 Median :1.0000 Mean :106.0 Mean : 44.31 Mean : 5.199 Mean :0.5355 3rd Qu.:158.5 3rd Qu.: 53.75 3rd Qu.: 6.000 3rd Qu.:1.0000 3 Max. :211.0 Max. :165.00 Max. :10.000 Max. :1.0000 etc… Specifying STATION as the row identifier: > balt01 <- read.geoda("baltim.txt","STATION") > summary(balt01) PRICE NROOM DWELL NBATH Min. : 3.50 Min. : 3.000 Min. :0.0000 Min. :1.000 1st Qu.: 30.95 1st Qu.: 5.000 1st Qu.:0.0000 1st Qu.:1.000 Median : 40.00 Median : 5.000 Median :1.0000 Median :1.500 Mean : 44.31 Mean : 5.199 Mean :0.5355 Mean :1.573 3rd Qu.: 53.75 3rd Qu.: 6.000 3rd Qu.:1.0000 3rd Qu.:2.000 Max. :165.00 Max. :10.000 Max. :1.0000 Max. :5.000 etc… In order to be able to refer to the variables in a data frame directly by their name, instead of using framename$variablename, you can attach the data frame. For example, if you use the mean command with X, you get an error message: > mean(X) Error in mean(X) : Object "X" not found After attaching the data frame, this command works fine: > attach(balt01) > mean(X) [1] 911.646 After you are done using a given data frame, it is good practice to detach it: > detach(balt01) This frees up work space and avoids possible naming conflicts. Converting GAL weights to neighbor objects – Old Format, Sequence Numbers Contiguity information on spatial objects is contained in a neighbor object (nb) in spdep. This contiguity information can be constructed for polygon shape files using the Create Weights function in GeoDa. The resulting GAL file (text file with file extension .gal) can be read into spdep by means of the read.gal function. However, care must be taken with the newer format used by GeoDa. Before delving into that, consider the “old” format case, where the first line in the file gives the number of observations. The remainder of the file gives, for each observation, the observation ID, the number of neighbors and then on a second line the IDs for the neighbors. For now, we will use a simple sequence number as an ID. For example, using GeoDa to create a set of Thiessen polygons for the BALTIM point data set (see the GeoDa User’s Guide, pp. 26-28), you can create a “rook” type contiguity weights file. Leave the entry for the ID variable untouched, as in Figure 2. This will use the sequence numbers as identifiers. 4 Figure 2. Creating a weights file without explicit ID variable. This is usually dangerous, unless you can guarantee that the record sequence in the internal storage in GeoDa will be the same as in the external data set. As long as you create the data frame using the exported file from GeoDa, based on the same shape file, things should be OK, but it is “living dangerously.” The explicit designation of an ID variable is much preferred. After creating the GAL file using this procedure, the format of the file is as in Figure 3. The text file has a header line with the number of observations, followed by the contiguity information, going from observation 1 to 211, in sequence. Figure 3. GAL weights file without ID variable. The standard read.gal function in spdep assumes that the contiguity information is given in sequence. This has been generalized in the most recent version of read.gal 5 included in the helper files, which is considered in the next section. To have access to the full range of options, it is best to copy the file read.gal3.R to your working directory and source it into R: > source("read.gal3.R") Note that this will replace the original read.gal function included in the spdep package with a more recent (and slightly more flexible) version. It will also add some new functions, such as read.gal2 and read.gal3. The read.gal function has two parameters: the file name (include the GAL file extension), and an option to specify observation ids. The spdep manual refers to these as region ids, which is the matching attribute of the nb object (“region.id”). When no region ID is specified, sequence numbers are assigned. Note that the region ID must be a character vector that is already in the work space (through attaching a data frame, for example). Also, this is not the same as a general identifier for the observations and their neighbors, since its value is not extracted from the GAL file, but taken from the data frame. In the neighbor object, it is added as an additional attribute, in addition to the contiguity information. It is used in the summary function, but is not part of the contiguity information. In contrast to the original read.gal function in spdep version 0.1-10, the newer function can read GAL files with both the old header (only the number of observations) and the new header (four entries). However, it does require that the contiguity entries are in the same sequence as the observations in the data frame. Using the baltpolold.GAL file as the input, a neighbor object is constructed as > baltnb1 <- read.gal("baltpolold.GAL") and its characteristics can be obtained with the summary function. Note how the $region.id attribute is chr[1:221], a list of sequential integer characters. The summary characteristics describe the connectedness structure of the spatial weights. The “link number distribution” contains the same information as the histogram of weights characteristics in GeoDa (pp. 86-87 in the GeoDa User’s Guide). > summary(baltnb1) Connectivity of baltnb1 with the following attributes: List of 5 $ class : chr "nb" $ region.id: chr [1:211] "1" "2" "3" "4" ... $ gal : logi TRUE $ call : logi TRUE $ sym : logi TRUE NULL Number of regions: 211 Number of nonzero links: 1198 Percentage nonzero weights: 2.690865 Average number of links: 5.677725 Link number distribution: 6 3 4 5 6 7 8 9 11 7 21 66 75 29 9 3 1 7 least connected regions: 3 6 48 97 144 157 199 with 3 links 1 most connected region: 208 with 11 links In order to explicitly specify a region.id variable, you need to make things a bit more interesting, since STATION is a simple sequential numbering. In GeoDa, you can use the table calculation functions to create a new variable, NEWID, for example as 1000 + STATION. Save this as a new shape file, say baltim2.shp. Now turn baltim2.shp into a new data frame, first exporting the data to baltim2.txt, then reading it into a table: > newbalt <- read.geoda("baltim2.txt") > summary(newbalt) STATION PRICE NROOM DWELL Min. : 1.0 Min. : 3.50 Min. : 3.000 Min. :0.0000 1st Qu.: 53.5 1st Qu.: 30.95 1st Qu.: 5.000 1st Qu.:0.0000 Median :106.0 Median : 40.00 Median : 5.000 Median :1.0000 Mean :106.0 Mean : 44.31 Mean : 5.199 Mean :0.5355 3rd Qu.:158.5 3rd Qu.: 53.75 3rd Qu.: 6.000 3rd Qu.:1.0000 Max. :211.0 Max. :165.00 Max. :10.000 Max. :1.0000 … Y NEWID Min. :505.5 Min. :1001 1st Qu.:528.8 1st Qu.:1054 Median :544.5 Median :1106 Mean :544.2 Mean :1106 3rd Qu.:559.0 3rd Qu.:1159 Max. :581.0 Max. :1211 Note the NEWID variable. Make sure to detach the old data set and attach the newbalt data frame. Now, specify NEWID as the id variable (second parameter) in the read.gal command and follow this by the summary command to obtain the weights characteristics. > attach(newbalt) > baltnb2 <- read.gal("baltpolold.GAL",NEWID) > summary(baltnb2) Connectivity of baltnb2 with the following attributes: List of 5 $ class : chr "nb" $ region.id: int [1:211] 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 ... $ gal : logi TRUE $ call : logi TRUE $ sym : logi TRUE NULL Number of regions: 211 Number of nonzero links: 1198 Percentage nonzero weights: 2.690865 Average number of links: 5.677725 Link number distribution: 7 3 4 5 6 7 8 9 11 7 21 66 75 29 9 3 1 7 least connected regions: 1003 1006 1048 1097 1144 1157 1199 with 3 links 1 most connected region: 1208 with 11 links Note how the region.id is now a list of integer values (not characters) and how the least and most connected regions are identified by their ID value (not by the sequence number). In real applications, these are often more meaningful than simple sequence numbers. Converting GAL weights to neighbor objects – ID Variable The new function read.gal2 is designed to handle GAL format files with contiguity information expressed using a general ID variable. For example, the rook and queen weights for US counties available from the CSISS spatial weights repository use the old header format with ID variables (see http://sal.agecon.uiuc.edu/weights/index.html). Note that GeoDa has implemented a slightly more complex header line which can also be parsed correctly by read.gal2. To illustrate this function, consider the GAL file shown in Figure 4. This was obtained using the newly created point shape file baltim2.shp. First, a Thiessen polygon file was created that contains the NEWID variable, say baltpoly2. Then baltpoly2 was used to create a rook contiguity weights file, specifying NEWID as the ID variable, as shown in Figure 5. The file in Figure 4 resulted after the original GAL file was edited by hand to reflect the old header line format. The new header line contained the values 0 211 baltpoly0 NEWID, as shown in Figure 6. All this editing can be carried out in a standard text editor, such as Notepad or Wordpad. Figure 4. Old-style GAL contiguity file using ID variable. 8 Figure 5. Creating Rook contiguity weights and specifying an ID variable. Figure 6. Contiguity file with new format header line. You invoke read.gal2 essentially in the same way as read.gal, with the important distinction that only the GAL file can be used as input (only one parameter is allowed). For example, using baltpoly0.GAL as input, followed by the summary command, yields: > baltnb22 <- read.gal2("baltpoly0.GAL") > summary(baltnb22) Connectivity of baltnb22 with the following attributes: List of 5 9 $ class : chr "nb" $ region.id: chr [1:211] "1001" "1002" "1003" "1004" ... $ gal : logi TRUE $ call : logi TRUE $ sym : logi TRUE NULL Number of regions: 211 Number of nonzero links: 1198 Percentage nonzero weights: 2.690865 Average number of links: 5.677725 Link number distribution: 3 4 5 6 7 8 9 11 7 21 66 75 29 9 3 1 7 least connected regions: 1003 1006 1048 1097 1144 1157 1199 with 3 links 1 most connected region: 1208 with 11 links Note how this is identical to the previous result. There is one important restriction to the read.gal2 function as it is currently implemented. The order of the contiguity information must be the same as the order of the records in the data frame. Again, if both are generated from the same shape file using GeoDa, this should be fine. However, it is not a good assumption to use in general and it will create problems when there is no one to one match between the order of observations in the data frame that holds the ID variable and the order in which the contiguity information is entered in the GAL file. In read.gal2, the ids are constructed from the contents of the GAL file, but they are not checked against the values of the ID variable in the data frame. When both the data frame and the GAL file are created by GeoDa, this should be fine, but in other circumstances, this may create errors. Converting GAL files to neighbor objects – New Format The new format GAL files created by GeoDa contain a slightly more elaborate header line, as in Figure 6, with four items on the first line: 0, a place holder; 211, the number of observations; baltpoly2, the name of the shape file from which the contiguity information was derived; and NEWID, the ID variable. The third function contained in the read.gal3.R helper file is read.gal3, which is the safest of the three. In read.gal3, the ID variable name is read from the GAL file header line and matched to a variable in the currently attached data frame. The values in the GAL file are matched to the observation sequence in the data frame by checking the ID variable directly, and any mismatches are detected as errors. As long as the ID variable is in fact present in the current work space, this will guarantee a correct result, irrespective of the order in which the contiguity information is stored in the GAL file. In addition, “islands” are dealt with correctly and don’t generate error messages. If the ID variable is not present in the current work space (or if there is a typo; note that variable names are case sensitive) an error message results: > test <- read.gal3("baltpolybad.GAL") 10 Error in get(x, envir, mode, inherits) : variable "NEWID1" was not found To cause the error, a new GAL file was created (baltpolybad.GAL) by changing the ID variable name to NEWID1. The current data frame does not contain this variable, hence the error message. The read.gal3 function is invoked in the same manner as read.gal2 and only the file name for the GAL file must be specified. No other parameters can be included. For example, use the baltpoly2.GAL file from Figure 6, and invoke read.gal3 followed by summary. This yields, as before: > baltnb2 <- read.gal3("baltpoly2.GAL") > summary(baltnb2) Connectivity of baltnb2 with the following attributes: List of 5 $ class : chr "nb" $ region.id: chr [1:211] "1001" "1002" "1003" "1004" ... $ gal : logi TRUE $ call : logi TRUE $ sym : logi TRUE NULL Number of regions: 211 Number of nonzero links: 1198 Percentage nonzero weights: 2.690865 Average number of links: 5.677725 Link number distribution: 3 4 5 6 7 8 9 11 7 21 66 75 29 9 3 1 7 least connected regions: 1003 1006 1048 1097 1144 1157 1199 with 3 links 1 most connected region: 1208 with 11 links Characteristics of spatial weights. The characteristics of spatial weights are obtained with the summary function applied to a neighbor object, as illustrated in the examples given above. The same result is also obtained by applying the summary.nb function, as in > summary.nb(baltnb2) This yields a brief description of the number of observations, the “density” of the spatial weights matrix (percentage nonzero weights), the average number of neighbors (links) per observation and the frequency distribution of the neighbor count. It also lists the ids of the least and most connected observations, as well as any islands, if present. Islands must be dealt with, either by manually editing the neighbor list to “connect” them to some other observation or by eliminating them from the analysis. This can be accomplished by means of the edit.nb command (see the spdep package manual, p. 20, for technical details). 11 It is important to note that islands never count in an analysis of spatial autocorrelation, since only connected observations are included in the computation of the statistics. The plot command can also be used to obtain a descriptive overview of the connectedness structure in a neighbor object. It can be applied generically to a neighbor object, as plot, or explicitly as plot.nb. The name of the neighbor object and a matrix of X, Y coordinates must be supplied (note the use of cbind to construct the matrix from the data frame variables), as well as any of the other usual R graphics parameters. For example, applied to baltnb2, this yields the plot shown in Figure 7 (with the titles added for extra effect): > plot.nb(baltnb2,cbind(X,Y),col="red") > title(main="First Order Contiguity from Thiessen Polygons") > title(xlab="Baltimore Point Data") Figure 7. Plot of connectedness structure. 12 Spatial weights objects in spdep The neighbor list (nb) is only one of several classes that handle spatial contiguity information in spdep. For example, spatial weights objec
本文档为【Data and Spatial Weights in spdep】,请使用软件OFFICE或WPS软件打开。作品中的文字与图均可以修改和编辑, 图片更改请在作品中右键图片并更换,文字修改请直接点击文字进行修改,也可以新增和删除文档中的内容。
该文档来自用户分享,如有侵权行为请发邮件ishare@vip.sina.com联系网站客服,我们会及时删除。
[版权声明] 本站所有资料为用户分享产生,若发现您的权利被侵害,请联系客服邮件isharekefu@iask.cn,我们尽快处理。
本作品所展示的图片、画像、字体、音乐的版权可能需版权方额外授权,请谨慎使用。
网站提供的党政主题相关内容(国旗、国徽、党徽..)目的在于配合国家政策宣传,仅限个人学习分享使用,禁止用于任何广告和商用目的。
下载需要: 免费 已有0 人下载
最新资料
资料动态
专题动态
is_987709
暂无简介~
格式:pdf
大小:186KB
软件:PDF阅读器
页数:19
分类:
上传时间:2013-04-21
浏览量:59