Data and Spatial Weights in spdep
Notes and Illustrations
Luc Anselin
University of Illinois, Urbana-Champaign
http://sal.agecon.uiuc.edu
July 11, 2003
Purpose
The purpose of this set of notes is to illustrate how to bring data and spatial weights
information from GeoDa into R, for use by the spdep package. In addition, several of the
weights construction and manipulation functions in spdep are reviewed. Both built-in
spdep functions as well as a smaller number of customized R functions will be used and
illustrated. For details on GeoDa functionality, see the GeoDa User’s Guide and Tutorials
at http://sal.agecon.uiuc.edu/stuff. Further details on the operations of spdep can be found
in the original spdep documentation and in the “Introduction to Spatial Regression
Analysis in R” tutorial on the SAL web site. The various tasks will be illustrated with a
sample data set. It is assumed that you will try to replicate this with your own data and/or
with another one of the sample data sets. To get started, make sure the spdep package is
loaded, with library(spdep)
Converting GeoDa data to a data frame
Data in R are most usefully stored in a data frame. The helper function read.geoda uses
the built-in read.csv function with some preset options to make sure it matches the
format of the data exported by GeoDa. For example, load the BALTIM.SHP point data set
into GeoDa and use Tools > Data Export > ASCII. Specify the input shape file and
give a name for the text file that will contain the exported data (such as baltim.txt).
You will have exported a file that looks like Figure 1.
Figure 1. Contents of baltim.txt, exported by GeoDa
2
Two things are non-standard as far as the usual R function read.table goes. One is the
extra line in the front (with number of observations and number of variables), the other is
the use of a comma as data separator. You can deal with these by explicitly setting the
“skip” option to 1 and using read.csv instead of the generic read.table. For example:
> balt <- read.csv("baltim.txt",header=TRUE,skip=1)
> summary(balt)
STATION PRICE NROOM DWELL
Min. : 1.0 Min. : 3.50 Min. : 3.000 Min. :0.0000
1st Qu.: 53.5 1st Qu.: 30.95 1st Qu.: 5.000 1st Qu.:0.0000
Median :106.0 Median : 40.00 Median : 5.000 Median :1.0000
Mean :106.0 Mean : 44.31 Mean : 5.199 Mean :0.5355
3rd Qu.:158.5 3rd Qu.: 53.75 3rd Qu.: 6.000 3rd Qu.:1.0000
Max. :211.0 Max. :165.00 Max. :10.000 Max. :1.0000
etc…
You can also explicitly specify a record identifier by means of the row.names option. For
the BALTIM.TXT data, this would be STATION:
>balt3 <- read.csv("baltim.txt",header=TRUE,skip=1,row.names="STATION")
As a result of this specification, the STATION column is no longer part of the data frame
as a separate variable, it is now used explicitly as the row indicator.
> summary(balt3)
PRICE NROOM DWELL NBATH
Min. : 3.50 Min. : 3.000 Min. :0.0000 Min. :1.000
1st Qu.: 30.95 1st Qu.: 5.000 1st Qu.:0.0000 1st Qu.:1.000
Median : 40.00 Median : 5.000 Median :1.0000 Median :1.500
Mean : 44.31 Mean : 5.199 Mean :0.5355 Mean :1.573
3rd Qu.: 53.75 3rd Qu.: 6.000 3rd Qu.:1.0000 3rd Qu.:2.000
Max. :165.00 Max. :10.000 Max. :1.0000 Max. :5.000
etc…
These options are incorporated in the read.geoda helper function. You “compile” this
function with the R source command. Make sure you copy the file read.geoda.R to
your working directory. Then “source” it as
> source("read.geoda.R")
The read.geoda function (drop the R file extension) is now added to your workspace
and can be invoked directly. For example, to create a data frame without a row identifier:
> balt00 <- read.geoda("baltim.txt")
> summary(balt00)
STATION PRICE NROOM DWELL
Min. : 1.0 Min. : 3.50 Min. : 3.000 Min. :0.0000
1st Qu.: 53.5 1st Qu.: 30.95 1st Qu.: 5.000 1st Qu.:0.0000
Median :106.0 Median : 40.00 Median : 5.000 Median :1.0000
Mean :106.0 Mean : 44.31 Mean : 5.199 Mean :0.5355
3rd Qu.:158.5 3rd Qu.: 53.75 3rd Qu.: 6.000 3rd Qu.:1.0000
3
Max. :211.0 Max. :165.00 Max. :10.000 Max. :1.0000
etc…
Specifying STATION as the row identifier:
> balt01 <- read.geoda("baltim.txt","STATION")
> summary(balt01)
PRICE NROOM DWELL NBATH
Min. : 3.50 Min. : 3.000 Min. :0.0000 Min. :1.000
1st Qu.: 30.95 1st Qu.: 5.000 1st Qu.:0.0000 1st Qu.:1.000
Median : 40.00 Median : 5.000 Median :1.0000 Median :1.500
Mean : 44.31 Mean : 5.199 Mean :0.5355 Mean :1.573
3rd Qu.: 53.75 3rd Qu.: 6.000 3rd Qu.:1.0000 3rd Qu.:2.000
Max. :165.00 Max. :10.000 Max. :1.0000 Max. :5.000
etc…
In order to be able to refer to the variables in a data frame directly by their name, instead
of using framename$variablename, you can attach the data frame. For example, if you
use the mean command with X, you get an error message:
> mean(X)
Error in mean(X) : Object "X" not found
After attaching the data frame, this command works fine:
> attach(balt01)
> mean(X)
[1] 911.646
After you are done using a given data frame, it is good practice to detach it:
> detach(balt01)
This frees up work space and avoids possible naming conflicts.
Converting GAL weights to neighbor objects – Old Format, Sequence Numbers
Contiguity information on spatial objects is contained in a neighbor object (nb) in spdep.
This contiguity information can be constructed for polygon shape files using the Create
Weights function in GeoDa. The resulting GAL file (text file with file extension .gal) can
be read into spdep by means of the read.gal function. However, care must be taken with
the newer format used by GeoDa. Before delving into that, consider the “old” format
case, where the first line in the file gives the number of observations. The remainder of
the file gives, for each observation, the observation ID, the number of neighbors and then
on a second line the IDs for the neighbors. For now, we will use a simple sequence
number as an ID. For example, using GeoDa to create a set of Thiessen polygons for the
BALTIM point data set (see the GeoDa User’s Guide, pp. 26-28), you can create a “rook”
type contiguity weights file. Leave the entry for the ID variable untouched, as in Figure 2.
This will use the sequence numbers as identifiers.
4
Figure 2. Creating a weights file without explicit ID variable.
This is usually dangerous, unless you can guarantee that the record sequence in the
internal storage in GeoDa will be the same as in the external data set. As long as you
create the data frame using the exported file from GeoDa, based on the same shape file,
things should be OK, but it is “living dangerously.” The explicit designation of an ID
variable is much preferred.
After creating the GAL file using this procedure, the format of the file is as in Figure 3.
The text file has a header line with the number of observations, followed by the
contiguity information, going from observation 1 to 211, in sequence.
Figure 3. GAL weights file without ID variable.
The standard read.gal function in spdep assumes that the contiguity information is
given in sequence. This has been generalized in the most recent version of read.gal
5
included in the helper files, which is considered in the next section. To have access to the
full range of options, it is best to copy the file read.gal3.R to your working directory
and source it into R:
> source("read.gal3.R")
Note that this will replace the original read.gal function included in the spdep package
with a more recent (and slightly more flexible) version. It will also add some new
functions, such as read.gal2 and read.gal3.
The read.gal function has two parameters: the file name (include the GAL file
extension), and an option to specify observation ids. The spdep manual refers to these as
region ids, which is the matching attribute of the nb object (“region.id”). When no
region ID is specified, sequence numbers are assigned. Note that the region ID must be a
character vector that is already in the work space (through attaching a data frame, for
example). Also, this is not the same as a general identifier for the observations and their
neighbors, since its value is not extracted from the GAL file, but taken from the data
frame. In the neighbor object, it is added as an additional attribute, in addition to the
contiguity information. It is used in the summary function, but is not part of the contiguity
information. In contrast to the original read.gal function in spdep version 0.1-10, the
newer function can read GAL files with both the old header (only the number of
observations) and the new header (four entries). However, it does require that the
contiguity entries are in the same sequence as the observations in the data frame.
Using the baltpolold.GAL file as the input, a neighbor object is constructed as
> baltnb1 <- read.gal("baltpolold.GAL")
and its characteristics can be obtained with the summary function. Note how the
$region.id attribute is chr[1:221], a list of sequential integer characters. The summary
characteristics describe the connectedness structure of the spatial weights. The “link
number distribution” contains the same information as the histogram of weights
characteristics in GeoDa (pp. 86-87 in the GeoDa User’s Guide).
> summary(baltnb1)
Connectivity of baltnb1 with the following attributes:
List of 5
$ class : chr "nb"
$ region.id: chr [1:211] "1" "2" "3" "4" ...
$ gal : logi TRUE
$ call : logi TRUE
$ sym : logi TRUE
NULL
Number of regions: 211
Number of nonzero links: 1198
Percentage nonzero weights: 2.690865
Average number of links: 5.677725
Link number distribution:
6
3 4 5 6 7 8 9 11
7 21 66 75 29 9 3 1
7 least connected regions:
3 6 48 97 144 157 199 with 3 links
1 most connected region:
208 with 11 links
In order to explicitly specify a region.id variable, you need to make things a bit more
interesting, since STATION is a simple sequential numbering. In GeoDa, you can use the
table calculation functions to create a new variable, NEWID, for example as 1000 +
STATION. Save this as a new shape file, say baltim2.shp. Now turn baltim2.shp into a
new data frame, first exporting the data to baltim2.txt, then reading it into a table:
> newbalt <- read.geoda("baltim2.txt")
> summary(newbalt)
STATION PRICE NROOM DWELL
Min. : 1.0 Min. : 3.50 Min. : 3.000 Min. :0.0000
1st Qu.: 53.5 1st Qu.: 30.95 1st Qu.: 5.000 1st Qu.:0.0000
Median :106.0 Median : 40.00 Median : 5.000 Median :1.0000
Mean :106.0 Mean : 44.31 Mean : 5.199 Mean :0.5355
3rd Qu.:158.5 3rd Qu.: 53.75 3rd Qu.: 6.000 3rd Qu.:1.0000
Max. :211.0 Max. :165.00 Max. :10.000 Max. :1.0000
…
Y NEWID
Min. :505.5 Min. :1001
1st Qu.:528.8 1st Qu.:1054
Median :544.5 Median :1106
Mean :544.2 Mean :1106
3rd Qu.:559.0 3rd Qu.:1159
Max. :581.0 Max. :1211
Note the NEWID variable. Make sure to detach the old data set and attach the newbalt data
frame. Now, specify NEWID as the id variable (second parameter) in the read.gal
command and follow this by the summary command to obtain the weights characteristics.
> attach(newbalt)
> baltnb2 <- read.gal("baltpolold.GAL",NEWID)
> summary(baltnb2)
Connectivity of baltnb2 with the following attributes:
List of 5
$ class : chr "nb"
$ region.id: int [1:211] 1001 1002 1003 1004 1005 1006 1007 1008 1009
1010 ...
$ gal : logi TRUE
$ call : logi TRUE
$ sym : logi TRUE
NULL
Number of regions: 211
Number of nonzero links: 1198
Percentage nonzero weights: 2.690865
Average number of links: 5.677725
Link number distribution:
7
3 4 5 6 7 8 9 11
7 21 66 75 29 9 3 1
7 least connected regions:
1003 1006 1048 1097 1144 1157 1199 with 3 links
1 most connected region:
1208 with 11 links
Note how the region.id is now a list of integer values (not characters) and how the least
and most connected regions are identified by their ID value (not by the sequence
number). In real applications, these are often more meaningful than simple sequence
numbers.
Converting GAL weights to neighbor objects – ID Variable
The new function read.gal2 is designed to handle GAL format files with contiguity
information expressed using a general ID variable. For example, the rook and queen
weights for US counties available from the CSISS spatial weights repository use the old
header format with ID variables (see http://sal.agecon.uiuc.edu/weights/index.html). Note
that GeoDa has implemented a slightly more complex header line which can also be
parsed correctly by read.gal2. To illustrate this function, consider the GAL file shown
in Figure 4. This was obtained using the newly created point shape file baltim2.shp.
First, a Thiessen polygon file was created that contains the NEWID variable, say
baltpoly2. Then baltpoly2 was used to create a rook contiguity weights file,
specifying NEWID as the ID variable, as shown in Figure 5. The file in Figure 4 resulted
after the original GAL file was edited by hand to reflect the old header line format. The
new header line contained the values 0 211 baltpoly0 NEWID, as shown in Figure 6.
All this editing can be carried out in a standard text editor, such as Notepad or Wordpad.
Figure 4. Old-style GAL contiguity file using ID variable.
8
Figure 5. Creating Rook contiguity weights and specifying an ID variable.
Figure 6. Contiguity file with new format header line.
You invoke read.gal2 essentially in the same way as read.gal, with the important
distinction that only the GAL file can be used as input (only one parameter is allowed).
For example, using baltpoly0.GAL as input, followed by the summary command, yields:
> baltnb22 <- read.gal2("baltpoly0.GAL")
> summary(baltnb22)
Connectivity of baltnb22 with the following attributes:
List of 5
9
$ class : chr "nb"
$ region.id: chr [1:211] "1001" "1002" "1003" "1004" ...
$ gal : logi TRUE
$ call : logi TRUE
$ sym : logi TRUE
NULL
Number of regions: 211
Number of nonzero links: 1198
Percentage nonzero weights: 2.690865
Average number of links: 5.677725
Link number distribution:
3 4 5 6 7 8 9 11
7 21 66 75 29 9 3 1
7 least connected regions:
1003 1006 1048 1097 1144 1157 1199 with 3 links
1 most connected region:
1208 with 11 links
Note how this is identical to the previous result. There is one important restriction to the
read.gal2 function as it is currently implemented. The order of the contiguity
information must be the same as the order of the records in the data frame. Again, if both
are generated from the same shape file using GeoDa, this should be fine. However, it is
not a good assumption to use in general and it will create problems when there is no one
to one match between the order of observations in the data frame that holds the ID
variable and the order in which the contiguity information is entered in the GAL file. In
read.gal2, the ids are constructed from the contents of the GAL file, but they are not
checked against the values of the ID variable in the data frame. When both the data frame
and the GAL file are created by GeoDa, this should be fine, but in other circumstances,
this may create errors.
Converting GAL files to neighbor objects – New Format
The new format GAL files created by GeoDa contain a slightly more elaborate header
line, as in Figure 6, with four items on the first line: 0, a place holder; 211, the number of
observations; baltpoly2, the name of the shape file from which the contiguity information
was derived; and NEWID, the ID variable. The third function contained in the
read.gal3.R helper file is read.gal3, which is the safest of the three. In read.gal3, the
ID variable name is read from the GAL file header line and matched to a variable in the
currently attached data frame. The values in the GAL file are matched to the observation
sequence in the data frame by checking the ID variable directly, and any mismatches are
detected as errors. As long as the ID variable is in fact present in the current work space,
this will guarantee a correct result, irrespective of the order in which the contiguity
information is stored in the GAL file. In addition, “islands” are dealt with correctly and
don’t generate error messages.
If the ID variable is not present in the current work space (or if there is a typo; note that
variable names are case sensitive) an error message results:
> test <- read.gal3("baltpolybad.GAL")
10
Error in get(x, envir, mode, inherits) : variable "NEWID1" was not
found
To cause the error, a new GAL file was created (baltpolybad.GAL) by changing the ID
variable name to NEWID1. The current data frame does not contain this variable, hence the
error message.
The read.gal3 function is invoked in the same manner as read.gal2 and only the file
name for the GAL file must be specified. No other parameters can be included. For
example, use the baltpoly2.GAL file from Figure 6, and invoke read.gal3 followed by
summary. This yields, as before:
> baltnb2 <- read.gal3("baltpoly2.GAL")
> summary(baltnb2)
Connectivity of baltnb2 with the following attributes:
List of 5
$ class : chr "nb"
$ region.id: chr [1:211] "1001" "1002" "1003" "1004" ...
$ gal : logi TRUE
$ call : logi TRUE
$ sym : logi TRUE
NULL
Number of regions: 211
Number of nonzero links: 1198
Percentage nonzero weights: 2.690865
Average number of links: 5.677725
Link number distribution:
3 4 5 6 7 8 9 11
7 21 66 75 29 9 3 1
7 least connected regions:
1003 1006 1048 1097 1144 1157 1199 with 3 links
1 most connected region:
1208 with 11 links
Characteristics of spatial weights.
The characteristics of spatial weights are obtained with the summary function applied to a
neighbor object, as illustrated in the examples given above. The same result is also
obtained by applying the summary.nb function, as in
> summary.nb(baltnb2)
This yields a brief description of the number of observations, the “density” of the spatial
weights matrix (percentage nonzero weights), the average number of neighbors (links)
per observation and the frequency distribution of the neighbor count. It also lists the ids
of the least and most connected observations, as well as any islands, if present. Islands
must be dealt with, either by manually editing the neighbor list to “connect” them to
some other observation or by eliminating them from the analysis. This can be
accomplished by means of the edit.nb command (see the spdep package manual, p. 20,
for technical details).
11
It is important to note that islands never count in an analysis of spatial autocorrelation,
since only connected observations are included in the computation of the statistics.
The plot command can also be used to obtain a descriptive overview of the
connectedness structure in a neighbor object. It can be applied generically to a neighbor
object, as plot, or explicitly as plot.nb. The name of the neighbor object and a matrix
of X, Y coordinates must be supplied (note the use of cbind to construct the matrix from
the data frame variables), as well as any of the other usual R graphics parameters. For
example, applied to baltnb2, this yields the plot shown in Figure 7 (with the titles added
for extra effect):
> plot.nb(baltnb2,cbind(X,Y),col="red")
> title(main="First Order Contiguity from Thiessen Polygons")
> title(xlab="Baltimore Point Data")
Figure 7. Plot of connectedness structure.
12
Spatial weights objects in spdep
The neighbor list (nb) is only one of several classes that handle spatial contiguity
information in spdep. For example, spatial weights objec
本文档为【Data and Spatial Weights in spdep】,请使用软件OFFICE或WPS软件打开。作品中的文字与图均可以修改和编辑,
图片更改请在作品中右键图片并更换,文字修改请直接点击文字进行修改,也可以新增和删除文档中的内容。
该文档来自用户分享,如有侵权行为请发邮件ishare@vip.sina.com联系网站客服,我们会及时删除。
[版权声明] 本站所有资料为用户分享产生,若发现您的权利被侵害,请联系客服邮件isharekefu@iask.cn,我们尽快处理。
本作品所展示的图片、画像、字体、音乐的版权可能需版权方额外授权,请谨慎使用。
网站提供的党政主题相关内容(国旗、国徽、党徽..)目的在于配合国家政策宣传,仅限个人学习分享使用,禁止用于任何广告和商用目的。