Latest Blog Entries

Open Source Approach: Python

Following the previous posting - Open Source Approach: R - we now look at basic GIS functions using Python packages. Although ESRI's arcpy gives us comprehensive GIS data models and methods without the need for extra spatial packages, it still requires a proprietary and expensive license. This tutorial attempts to do the same Python spatial data analyses without such proprietary technologies. Our focus will be on basic geospatial libraries in Python, especially GDAL/OGR and GEOS.

IMPORTANT! This tutorial is posted on my university webspace as an IPython Notebook (http://home.uchicago.edu/~cmaene/OpenSourceApproach_Python.html) - with all outputs/graphics included - please visit that page for better viewing.

#!/usr/bin/python
# -*- coding: utf-8 -*-  (Python version 2.7.6)

# load the libraries we will use in this tutorial..
import os, sys, ogr, osr, gdal   # with newer GDAL builds, use: from osgeo import ogr, osr, gdal
import matplotlib.pyplot as plt
# make the IPython notebook produce plots inline
%matplotlib inline

Preparation - download and unzip data, set your working directory.

# set the working directory
os.chdir('/home/chieko/Documents/Python')

## download the tutorial data - skip this part if you've done this already
## need urllib if you download a data file (see the next block) in Python
#import urllib
#urllib.urlretrieve("http://home.uchicago.edu/~cmaene/data.zip", filename="data.zip")
## unzip the file - may not work with all OS - skip this part if you've done this already
#os.system('unzip data.zip')

os.chdir('data')
# os.listdir('.') # see what's in there
print "current working directory is:", os.getcwd()

Read spatial data files with GDAL's ogr.

# read spatial objects from a CSV - requires a supplemental/descriptive metadata file, a VRT
# the VRT acts as supplemental metadata that describes the CSV's properties
with open('GPS.vrt', 'r') as vrt:
    print "This is what GPS.vrt looks like:"
    print vrt.read()

datasource=ogr.GetDriverByName('VRT').Open('GPS.vrt')
# because of the VRT description, our CSV points know their SRS!
print "Spatial Reference System (SRS) of the data source is\n", datasource.GetLayer().GetSpatialRef()
print "\nTranslate the Well-Known-Text SRS to Proj4! \n", datasource.GetLayer().GetSpatialRef().ExportToProj4()

# optional: conversion from VRT to shapefile - I think the GDAL utility ogr2ogr is the easiest way
# os.system('ogr2ogr -skipfailures -overwrite -f "ESRI Shapefile" GPS.shp GPS.vrt')
# plot the GPS points
plt.figure()
# loop through the layer's features to get x/y from each feature (point/line/polygon)
x,y=[],[]
for feature in datasource.GetLayer():
    geomRef=feature.GetGeometryRef()
    x.append(geomRef.GetX()) # get X from each point and add to the list
    y.append(geomRef.GetY()) # get y from each point and add to the list
plt.plot(x,y, 'ro', label='GPS points') # 'ro' means red o-marker
plt.legend(numpoints=1)
plt.show()

Plot the GPS points with lines and a study area polygon.

plt.figure()

# add the GPS points
gps=ogr.GetDriverByName('ESRI Shapefile').Open('GPS.shp')
for feature in gps.GetLayer():
    geomRef=feature.GetGeometryRef()
    x=[geomRef.GetX(i) for i in range(geomRef.GetPointCount())] # a point geometry has only one vertex
    y=[geomRef.GetY(i) for i in range(geomRef.GetPointCount())] # a point geometry has only one vertex
    i1,=plt.plot(x,y, 'ro')
    
# read road lines and add the lines - inspect the object first
roads=ogr.GetDriverByName('ESRI Shapefile').Open('roads.shp')
print "Line : roads"
print "- Geometry type is", roads.GetLayer().GetGeomType()
# OGRwkbGeometryType {
# wkbUnknown = 0, wkbPoint = 1, wkbLineString = 2, wkbPolygon = 3,
# wkbMultiPoint = 4, wkbMultiLineString = 5, wkbMultiPolygon = 6, wkbGeometryCollection = 7,
# wkbCircularString = 8, wkbCompoundCurve = 9, wkbCurvePolygon = 10, wkbMultiCurve = 11,
# wkbMultiSurface = 12, wkbNone = 100, wkbLinearRing = 101, wkbCircularStringZ = 1008, ...
feature0=roads.GetLayer().GetFeature(0)
print "- The feature's geometry name is", feature0.GetGeometryRef().GetGeometryName()
print "- Structure of the 1st feature is", feature0.GetGeometryRef().ExportToWkt() 
print "- Number of points in the 1st feature is", feature0.GetGeometryRef().GetPointCount()
for feature in roads.GetLayer():
    geomRef=feature.GetGeometryRef()
    x=[geomRef.GetX(i) for i in range(geomRef.GetPointCount())]
    y=[geomRef.GetY(i) for i in range(geomRef.GetPointCount())]
    i2,=plt.plot(x,y, 'r--', color='blue')
    
# read studyarea polygon and add the polygon
area=ogr.GetDriverByName('ESRI Shapefile').Open('studyarea.shp')
print "\nPolygon : study area. Feature geometry is complex - compare to the line above"
print "- Geometry type is", area.GetLayer().GetGeomType()
feature0=area.GetLayer().GetFeature(0)
print "- The feature's geometry name is", feature0.GetGeometryRef().GetGeometryRef(0).GetGeometryName()
print "- Structure of the 1st feature is", feature0.GetGeometryRef().ExportToWkt() # notice that polygon points are within extra ()
print "- Number of points in the 1st feature are", feature0.GetGeometryRef().GetGeometryRef(0).GetPointCount(), "(the first and the last points are identical!)"
for feature in area.GetLayer(): 
    geomRef=feature.GetGeometryRef()
    x=[geomRef.GetGeometryRef(0).GetX(i) for i in range(geomRef.GetGeometryRef(0).GetPointCount())]
    y=[geomRef.GetGeometryRef(0).GetY(i) for i in range(geomRef.GetGeometryRef(0).GetPointCount())]
    plt.plot(x,y, '-', color='black')
    i3,=plt.fill(x, y, 'yellow')

# draw the plot
plt.xlabel('Longitude')
plt.ylabel('Latitude')
# turn off the offset/scientific notation on the x/y axis tick labels
ax = plt.gca()
ax.get_xaxis().get_major_formatter().set_useOffset(False)
ax.get_yaxis().get_major_formatter().set_useOffset(False)
# add legend items
legend=plt.legend([i1,i2,i3],['GPS points','Roads','Study area'])
legend.get_frame().set_facecolor('white')
plt.show()

Despite its somewhat complex data model, OGR has many advantages; selecting by attribute is one. References: (1) attribute filter: https://pcjericks.github.io/py-gdalogr-cookbook/vector_layers.html#filte... , (2) SQL: http://www.gdal.org/ogr_sql.html, http://gdal.org/python/osgeo.ogr.DataSource-class.html#ExecuteSQL

# census tract mapping -
datasource=ogr.GetDriverByName('ESRI Shapefile').Open('tracts.shp')
# read SRS info
print "Spatial Reference System (SRS) of the data source is:\n", datasource.GetLayer().GetSpatialRef()
print "\nNumber of census tracts are", datasource.GetLayer().GetFeatureCount()

# start a new plot
plt.figure()

# read and plot the census tract shapefile
for feature in datasource.GetLayer():
    geomRef=feature.GetGeometryRef()
    x = [geomRef.GetGeometryRef(0).GetX(i) for i in range(geomRef.GetGeometryRef(0).GetPointCount())]
    y = [geomRef.GetGeometryRef(0).GetY(i) for i in range(geomRef.GetGeometryRef(0).GetPointCount())]
    plt.plot(x,y, '-', color='black')
    i1,=plt.fill(x,y,'white')

# one way to make a selection is to set an attribute filter
datasource.GetLayer().SetAttributeFilter("BLK>1000")
print "Number of Census tracts with BLK>1000 are", datasource.GetLayer().GetFeatureCount()
# clear previous attribute selection filter by passing None
datasource.GetLayer().SetAttributeFilter(None) 
    
# another way is to use SQL
selection=datasource.ExecuteSQL("select * from tracts where BLK>1000")
print "The SQL result is a new OGR layer:", type(selection)

# plot the selected census tracts
for feature in selection:
    geomRef=feature.GetGeometryRef()
    x = [geomRef.GetGeometryRef(0).GetX(i) for i in range(geomRef.GetGeometryRef(0).GetPointCount())]
    y = [geomRef.GetGeometryRef(0).GetY(i) for i in range(geomRef.GetGeometryRef(0).GetPointCount())]
    plt.plot(x,y, '-', color='black')
    i2,=plt.fill(x, y, 'blue')
plt.gca().set_aspect('equal', adjustable='box') # make axes intervals the same
plt.legend([i1,i2],['Tracts','BLK>1000'], loc='lower left')
plt.show()

Here we try a buffer operation - an OGR function built on the GEOS library.

# inputs are Chicago Train stations
sts=ogr.GetDriverByName('ESRI Shapefile').Open('CTAbrown.shp')
print "Number of features in the CTAbrown.shp is", sts.GetLayer().GetFeatureCount()

# Let's create a buffer around the stations, will also merge overlapping buffers
stsBufferMerge = ogr.Geometry(ogr.wkbPolygon)  # create an empty polygon
# for each feature/point create a buffer
for feature in sts.GetLayer():
    geomRef=feature.GetGeometryRef()
    stsBuffer=geomRef.Buffer(2640)
    # then merge the new buffer with the merge-buffer-polygon
    stsBufferMerge=stsBufferMerge.Union(stsBuffer)
print "Number of geometries in the buffer is", stsBufferMerge.GetGeometryCount()
print "Total area of the buffer is", format(stsBufferMerge.GetArea(),'.0f'), "square feet."

# plot
plt.figure()
x,y=[],[]
# collect XY of all station points
sts=ogr.GetDriverByName('ESRI Shapefile').Open('CTAbrown.shp')
for feature in sts.GetLayer():
    geomRef=feature.GetGeometryRef()
    x.append(geomRef.GetX()) # get X from each point and add to the list
    y.append(geomRef.GetY()) # get y from each point and add to the list
i1,=plt.plot(x,y, 'ro', color='brown')
# add the merged buffer - no loop, only one geometry
x = [stsBufferMerge.GetGeometryRef(0).GetX(i) for i in range(stsBufferMerge.GetGeometryRef(0).GetPointCount())]
y = [stsBufferMerge.GetGeometryRef(0).GetY(i) for i in range(stsBufferMerge.GetGeometryRef(0).GetPointCount())]
i2,=plt.plot(x,y, '-', color='saddlebrown')
plt.gca().set_aspect('equal', adjustable='box')
plt.legend([i1,i2],['Input','Buffer'], loc='lower left')
plt.show()

Reprojection - changing the spatial reference system (SRS). Here we reproject the apartments so they are saved in the same SRS as the buffer. I am using the GDAL command-line utility ogr2ogr, which is much easier than going through the OGR/OSR Python API (a sketch of the OSR route follows the code below).

# compare SRS of the two shapefiles
sts=ogr.GetDriverByName('ESRI Shapefile').Open('CTAbrown.shp')
print "SRS of CTA stations is:\n", sts.GetLayer().GetSpatialRef().ExportToProj4()
apt=ogr.GetDriverByName('ESRI Shapefile').Open('apartments.shp')
print "\nSRS of the original apartments is:\n", apt.GetLayer().GetSpatialRef().ExportToProj4()

# copy SRS from "CTA stations" in proj4, then apply it to reproject "apartments" which was originally in WGS84
# if we knew EPSG code, we could also use the code: ex. tosrs='EPSG:102671'
tosrs=sts.GetLayer().GetSpatialRef().ExportToProj4()
fmsrs=apt.GetLayer().GetSpatialRef().ExportToProj4()
args='ogr2ogr -overwrite -f "ESRI Shapefile" -t_srs "'+tosrs+'" -s_srs "'+fmsrs+'" apartmentsNEW.shp apartments.shp'
os.system(args)

# use the reprojected apartments
apt=ogr.GetDriverByName('ESRI Shapefile').Open('apartmentsNEW.shp')
print "\nSRS of the reprojected apartments is:\n", apt.GetLayer().GetSpatialRef().ExportToProj4()

Another OGR advantage is the ability to select objects by location. We will use the previously merged buffer to select apartments.

# plot all apartments
plt.figure()
x,y=[],[]
# collect XY of all apartment points
apt=ogr.GetDriverByName('ESRI Shapefile').Open('apartmentsNEW.shp')
for feature in apt.GetLayer():
    geomRef=feature.GetGeometryRef()
    x.append(geomRef.GetX()) # get X from each point and add to the list
    y.append(geomRef.GetY()) # get y from each point and add to the list
i1,=plt.plot(x,y, 'bo')

# make a selection by location/buffer zone using a spatial filter
apt.GetLayer().SetSpatialFilter(stsBufferMerge)
print "Total number of selected apartments are:", apt.GetLayer().GetFeatureCount()

# plot selected apartments
x,y=[],[]
# collect XY of the selected apartment points
for feature in apt.GetLayer():
    geomRef=feature.GetGeometryRef()
    x.append(geomRef.GetX()) # get X from each point and add to the list
    y.append(geomRef.GetY()) # get y from each point and add to the list
i2,=plt.plot(x,y, 'ro')

# add the merged buffer - no loop, only one geometry
x = [stsBufferMerge.GetGeometryRef(0).GetX(i) for i in range(stsBufferMerge.GetGeometryRef(0).GetPointCount())]
y = [stsBufferMerge.GetGeometryRef(0).GetY(i) for i in range(stsBufferMerge.GetGeometryRef(0).GetPointCount())]
i3,=plt.plot(x,y, '-', color='saddlebrown')

# put all together
plt.gca().set_aspect('equal', adjustable='box')
plt.legend([i1,i2,i3],['Input','Selected','Buffer'], bbox_to_anchor=(1,1), bbox_transform=plt.gcf().transFigure)
plt.show()

Let's calculate the distance from each apartment to the nearest CTA station using geopy's vincenty function!

# import geodesic distance calculation function
from geopy.distance import vincenty

# vincenty takes geographic (latitude, longitude) coordinates only - reproject the CTA Brown stations to WGS84
args='ogr2ogr -overwrite -f "ESRI Shapefile" -t_srs EPSG:4326 CTAbrownWGS84.shp CTAbrown.shp'
os.system(args)
sts=ogr.GetDriverByName('ESRI Shapefile').Open('CTAbrownWGS84.shp')
# collect all station IDs & xy coordinates in the two lists
stsId,xy=[],[]
for feature in sts.GetLayer():
    #print feature.GetField("STATION_ID")
    stsId.append(feature.GetField("STATION_ID"))
    geomSts=feature.GetGeometryRef()
    x=geomSts.GetX()
    y=geomSts.GetY()
    xy.append((y,x))  # vincenty expects (lat, lon), i.e. (y, x)

# print the calculation result
print "\nAPTID PRICE NEAR_STATIONID DISTANCE_MILE"
# for each apartment, calculate distance and identify the nearest station
apt=ogr.GetDriverByName('ESRI Shapefile').Open('apartments.shp')
for feature1 in apt.GetLayer():
    geomApt=feature1.GetGeometryRef()
    x1=geomApt.GetX()
    y1=geomApt.GetY()
    # lambda function lets us iterate vincenty over xy list
    # map() collects and saves the result in the dist list
    dist=map(lambda i:vincenty((y1,x1),i).miles, xy) 
    print feature1.GetField("APTID"),feature1.GetField("PRICE"),stsId[dist.index(min(dist))], format(min(dist),'5.3f')

Optional: GDAL/OGR is not the only Python package for reading and writing spatial objects. PyShp's shapefile module has a much simpler object model than OGR, though it handles shapefile I/O only.

# make sure to install PyShp, which contains the shapefile module
import shapefile

# read GPS point shapefile
gps=shapefile.Reader('GPS.shp')
# PyShp's shapefile: get all records
records=gps.records()
# get the 3rd record and its first three attribute values
print records[2][0:3]

# plot as simple data objects with points and lines
plt.figure()
for shp in gps.shapeRecords():
    x = [i[0] for i in shp.shape.points[:]]
    y = [i[1] for i in shp.shape.points[:]]
    i1,=plt.plot(x,y, 'ro')
# read road lines and add the lines
roads=shapefile.Reader('roads.shp')
for shp in roads.shapeRecords():
    x = [i[0] for i in shp.shape.points[:]]
    y = [i[1] for i in shp.shape.points[:]]
    i2,=plt.plot(x,y, 'r--', color='blue')
# read studyarea polygon and add the polygon
area=shapefile.Reader('studyarea.shp') 
for shp in area.shapeRecords(): 
    x = [i[0] for i in shp.shape.points[:]] 
    y = [i[1] for i in shp.shape.points[:]]
    plt.plot(x,y, '-', color='green')
    i3,=plt.fill(x, y, 'yellow')
# draw the plot
plt.xlabel('Longitude')
plt.ylabel('Latitude')
# turn off the offset/scientific notation on the x/y axis tick labels
ax = plt.gca()
ax.get_xaxis().get_major_formatter().set_useOffset(False)
ax.get_yaxis().get_major_formatter().set_useOffset(False)
# add legend items
legend=plt.legend([i1,i2,i3],['GPS points','Roads','Study area'])
legend.get_frame().set_facecolor('white')
plt.show()

Finally, a raster example - suggestions for reading and plotting.

# Read raster data using GDAL
import gdal
raster=gdal.Open('elevation.asc') # it's a simple ASCII text file.

# Learn about the raster data
print "A driver used to open the raster is %s" % raster.GetDriver().GetDescription()
print "There is %d band in the raster." % raster.RasterCount
print "Dimention of the raster is %d and %d" % (raster.RasterXSize, raster.RasterYSize)
# band=raster.GetRasterBand(1) # In GDAL, band numbers start at 1, rather than 0.
# print band.GetMetadata()

# what transformation info is available?
transform=raster.GetGeoTransform()
print "xOrigin     =", format(transform[0],'8.3f')    # upper-left X
print "yOrigin     =", format(transform[3],'8.3f')    # upper-left Y
print "pixelWidth  =", format(transform[1],'8.5f')
print "pixelHeight =", format(transform[5],'8.5f')
xmin,xmax,ymin,ymax=transform[0],transform[0]+transform[1]*raster.RasterXSize,transform[3]+transform[5]*raster.RasterYSize,transform[3]
print "xmin,xmax,ymin,ymax are:", xmin,xmax,ymin,ymax
rarray=raster.ReadAsArray()
print "The lowest elevation value is ", rarray.min()
print "The highest elevation value is ", rarray.max()

# plot image/raster like Matlab/Octave
plt.imshow(rarray,extent=[xmin,xmax,ymin,ymax])
plt.colorbar()

# add the GPS points, as a reference
gps=ogr.GetDriverByName('ESRI Shapefile').Open('GPS.shp')
x,y=[],[]
for feature in gps.GetLayer():
    geomRef=feature.GetGeometryRef()
    x.append(geomRef.GetX()) # get X from each point and add to the list
    y.append(geomRef.GetY()) # get y from each point and add to the list
plt.plot(x,y, 'ro', label='GPS points')
legend=plt.legend(numpoints=1)
legend.get_frame().set_facecolor('white')

# turn off the offset/scientific notation on the x/y axis tick labels
ax = plt.gca()
ax.get_xaxis().get_major_formatter().set_useOffset(False)
ax.get_yaxis().get_major_formatter().set_useOffset(False)
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()

Thank you for reading!

Open Source Approach: R

Assuming you have GDAL and GEOS available on your system, we can start geospatial data analysis in R! In the R spatial packages universe, sitting at the top of the hierarchy is sp. The sp package provides classes to describe vector/geometry and raster data, and also includes basic methods. It is my understanding that sp defined how we model geospatial data in R, and other R packages follow those models. However, sp does not provide all the methods/tools we will need, so we augment it with additional R spatial packages. First, we should obtain rgdal to read/write geospatial data in various formats and to reproject data (i.e. spatial reference transformation). There are other options for reading/writing data, but I think rgdal is the best. rgeos is also useful, supplying vector analysis functions. See the tutorial below for more on spatial packages and their functions.

Note - this tutorial is also posted on RPubs (http://rpubs.com/cmaene/opensourceapproach_r) - there, outputs/graphics are included, so it may be easier to follow than this blog post. I am also making the accompanying dataset, data.zip, available - please download it from my homepage as noted below.

The first part starts with simple operations - reading and plotting geospatial data.

download.file("http://home.uchicago.edu/~cmaene/data.zip", destfile = "data.zip")
unzip("data.zip")

setwd("data")
install.packages(c("sp","rgdal")) # skip this if you installed these previously
library(sp)

# create spatial point data from XY (GPS-like) table
GPSdata <- read.csv("GPS.csv")
class(GPSdata)              # data frame
GPSpt1 <- SpatialPoints(GPSdata[,c("x","y")])
class(GPSpt1)               # SpatialPoints class - no attributes
summary(GPSpt1)
GPSpt2 <- SpatialPointsDataFrame(GPSdata[,c("x","y")], GPSdata, match.ID=FALSE)
summary(GPSpt2)             # SpatialPointsDataFrame - with attributes
proj4string(GPSpt2) <- CRS("+proj=longlat +datum=NAD83")
summary(GPSpt2)             # SpatialPointsDataFrame - with attributes & SRS

# read shapefiles with GDAL's OGR
library(rgdal)            # upload rgdal!
roads <- readOGR(dsn=".", layer="roads", verbose=TRUE)
class(roads)              # see what consists of the roads object
summary(roads)            # OGR reads .prj/projection info

# read grid/raster data with GDAL
raster <- readGDAL("elevation.asc")
class(raster)             # SpatialGridDataFrame
#plot(raster)             # not what I expected
image(raster)             # better..

# add GPS points and road lines to the raster plot
plot(roads, col="green", add=T)
plot(GPSpt2,  col="blue", add=T)

Let's try classic GIS functions - buffer & overlay. In this scenario, we will identify apartments within a half-mile of the preferred train stations.

install.packages("rgeos", repos="http://cran.us.r-project.org") # skip if installed previously
library(rgeos)
setwd("data")
sts <- readOGR(dsn=".", layer="CTAbrown", verbose=FALSE)
# check the spatial/coordinate reference system (SRS/CRS) information of "sts"
# because the subsequent buffer unit will be determined by it!
proj4string(sts)           #  unit=US feet (us-ft). Hard to tell but it's Illinois State Plane CS, Zone East.
# create buffer - define half-mile (2640 ft) from the train stations
stsBuffer <- gBuffer(sts, width=2640)

# overlay to find how many apartments fall in the train station buffer zones
apts <- readOGR(dsn=".", layer="apartments", verbose=FALSE)
proj4string(apts)           # in WGS84
# reproject "apts" in the same CRS as "sts"
apts <-spTransform(apts, CRS(proj4string(sts)))

# 1: using normal subset with sp objects! kinda cool, I found..
aptsSubset <- apts[stsBuffer,]
class(aptsSubset)           # SpatialPointsDataFrame
head(aptsSubset@data)
plot(stsBuffer)
plot(aptsSubset, col="red", pch=17, add=TRUE)

# 2: using sp's over
aptsOverIndex <-over(apts,stsBuffer)
aptsOver      <-apts[!is.na(aptsOverIndex),]
class(aptsOver)             # SpatialPointsDataFrame

# 3: using rgeos'gIntersection
aptsInside  <-gIntersection(apts, stsBuffer)
aptsOutside <-gDifference(apts, stsBuffer)
# rgeos' gIntersection returns SpatialPoints objects (without data frame/attributes)
class(aptsInside)

# 4: using rgeos' gIntersects - it returns a logical (T/F) matrix
#    which we can use then to create a new SpatialPointsDataFrame (SPDF) object
aptsInsideTF<-gIntersects(apts, stsBuffer, byid=TRUE)
aptsInside  <-apts[as.vector(aptsInsideTF),]  # note: turn matrix to vector for subsetting
class(aptsInside)            # now aptsInside is SPDF

# plot to see if worked
plot(stsBuffer, axes=TRUE)
plot(sts, pch=19, cex=0.5, add=TRUE)
plot(aptsInside, col="red", pch=8, add=TRUE)
plot(aptsOutside, col="blue", pch=3, add=TRUE)
legend("topright", legend=c("CTA stations","Inside","Outside"), cex=0.8, bty="n", lwd=2, col=c("black","red","blue"), pch=c(19,8,3), lty=c(NA,NA,NA))
# add a label - Apartment IDs
labelxy <- coordinates(aptsInside)
text(labelxy,labels=aptsInside@data$APTID, cex=0.8)
# add a title
title(main="Apartments Selected \nwithin 1/2 mile from CTA Brown")

Since we have appropriate datasets, let's calculate distance with rgeos - for each of the selected apartments, we want to know the closest CTA Brown station and the distance to it.

# rgeos' gDistance calculates cartesian/Euclidean minimum distance.
# for geodesic distance - consider SDMTools (vincenty method) or other tools.
proj4string(sts)
proj4string(aptsInside)
# calculate distance to the nearest CTA Brown stations
dist <- gDistance(aptsInside, sts, byid=TRUE)
# get the nearest station ID and the distance value (in US-feet)
distMinNearSts <- apply(dist, 2, function(x) which.min(x))
distMinValue   <- apply(dist, 2, function(x) min(x))
distancedata   <- cbind(aptsInside@data, distMinNearSts, distMinValue)
distancedata  # check the result

Next - let's create some maps - choropleth & dot density maps. In addition to the previous two libraries, sp and rgdal, I am adding a color palette library, RColorBrewer - a must-have library for pretty cartographic work. We will also load classInt to divide values into classes and maptools to create a dot density map. maptools is a Swiss Army knife-like GIS toolset - it often comes in handy for familiar GIS operations.

install.packages(c("RColorBrewer","classInt","maptools"), repos="http://cran.us.r-project.org") # skip if installed previously
library(classInt)
library(maptools)
library(RColorBrewer)
# here are the available color palettes
par(mar = c(1, 3, 1, 1))
display.brewer.all()

setwd("data")
# Chicago census tracts
tracts <-readOGR(dsn=".", layer="tracts", verbose=TRUE)
# define equal-frequency interval class
age_under5 <- tracts@data$AGEU5
nclr <- 5
colors <- brewer.pal(nclr, "YlOrBr")
class <- classIntervals(age_under5, nclr, style="quantile")
colorcode <- findColours(class, colors)

# plot a choropleth map
plot(tracts, col = colorcode)
title(main = "Age Under 5: Quantile (Equal-Frequency)")
legend("topright", legend=names(attr(colorcode, "table")), fill=attr(colorcode, "palette"), cex=0.6, bty="n")

# prepare for a dot density map
hispanic <- tracts@data$HISPANIC
dotper <- hispanic/500
dothispanic <- dotsInPolys(tracts, as.integer(dotper), f="random")
plot(tracts)
plot(dothispanic, col="brown", pch=19, cex=0.2, add=TRUE)
title(main="Dot Density Map: dot=500 persons")

# add dots for black & a legend
black <- tracts@data$BLK
dotper2 <- black/500
dotblack <- dotsInPolys(tracts, as.integer(dotper2), f="random")
plot(dotblack, col="blue", pch=19, cex=0.2, add=T)
legend("bottomleft", legend = c("Hispanic", "Black"), fill=c("brown", "blue"), bty="n")

Raster analysis examples - based on weather station records, we (1) interpolate to create a precipitation surface map and then (2) summarize the values by state to find the wettest and driest states. We will add the raster and gstat packages.

install.packages(c("raster","gstat"), repos="http://cran.us.r-project.org") # skip if installed previously
library(raster)
library(gstat)
setwd("data")
# Part 1: interpolate to create precipitation surface map
weathersts <-readOGR(dsn=".", layer="weatherst", verbose=FALSE)
# create a regular interval (1-degree) grid
ext <-bbox(weathersts)
grid <- expand.grid(x=seq(ext[1,1],ext[1,2],by=1),y=seq(ext[2,1],ext[2,2],by=1))
coordinates(grid)<-c("x","y")   # turn grid into a spatial point object
proj4string(grid) <- proj4string(weathersts)
# idw predicts values for the regularly spaced grid using inverse-distance (power 2) interpolation
idw <- idw(PYR ~ 1, weathersts, grid, idp=2)
# turn the result into raster grid
idwgrid <- rasterFromXYZ(as.data.frame(cbind(grid$x, grid$y, idw@data$var1.pred)))
# add state lines
states <- readOGR(dsn=".", layer="states48", verbose=FALSE)
plot(idwgrid)
plot(states, add=T)

# Part 2: summarize the interpolated values by state to find the driest and wettest states
meanPYR <- extract(idwgrid, states, fun=mean)
states@data<-cbind(states@data,meanPYR)
nclr<-11
color<-brewer.pal(nclr, "RdYlGn")
class <- classIntervals(states@data$meanPYR, nclr, style="quantile")
colorcode<-findColours(class, color)
plot(states, col=colorcode, axes=TRUE)
# top 5 driest states
sort<-states@data[order(states@data$meanPYR),]
head(sort, 5)
# top 5 wettest states
sort<-states@data[order(states@data$meanPYR, decreasing=TRUE),]
head(sort, 5)

That's all for today!

Open Source Approach: GDAL & GEOS

If you decide to try out spatial analyses outside proprietary GIS software like ArcGIS, the first thing you want to do is install GDAL and GEOS (they come with QGIS!) These are the must-have geospatial libraries for anyone planning to analyze spatial data. Many of the geospatial packages we use in R and Python, for example, rely on these two libraries - meaning we pretty much need them before we can use the spatial packages in R and Python!

GDAL, the Geospatial Data Abstraction Library, is first and foremost the library we want, in my opinion. GDAL was initially developed by Frank Warmerdam (now with Google, I believe) as a suite of tools to handle and analyze a wide range of spatial data file formats via a collection of drivers. It comes, a tiny bit confusingly, in two parts: GDAL for raster data and OGR for vector data manipulation. It also comes with a spatial reference class that links to a projection library for defining and transforming between coordinate systems. GDAL also ships with handy command-line utilities we can use to manipulate geospatial data. Below are just a few examples that showcase GDAL/OGR's capabilities.

# check version
gdalinfo --version

# list GDAL supported formats available to you
gdalinfo --formats

# learn about your data - raster example
gdalinfo elevation.asc

# generate tif from raster ASCII ArcInfo Grid
gdal_translate raster.txt raster.tif

# check version (though, it would be the same as GDAL)
ogrinfo  --version

# list OGR supported formats available to you
ogrinfo --formats

# learn about your data - shapefile features summary example
ogrinfo -so Counties.shp -sql "SELECT * FROM Counties"

# translate/convert an ASCII ArcInfo Generate vector file to a shapefile
ogr2ogr -overwrite -f "ESRI Shapefile" lines.shp lines.txt

# translate/convert Shapefiles to Google KML
ogr2ogr -skipfailures -overwrite -f "KML" lines1.kml lines.shp

# translate/convert OpenStreetMap OSM format to a SQLite/SpatiaLite DB
# note: this works only with added spatialite capacity
# you also need OSM data downloaded from Openstreetmap 
# + change projection/spatial reference system
# + keep foreign character encoding
ogr2ogr --config OSM_USE_CUSTOM_INDEXING=NO -skipfailures -f "SQLite" -dsco SPATIALITE=YES overpass.db overpass.osm -overwrite -lco ENCODING=UTF-8 -s_srs "EPSG:4326" -t_srs "EPSG:3095"

# spatial analysis with OGR!
# e.g. overlay/intersect analysis - GPS.shp (points) over Counties.shp (polygons)
#      assign county IDs to all GPS points.
#      results.shp will be a point shapefile with attributes from both inputs
ogr2ogr -dialect SQLITE -sql "SELECT ST_Intersection(A.geometry, B.geometry) AS geometry, A.*, B.* FROM Counties A, GPS B WHERE ST_Intersects(A.geometry, B.geometry)" . . -nln results

# get an average apartment lease fee for each census tract (if the tract contains any apartments)
ogr2ogr -dialect SQLITE -sql "SELECT A.geometry AS geometry, A.*, SUM(B.PRICE)/COUNT(B.PRICE) AS meanprice FROM tracts A, apartmentsNEW B WHERE ST_Intersects(A.geometry, B.geometry) GROUP BY A.CTIDFP00" . .  -nln results -overwrite

# calculate the shortest distance to the nearest CTA Brown Line station
ogr2ogr -dialect SQLITE -sql "SELECT A.geometry AS geometry, A.*, MIN(Distance(A.geometry, B.geometry)) AS distance FROM apartmentsNEW A, CTAbrown B GROUP BY A.APTID" . .  -nln results -overwrite

# buffer (1) no merging/dissolving overlapping buffers, keep all individual buffers
ogr2ogr -dialect SQLITE -sql "SELECT Buffer(A.geometry,2640) AS geometry, A.* FROM CTAbrown A GROUP BY A.STATION_ID" . .  -nln results -overwrite

# buffer (2) dissolve overlapping buffers
ogr2ogr -dialect SQLITE -sql "SELECT ST_Union(Buffer(A.geometry,2640)) AS geometry FROM CTAbrown A" . .  -nln results -overwrite

 

GEOS, Geometry Engine - Open Source, handles complex vector/geometric objects and adds advanced GIS functionality (topological operations) not available in GDAL, and thus nicely complements GDAL. It is the force behind the major spatial operations - buffer, overlay, distance calculation between spatial objects, etc. - that make "spatial" analysis truly "special". Unlike GDAL, GEOS doesn't include standalone utilities (aside from geos-config). Instead, its functionality is accessed through a variety of applications, including R packages (e.g. rgeos) and Python packages (e.g. Shapely).
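
To illustrate, here is a minimal Python sketch of GEOS at work through Shapely (assuming Shapely is installed; the coordinates are made up and in feet):

from shapely.geometry import Point
from shapely.ops import unary_union
stations = [Point(0, 0), Point(3000, 0)]       # two hypothetical station locations
buffers = [p.buffer(2640) for p in stations]   # GEOS buffer: half a mile around each point
merged = unary_union(buffers)                  # GEOS union: dissolve the overlapping buffers
print merged.area                              # GEOS area calculation
print merged.contains(Point(1000, 500))        # GEOS topological test: point-in-polygon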

Installing these libraries can be tricky, as there are many ways to do it - from compiling and installing from source (most flexible) to downloading and installing binaries or packages from repositories. The easiest way, I found, is to simply install QGIS desktop, the most popular open source desktop GIS, which automatically installs the GDAL and GEOS libraries under the QGIS installation path. To see if GDAL and GEOS are properly installed, test the following commands in R or Python (a slightly fuller Python check is sketched after the list). If you don't get an error message, you are good to go!

  • R: rgdal & rgeos are the R bindings for the GDAL & GEOS libraries
    install.packages(c("rgdal","rgeos"))
    library(rgdal); library(rgeos)
  • Python (version 2.7 is the version used by the QGIS Python console):
    from osgeo import gdal, ogr
    # (or simply: import gdal, ogr)
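
A slightly fuller Python check might look like the following sketch - it confirms the import, reports the GDAL version, and runs one GEOS-backed operation (Buffer), which errors out if GDAL was built without GEOS support:

from osgeo import gdal, ogr
print gdal.VersionInfo('RELEASE_NAME')        # e.g. 1.11.x
print ogr.GetDriverCount(), "OGR drivers available"
pt = ogr.CreateGeometryFromWkt('POINT (0 0)')
print pt.Buffer(100).GetArea()                # needs GEOS - an error here means no GEOS support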
    

Disclaimer: please install the libraries and applications at your own risk, as individual situations can vary widely depending on operating systems, versions, etc. Luckily, there is a lot of helpful information on the web on this subject, but I am also adding a few of my personal notes on installation issues below.

  • Windows users: after installing QGIS, if the commands above cause errors in R or Python, check (and/or set) the PATH (C:\Program Files\QGIS xxx\bin) and PYTHONPATH (C:\Program Files\QGIS xxx\apps\Python27\Lib\site-packages) environment variables; a session-level alternative is sketched after this list. Note: be aware that setting PYTHONPATH may affect ArcGIS tools if ArcGIS is on the same Windows machine (remove/modify the PYTHONPATH environment variable accordingly if you need to use ArcGIS tools).
  • When installing on Windows, keep in mind which architecture your libraries and applications (R/Python) are built for, i.e. 32-bit vs. 64-bit - we can't use 64-bit libraries with 32-bit applications.
  • Linux users may want to compile and install GDAL from source to maximize its capabilities, such as adding more format drivers or choosing a specific version. I installed from source to read/write File Geodatabases (FileGDB), SpatiaLite/SQLite, and advanced Google KML files (LIBKML).
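
For Windows users who prefer not to change environment variables globally, here is a session-level alternative as a sketch only - adjust "QGIS xxx" to your installed version, and note it targets the Python 2.7 bundled with QGIS:

import os, sys
qgis = r'C:\Program Files\QGIS xxx'                           # example path - adjust to your install
os.environ['PATH'] = qgis + r'\bin;' + os.environ['PATH']     # so the GDAL/GEOS DLLs can be found
sys.path.append(qgis + r'\apps\Python27\Lib\site-packages')   # so the osgeo bindings can be imported
from osgeo import gdal, ogr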

I use 64-bit Windows 7 and Ubuntu/Linux. Let me know if you encounter problems on these two systems..