Crime in Downtown Houston, Texas : Combining ggplot2 and Google Maps - brodieG/ggplot2 GitHub Wiki
By David Kahle, Ph. D. Candidate, Department of Statistics, Rice University
The current work details some new methods for the visualization of spatial data in R using ggplot2 and applies the new methods to the analysis of violent crimes in downtown Houston, Texas. In particular, this article combines the power of ggplot2 and Google Maps to provide a nice exploratory analysis of hotspots of criminal activity in and around downtown Houston.
Crime data were compiled from the Houston Police Department’s website at http://www.houstontx.gov/police/cs/stats2.htm as far back as June of 2009. The data were lightly cleaned, aggregated using the plyr package, and geocoded using Google Maps (to the center of the block, e.g., 6150 Main St.); the full data set used is available on my website at https://sites.google.com/site/davidkahle/online-resources/ggplot2-2010-case-study. The analysis performed only concerns data from January 1, 2010 through August 31, 2010 on the violent crimes of aggravated assault, robbery, rape and murder. Note that while some effort was made to try to ensure the quality of the data, the data were only leisurely cleaned for my personal interest, so the data set may still contain errors.
The tile geom in ggplot2 is a flexible tool which can be used to plot raster images using the ggplot2 engine. The main idea is to load an image into R using the read.jpeg function in the ReadImages package (say), convert the array of rgb colors to a matrix of hexadecimal strings, and reshape the matrix into a rectangular data frame with one column for the pixel row index, one column for the pixel column index, and then a last column for the pixel fill color. Using this data frame, images can then be plotted by combining the tile geom with scale_fill_identity. This is precisely what the ggimage function does. A graphic demonstrating how the ggimage command works is provided below on the artwork of Garrett Grolemund; the final step of the ggimage function simply removes the axes and margins.
NB – Making tiled images is very intensive because there are many tiles to plot. For this reason I prefer saving the plots by taking screenshots and then simply cropping out what I want. This way I can be sure to get exactly the image I see on the screen.
Using the tile geom in this way can be used for static maps as well. The primary function used for this task is the ggooglemap function, which can be thought of as sophisticated wrapper for the GetMap function from Markus Loecher’s RgoogleMaps package tailored for use with ggplot2. Specifically, the function grabs a static map of a given type (satellite, road, etc.) from Google Maps in a convenient way, references each pixel of the image geographically, and returns a data frame containing variables lon, lat, and fill corresponding to the longitude/latitude of each pixel and the fill color in hexadecimal. The result is a data frame which can be used to make a tile layer depicting the Google Map of the desired area upon which one can use the full range of ggplot2 features – all aesthetics and geoms save the fill, which has already been used for the map layer. As we see with the weather map below, even the fill aesthetic can be used if the scales are trained separately.
NB – To use the ggooglemap function for yourself, you must obtain a Google Maps API key. You can get an API key free of charge from http://code.google.com/apis/maps/signup.html.
In this work we are interested in determining areas with relatively high levels of violent criminal activity in and around downtown Houston. To this end, we restrict ourselves exclusively to the violent crimes of aggravated assault, robbery, rape, and murder. As a spatial reference point, we consider downtown to be the area immediately surrounding city hall (which can be geocoded with Google Earth).
The first step is to grab and plot the map which best represents the geographical extent of interest. This extent is portrayed in the figure below using the methods described above. It is this area we consider to be downtown Houston in this article.
The stage is now set : we have the map from google maps and we have the crime data from the Houston Police Department which has been geocoded using Google Maps. We can now combine the two in any way we want. To begin, we use a geom_point layer to pinpoint every individual incident of violent crime in downtown Houston since the beginning of the year. Since these crimes are provided by the Houston Police Department by city block and later geocoded to the middle of the block, a significant amount of overplotting occurs. Thus, the points are jittered to reduce the effects of overplotting. While we could use the alpha aesthetic for such a task, we stop short at the jittering because the alpha aesthetic diminishes the visibility of each of the points. Since the points are being plotted on top of a fairly heterogeneous set of colored tiles, we need to enhance their visibility as much as possible.
The geom_point layer helps us to see individual incidences of violent criminal activity, but it isn’t as useful as an at-a-glance description of crime downtown. For this, a contour layer is used (disregarding crime type).
While the “Violent Crime Density” is not very meaningful in terms of a quantity which can be readily interpreted, the plot is useful for recognizing relative hotspots, of which there appear to be three. Each of these three correspond to locations commonly held by Houstonians to be particularly dangerous locations. The lesser of these, near the intersection of Westheimer and Montrose, is known to have high rates of prostitution. The next most dangerous of these, near West Gray and Main, is known to be a criminal hotspot because of a Greyhound Lines bus station located there. The most serious hotspot however is located to the north, around the intersection of Congress and Louisiana on the north side of downtown. This spot is less than half a mile (five blocks or so) from the county jail. Individuals are released from the jail daily at midnight and noon and tend to loiter in the nearby area.
Contour plots can be difficult for untrained people to understand. Fortunately, everyone is familiar with weather maps and what they mean. Thus, making weather maps to convey areas of heightened criminal activity can be preferable to the contour plot provided above. Unfortunately, creating good weather maps presents a range of difficulties. Methodologically, the natural way to make a weather overlay with ggplot2 is to add another geom_tile layer and fill the tiles according to the crime density. That being said, the fill aesthetic is already in use, so the fill scale must be trained a priori. While it violates the basic principles of the grammar of graphics, these kinds of weather maps can be made using a combination of pre-training scales and careful alpha blending. Like the contour plots, their precise meaning is ambiguous; however, the plot effectively highlights regions of enhanced criminal activity. In order to diminish the visual confusion caused by the two fill scales, the weather map is inlaid with a blank map with the same weather overlay. This inlay can then be used to more accurately understand the weather map.
The map is by no means perfect. For instance, regions of high activity are obscured by the overlay itself. We can try to limit this effect by limiting the alpha value to .9 (say) instead of 1; however, the more we limit the alpha value, the less useful the weather overlay becomes, so it is important to try to strike a useful balance.
In this case study we have used the power of ggplot2 combined with the detailed spatial information of Google Maps to provide a useful analysis of violent crime hotspots in downtown Houston, Texas. Using Google Maps in R with ggplot2 is beneficial in myriad ways. First and foremost, it allows analysts to visualize their data on a familiar geographic scale which is kept consistent by the ggplot2 framework. It is far easier to interpret spatial information when referenced by common landmarks than by political boundaries such as county lines or census tracts (which is oftentimes how we would view this data in R otherwise). Second, it allows us to better understand and check spatial data from different sources. Although not a topic in this case study, it is an easy exercise to extract data from shape files for use with the above framework, an example of which is below (code not provided). Third, it allows the analyst to make graphics which are much more readily interpretable by third parties. And finally, it just looks good.
One last point might be helpful. In making these kinds of plots, one might tempted to use the map raster file itself as a background. This method can be used to make map plots much more quickly than the methods described above. However, the method has one very significant disadvantage which, if not handled properly, can destroy the entire purpose of using the map. Specifically, by forcing the map image to be the background of the plot, the user bypasses the spatial referencing system dictated by the x and y-axes scales which are used to ensure consistency in spatial referencing. If one specifies a background image, ggplot2 doesn’t care whether the background image is a picture of Mickey Mouse or a Google Map – whatever it is it will use it to fill the plot panel. Moreover, the image will be stretched to fill the the panel window regardless of its shape, completely disregarding the native dimensions of the map, and nullifying its use. Thus, while this difficulty can likely be overcome, the methods described above fit in more naturally with the ggplot2 framework.
I would like to thank Markus Loecher of Sense Networks and Hadley Wickham of Rice University each for a series of emails over a year ago which helped me to understand how the RgoogleMaps package works and the independent training of scales used to make the weather maps possible (respectively). While the scale trick (i.e., rigging two different fill scales a priori and then forcing them onto the plot) completely violates the grammar, it is nevertheless very useful and far superior to the geom_point based alternative I had been using.
library(ggplot2)
library(ReadImages)
library(RgoogleMaps)
library(MASS)
theme_set(theme_bw())
################################################################################
#################### preload functions for later use ####################
################################################################################
ggimage <- function(image){
require(ggplot2)
if(length(dim(image)) == 2){
message('creating black and white image...')
image <- melt(image)
names(image) <- c('row','column','fill')
plot <- qplot(column, -row, data = image, geom = 'tile', fill = fill) +
scale_fill_gradient(low = 'black', high = 'white')
}
if(length(dim(image)) == 3){
message('creating color image...')
image <- apply(image, 1:2, function(v) rgb(v[1], v[2], v[3]))
image <- melt(image)
names(image) <- c('row', 'column', 'fill')
plot <- qplot(column, -row, data = image, geom = 'tile', fill = fill) +
scale_fill_identity()
}
#return(plot) # remove first pound for the image in the case study
plot +
opts(
axis.line = theme_blank(), axis.ticks = theme_blank(),
axis.text.x = theme_blank(), axis.text.y = theme_blank(),
axis.title.x = theme_blank(), axis.title.y = theme_blank(),
axis.ticks.length = unit(0, "lines"),
axis.ticks.margin = unit(0, "lines"),
legend.position = "none",
panel.background = theme_blank(),
panel.border = theme_blank(),
panel.grid.major = theme_blank(),
panel.grid.minor = theme_blank(),
panel.margin = unit(0, "lines"),
plot.background = theme_blank(),
plot.title = theme_blank(),
plot.margin = unit(c(-1, -1, -1.5, -1.5), "lines")
)
}
ggooglemap <- function(location = 'houston',
center = c(lat = 29.7632836, lon = -95.3632715), API,
type = c('color','bw')[1], rgbcoefs = c(0, 1, 0), zoom = 10,
maptype = 'terrain',
destfile = 'TemporaryMap.jpg', n_pix = 640)
{
require(ggplot2)
require(RgoogleMaps)
require(ReadImages)
if(!missing(location)){
url_string <- paste('http://maps.google.com/maps/geo?q=', location, sep = '')
site <- readLines(url(url_string))
site <- site[which(regexpr('coordinates', site) > 0)]
if(is.na(site)) stop('location geocoding error.')
site <- strsplit(site, '\\[')[[1]][2]
site <- strsplit(site, ',')[[1]][1:2]
latlon <- as.numeric(site)
center <- c(lat = latlon[2], lon = latlon[1])
closeAllConnections()
}
if(missing(API)) API <- '' # ENTER YOUR API HERE
# get map
GetMap(API, center = center[c('lat','lon')],
size = c(n_pix, n_pix), zoom = zoom, format = 'jpg',
maptype = maptype, destfile = destfile)
# load map
map <- read.jpeg(destfile)
# deal with color
if(type == 'color'){
map <- apply(map, 1:2, function(v) rgb(v[1], v[2], v[3]))
} else if(type == 'bw') {
nrow <- nrow(map)
ncol <- ncol(map)
map <- grey(rgb2grey(map, coefs = rgbcoefs))
map <- matrix(map, nrow = nrow, ncol = ncol)
} else {
stop('type must be either \'color\' or \'bw\'', call. = FALSE)
}
# reshape map for plotting
m_map <- melt(map)
names(m_map) <- c('x','y','fill')
m_map <- within(m_map,{
x <- x - n_pix/2 - 1
y <- y - n_pix/2 - 1
})
mapInfo <- list(lat = center['lat'], lon = center['lon'], zoom = zoom, map)
XY_cent <- LatLon2XY.centered(mapInfo, center['lat'], center['lon'])
#XY2LatLon(HouMapInfo, XY_cent$newX, XY_cent$newY)
# geocode pixel references
s <- (-n_pix/2) : (n_pix/2 - 1)
lat_wrapper <- function(x) XY2LatLon(mapInfo, -n_pix/2, x)[1]
lats <- apply(data.frame(s), 1, lat_wrapper)
lon_wrapper <- function(y) XY2LatLon(mapInfo, y, -n_pix/2)[2]
lons <- apply(data.frame(s), 1, lon_wrapper)
# merge colors to latlons and return
df_xy <- expand.grid(x = s, y = s)
df_ll <- expand.grid(lat = rev(lats), lon = lons)
df_xyll <- data.frame(df_xy, df_ll)
df <- suppressMessages(join(df_xyll, m_map, type = 'right'))
df <- df[,c('lon','lat','fill')]
df
}
theme_nothing <- function (base_size = 12){
structure(list(
axis.line = theme_blank(),
axis.text.x = theme_blank(), axis.text.y = theme_blank(),
axis.ticks = theme_blank(),
axis.title.x = theme_blank(), axis.title.y = theme_blank(),
axis.ticks.length = unit(0, "lines"), axis.ticks.margin = unit(0, "lines"),
legend.position = "none",
panel.background = theme_rect(fill = 'white'),
panel.border = theme_blank(),
panel.grid.major = theme_blank(), panel.grid.minor = theme_blank(),
panel.margin = unit(0, "lines"),
plot.background = theme_rect(colour = 'white'),
plot.title = theme_text(size = base_size * 1.2),
plot.margin = unit(c(-1, -1, -1.5, -1.5), "lines")
), class = "options")
}
vplayout <- function(x, y) viewport(layout.pos.row = x, layout.pos.col = y)
################################################################################
#################### hadley's picture ####################
################################################################################
hadley <- read.jpeg('hadley.jpg')
# for the axes, remove the pound sign in the ggimage function above
ggimage(hadley) + coord_equal()
################################################################################
#################### making a map ####################
################################################################################
CityHall_latlon <- c(lon = -95.369318, lat = 29.760210)
DowntownMap <-ggooglemap(center = CityHall_latlon, zoom = 14)
qplot(lon, lat, data = DowntownMap, geom = 'tile', fill = fill) +
scale_fill_identity() +
scale_x_continuous('Longitude') +
scale_y_continuous('Latitude') +
coord_equal() +
opts(title = 'Terrain Map of Downtown Houston')
################################################################################
#################### houston crime ####################
################################################################################
load('HoustonCrime.Rdata') # variable name : crime_df
# restrict to violent crimes
violent_crimes <- subset(crime_df,
offense != 'Auto Theft' & offense != 'Theft' & offense != 'Burglary'
)
# restrict to year 2010 (january - august)
violent_crimes <- subset(violent_crimes,
time >= ISOdatetime(2010, 1, 1, 0, 0, 0)
)
# grab downtown google map
CityHall_latlon <- c(lon = -95.369318, lat = 29.760210)
DowntownMap <-ggooglemap(center = CityHall_latlon, zoom = 14)
lat_range <- range(DowntownMap$lat)
lon_range <- range(DowntownMap$lon)
# contour plot
ggplot() +
geom_tile(aes(x = lon, y = lat, fill = fill), data = DowntownMap) +
scale_fill_identity() +
geom_density2d(aes(x = lon, y = lat, colour = ..level..),
bins = I(10), fill = NA, alpha = I(1/2), size = I(.75), data = violent_crimes) +
scale_colour_gradient2('Violent\nCrime\nDensity',
low = 'darkblue', mid = 'orange', high = 'red', midpoint = 900) +
scale_x_continuous('Longitude', limits = lon_range) +
scale_y_continuous('Latitude', limits = lat_range) +
opts(title = 'Violent Crime Contour Map of Downtown Houston, 2010') +
coord_equal()
# point plot
violent_crimes <- subset(violent_crimes,
lon_range[1] <= lon & lon <= lon_range[2] &
lat_range[1] <= lat & lat <= lat_range[2]
) # cuts out the warning for missing values in geom_point
ggplot() +
geom_tile(aes(x = lon, y = lat, fill = fill), data = DowntownMap) +
scale_fill_identity() +
geom_jitter(aes(x = lon, y = lat, colour = offense, size = offense),
fill = NA, alpha = I(3/4), data = violent_crimes,
position = position_jitter(width = .001, height = .001)) +
scale_x_continuous('Longitude', limits = lon_range) +
scale_y_continuous('Latitude', limits = lat_range) +
scale_colour_discrete('') +
scale_size_manual('', values = c(
'Robbery' = 2, 'Aggravated Assault' = 2.5, 'Rape' = 3, 'Murder' = 4
)) +
opts(title = 'Violent Crime Map of Downtown Houston, 2010') +
coord_equal()
# weather map
load('HoustonCrime.Rdata') # variable name : crime_df
violent_crimes <- subset(crime_df,
offense != 'Auto Theft' & offense != 'Theft' & offense != 'Burglary'
)
violent_crimes <- subset(violent_crimes,
time >= ISOdatetime(2010, 1, 1, 0, 0, 0)
)
CityHall_latlon <- c(lon = -95.369318, lat = 29.760210)
DowntownMap <-ggooglemap(center = CityHall_latlon, zoom = 14, maptype = 'hybrid')
lat_range <- range(DowntownMap$lat)
lon_range <- range(DowntownMap$lon)
vclatlon <- violent_crimes[,c('lon','lat')]
vclatlon <- na.omit(violent_crimes[,c('lon','lat')])
vclatlon <- subset(vclatlon,
lon_range[1] <= lon & lon <= lon_range[2] &
lat_range[1] <= lat & lat <= lat_range[2]
)
den <- kde2d(vclatlon$lon, vclatlon$lat, n = 320,
lims = c(lon_range, lat_range))
kde_df <- expand.grid(
lon = seq.int(lon_range[1], lon_range[2], length.out = 320),
lat = seq.int(lat_range[1], lat_range[2], length.out = 320)
)
kde_df$density <- melt(den$z)$value
summary(kde_df$density)
den_fill_scale <- scale_colour_gradient2(low = 'white', mid = 'darkgreen',
high = 'red', midpoint = 225)
den_fill_scale$train(kde_df$density, T)
kde_df$density_s <- den_fill_scale$map(kde_df$density)
kde_df$density_zeroone <- pmin(kde_df$density / max(kde_df$density), .9)
big_plot <- ggplot() +
geom_tile(aes(x = lon, y = lat, fill = fill), data = DowntownMap) +
geom_tile(aes(x = lon, y = lat, fill = density_s, alpha = density_zeroone),
data = kde_df) +
scale_x_continuous('Longitude', limits = lon_range) +
scale_y_continuous('Latitude', limits = lat_range) +
scale_alpha(to = c(0, .9)) +
scale_fill_identity() +
opts(
legend.position = 'none',
title = 'Violent Crime Weather Map of Downtown Houston, 2010'
) +
coord_equal()
little_plot <- ggplot() +
geom_tile(aes(x = lon, y = lat, fill = density_s, alpha = density_zeroone),
data = kde_df) +
scale_alpha(to = c(0, .9)) +
scale_fill_identity() +
coord_equal() +
theme_nothing()
grid.newpage()
pushViewport(viewport(layout = grid.layout(4,4)))
print(big_plot, vp = vplayout(1:4, 1:4))
print(little_plot, vp = vplayout(4, 4))
grid.newpage()
pushViewport(viewport(layout = grid.layout(1000,1000)))
print(big_plot, vp = vplayout(1:1000, 1:1000))
print(little_plot, vp = vplayout(697:897, 740:940))