Introduction to Data Analysis

10.2. Choropleth maps

The code in this section builds up to choropleth maps of Barack Obama's vote shares in the 2008 and 2012 U.S. presidential elections. The steps leading to the final result include the usual data import routines as well as a few replication plots. Unlike our first map, we do not wrap the plots in a dedicated plot function, so as to leave the full ggplot2 syntax apparent.

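# mapproj is required by coord_map() in the final maps.
packages <- c("downloader", "ggplot2", "mapproj", "maps", "RColorBrewer", "scales", "XML")
packages <- lapply(packages, FUN = function(x) {
  # Install the package only if it cannot be loaded.
  if (!require(x, character.only = TRUE)) {
    install.packages(x)
    library(x, character.only = TRUE)
  }
})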


Two-party vote shares in the United States

This example is based on a map by David Sparks. It shows how to map U.S. state-level electoral data collected by David Wasserman. The next code block downloads the data and imports a selection of rows and columns from the CSV file. The only noteworthy step is that we remove the District of Columbia to avoid skewing later distributions.

# David Wasserman's data.
mode = "wb")
# Import selected rows.
dw <- read.csv(file, stringsAsFactors = FALSE, skip = 4)[-c(1:6, 19:20, 28), -c(4, 7:8)]
# Check result.
str(dw)

'data.frame':   50 obs. of  5 variables:
 $ State     : chr  "Colorado*" "Florida*" "Iowa*" "Michigan*" ...
 $ Obama..08 : chr  "1,288,633" "4,282,074" "828,940" "2,872,579" ...
 $ McCain..08: chr  "1,073,629" "4,045,624" "682,379" "2,048,639" ...
 $ Obama..12 : chr  "1,323,101" "4,237,756" "822,544" "2,564,569" ...
 $ Romney..12: chr  "1,185,243" "4,163,447" "730,617" "2,115,256" ...

The next code block cleans the dataset by removing punctuation and other text symbols from the variable names and values. It then adds a logical marker to single out the first twelve states in the data, which are treated as swing (“battleground”) states in the plots. The list of swing states used here is slightly different from the one used by Simon Jackman.

# Fix dots in names.
names(dw) <- gsub("\\.\\.", "", names(dw))
# Remove characters.
dw <- data.frame(gsub("\\*|,|%", "", as.matrix(dw)), stringsAsFactors = FALSE)
# Make data numeric, column by column (as.numeric() parses the digit strings).
dw[-1] <- lapply(dw[-1], as.numeric)
# Create marker for swing states.
dw$Swing <- FALSE
# Mark first twelve states.
dw$Swing[1:12] <- TRUE
# Check result.
dw[1:15, ]

            State Obama08 McCain08 Obama12 Romney12 Swing
7        Colorado 1288633  1073629 1323101  1185243  TRUE
8         Florida 4282074  4045624 4237756  4163447  TRUE
9            Iowa  828940   682379  822544   730617  TRUE
10       Michigan 2872579  2048639 2564569  2115256  TRUE
11      Minnesota 1573354  1275409 1546167  1320225  TRUE
12         Nevada  533736   412827  531373   463567  TRUE
13  New Hampshire  384826   316534  369561   329918  TRUE
14 North Carolina 2142651  2128474 2178391  2270395  TRUE
15           Ohio 2940044  2677820 2827710  2661433  TRUE
16   Pennsylvania 3276363  2655885 2990274  2680434  TRUE
17       Virginia 1959532  1725005 1971820  1822522  TRUE
18      Wisconsin 1677211  1262393 1620985  1407966  TRUE
21        Alabama  813479  1266546  795696  1255925 FALSE
22         Alaska  123594   193841  122640   164676 FALSE
23        Arizona 1034707  1230111 1025232  1233654 FALSE

We imported the data without the precalculated two-party vote shares (VS), in order to run the formulas on our end. The first variable codes a state “blue” or “red” based on party victory in 2008. The last variable measures the size of the swing that would have been required for a given state to change hands (for Romney to win).

# Obama victory margins, using two-party vote.
dw <- within(dw, {
  State_Color <- ifelse(Obama08 > McCain08, "Blue", "Red")
  # Margin in 2008.
  Total_VS_08 <- Obama08 + McCain08
  Obama_VS_08 <- 100 * Obama08 / Total_VS_08
  # Margin in 2012.
  Total_VS_12 <- Obama12 + Romney12
  Obama_VS_12 <- 100 * Obama12 / Total_VS_12
  # Obama swing in two-party vote share.
  Obama_Swing <- Obama_VS_12 - Obama_VS_08
  # Swing required for state to change hands.
  Rep_Wins <- 100 * (Romney12 - Obama12) / Total_VS_12
})
# Check results.
str(dw)

'data.frame':   50 obs. of  13 variables:
 $ State      : chr  "Colorado" "Florida" "Iowa" "Michigan" ...
 $ Obama08    : num  1288633 4282074 828940 2872579 1573354 ...
 $ McCain08   : num  1073629 4045624 682379 2048639 1275409 ...
 $ Obama12    : num  1323101 4237756 822544 2564569 1546167 ...
 $ Romney12   : num  1185243 4163447 730617 2115256 1320225 ...
 $ Swing      : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ Rep_Wins   : num  -5.496 -0.885 -5.919 -9.601 -7.882 ...
 $ Obama_Swing: num  -1.803 -0.977 -1.889 -3.571 -1.288 ...
 $ Obama_VS_12: num  52.7 50.4 53 54.8 53.9 ...
 $ Total_VS_12: num  2508344 8401203 1553161 4679825 2866392 ...
 $ Obama_VS_08: num  54.6 51.4 54.8 58.4 55.2 ...
 $ Total_VS_08: num  2362262 8327698 1511319 4921218 2848763 ...
 $ State_Color: chr  "Blue" "Blue" "Blue" "Blue" ...


We now plot a first overview of the data, based on Simon Jackman's analysis of similar figures. The list of swing states and the actual swings are slightly different in his visualization, and our rendering uses a single x-axis scale because ggplot2 enforces Stephen Few's recommendation against dual scales.

# Order states by swing.
dw$State <- with(dw, reorder(State, Obama_Swing))
# Dot plot.
ggplot(dw, aes(y = State, x = Obama_Swing)) +
  geom_vline(xintercept = c(0, mean(dw$Obama_Swing)), size = 4, color = "grey95") +
  geom_point(aes(colour = ifelse(Obama08 > McCain08, "Dem", "Rep")), size = 5) +
  geom_point(data = subset(dw, Swing), aes(x = Rep_Wins), size = 5, shape = 1) +
  scale_x_continuous(breaks = -10:4) +
  scale_colour_manual("2008", values = brewer.pal(3, "Set1")[c(2, 1)]) +
  labs(y = NULL, x = NULL, title = "Obama Swing in Two Party Vote Share\n")


In the plot above, the first grey line is the average swing in vote share in 2012, and the second one marks the zero-swing point. The black points are the theoretical swing points at which the battleground states would have gone Republican. Read Simon Jackman's analysis for more on the topic.
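The vertical ordering of the states in the dot plot comes from reorder(), which sorts the levels of a factor by a summary (the mean, by default) of a second variable. A minimal sketch on made-up values follows; the vectors `state` and `swing` below are illustrative and not taken from the data.

```r
# Hypothetical swings for three states, for illustration only.
state <- c("A", "B", "C")
swing <- c(-1.5, 2.0, -3.0)
# reorder() returns a factor whose levels are sorted by increasing swing.
f <- reorder(state, swing)
# Levels now read "C" "A" "B", from the lowest swing to the highest.
levels(f)
```

Because ggplot2 draws discrete axes in level order, the state with the lowest swing ends up at the bottom of the y-axis.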

Let's also replicate Simon Jackman's second plot, showing the swing from 2008 to 2012 against Obama's vote share in 2008 weighted by electoral college votes. The first step for this plot is to get the number of electoral college voters per state as well as state abbreviations, both from Wikipedia. The data are merged to the principal data frame.

# Electoral college votes, 2012.
url = "http://en.wikipedia.org/wiki/Electoral_College_(United_States)"
# Extract fourth table.
college <- readHTMLTable(url, which = 4, stringsAsFactors = FALSE)
# Keep first and last columns, removing total electors.
college <- data.frame(State = college[, 1], College = as.numeric(college[, 35]))
# Merge to main dataset.
dw <- merge(dw, college, by = "State")
# U.S. states codes.
url = "http://en.wikipedia.org/wiki/List_of_U.S._states"
# Extract first table.
uscodes <- readHTMLTable(url, which = 1, stringsAsFactors = FALSE)
# Keep state names (minus bracketed footnote markers) and the fourth column.
uscodes <- data.frame(State = gsub("\\[[A-Z]+\\]", "", uscodes[, 1]),
                      Abbreviation = uscodes[, 4])
# Merge to main dataset.
dw <- merge(dw, uscodes, by = "State")
# Check result.
str(dw)

'data.frame':   50 obs. of  15 variables:
 $ State       : Factor w/ 50 levels "Utah","West Virginia",..: 42 50 43 25 41 27 19 16 36 29 ...
 $ Obama08     : num  813479 123594 1034707 422310 8274473 ...
 $ McCain08    : num  1266546 193841 1230111 638017 5011781 ...
 $ Obama12     : num  795696 122640 1025232 394409 7854285 ...
 $ Romney12    : num  1255925 164676 1233654 647744 4839958 ...
 $ Swing       : logi  FALSE FALSE FALSE FALSE FALSE TRUE ...
 $ Rep_Wins    : num  22.43 14.63 9.23 24.31 -23.75 ...
 $ Obama_Swing : num  -0.325 3.749 -0.299 -1.983 -0.406 ...
 $ Obama_VS_12 : num  38.8 42.7 45.4 37.8 61.9 ...
 $ Total_VS_12 : num  2051621 287316 2258886 1042153 12694243 ...
 $ Obama_VS_08 : num  39.1 38.9 45.7 39.8 62.3 ...
 $ Total_VS_08 : num  2080025 317435 2264818 1060327 13286254 ...
 $ State_Color : chr  "Red" "Red" "Red" "Red" ...
 $ College     : num  9 3 11 6 55 9 7 3 29 16 ...
 $ Abbreviation: Factor w/ 49 levels "Albuquerque",..: 6 2 41 28 29 17 9 49 25 3 ...

The final plot follows. It confirms that Obama won by protecting the battleground states, losing only two states in the overall swing. Also, if you have not started reading Nate Silver's blog yet, now might be the time, starting with swing voters versus elastic states.

# Swing vs. Vote Share, weighted by Electoral College Votes.
ggplot(dw, aes(y = Obama_Swing, x = Obama_VS_08)) +
  geom_rect(xmin = 50, xmax = Inf, ymin = -Inf, ymax = Inf, alpha = 0.3, fill = "grey95") +
  geom_point(aes(color = Romney12 > Obama12, size = College), alpha = 0.6) +
  geom_text(colour = "white", label = ifelse(dw$Swing, as.character(dw$Abbreviation), NA)) +
  scale_colour_manual("2008", values = brewer.pal(3, "Set1")[c(2, 1)]) +
  scale_size_area(max_size = 42) +
  labs(y = "Obama Swing in Two Party Vote Share", x = "Obama 2008 Vote Share") +
  theme(legend.position = "none")

Mapping the electoral swing

To map the swing from 2008 to 2012, we load U.S. geographical data and extract its state names. The method corresponds to the code provided by David Sparks: it takes a map object provided in the maps package and adds the variables of interest to it, using the region variable as the unique identifier for U.S. states.

# Load state shapefile from maps.
states.data <- map("state", plot = FALSE, fill = TRUE)
# Convert shapes to a data frame.
states.data <- fortify(states.data)
# Extract states from data frame.
states.list <- sort(unique(states.data$region))
# Exclude Washington D.C. (sorry).
states.list = states.list[-which(grepl("columbia", states.list))]
# Subset to map states (sorry Alaska).
dw = subset(dw, tolower(State) %in% states.list)
# Transpose data to map dataset.
states.data$SwingBO <- by(dw$Obama_Swing, states.list, mean)[states.data$region]
states.data$Obama08 <- by(dw$Obama_VS_08, states.list, mean)[states.data$region]
states.data$Obama12 <- by(dw$Obama_VS_12, states.list, mean)[states.data$region]

The plot is going to show quintiles of the Obama swing from 2008 to 2012. To make the code shorter, the quintiles are calculated by a short function, and the plots use a common ggplot2 structure. Most of the graph options are set to make the plot blank. The coordinates of the plot use a conic projection to curve the map correctly.

# Choropleth map function.
ggchoro <- function(x, q = 5, title = NULL) {
  # Select the variable to map and cut it into q quantile groups.
  x = states.data[, x]
  states.data$q = cut(x,
                      breaks = quantile(round(x), probs = 0:q/q, na.rm = TRUE),
                      include.lowest = TRUE)
  ggplot(states.data, aes(x = long, y = lat, group = group, fill = q)) +
    geom_polygon(colour = "white") +
    coord_map(projection = "conic", lat0 = 30) +
    scale_fill_brewer("", palette = "RdYlBu") +
    labs(y = NULL, x = NULL, title = title) +
    # fill = NA keeps the border from painting over the map.
    theme(panel.border = element_rect(color = "white", fill = NA),
          axis.text = element_blank(),
          axis.ticks = element_blank())
}
# Choropleth maps.
ggchoro("SwingBO", title = "Swing in the Obama vote share")


ggchoro("Obama08", title = "Obama vote share, 2008")


ggchoro("Obama12", title = "Obama vote share, 2012")
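The two data steps behind these maps, copying one value per state onto every polygon row and then binning the values into quintiles, can be sketched on toy data. Everything below (the vector `swing`, its values, and the short `region` vector standing in for the polygon rows) is hypothetical and only meant to illustrate the technique. Unlike the function above, this sketch computes the breaks on the state-level values rather than on the polygon rows, which is a design choice of the sketch.

```r
# Hypothetical state-level swings, named by lower-case state name.
swing <- c(alabama = -0.3, arizona = -0.4, colorado = -1.8,
           florida = -1.0, iowa = -1.9)
# Polygon rows repeat each region many times; five rows stand in for them.
region <- c("colorado", "colorado", "iowa", "alabama", "district of columbia")
# Indexing a named vector by name copies each state's value to its rows;
# regions without data (here D.C.) get NA.
value <- unname(swing[region])
# Quintile breaks: six cut points delimiting five groups.
breaks <- quantile(swing, probs = 0:5 / 5, na.rm = TRUE)
# Assign each row to its quintile; include.lowest keeps the minimum value.
bins <- cut(value, breaks = breaks, include.lowest = TRUE)
table(bins, useNA = "ifany")
```

Rows left as NA, such as the District of Columbia here, receive no quintile and are drawn in the default missing-value colour of the fill scale.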


David Sparks has also posted code for simpler maps with less data wrangling, and choropleth maps with more precise data. In the future, there's a chance that he will also post his code for isarithmic maps.