Introduction to Data Analysis

10.2. Choropleth maps

The code in this section ends with examples of choropleth maps of Barack Obama's vote shares in the 2008 and 2012 U.S. presidential elections. The operations covered before the final result include the usual data import routines as well as a few replication plots. Unlike in our first map, we do not hide the plotting code away, so as to leave the full ggplot2 syntax apparent.

packages <- c("downloader", "ggplot2", "maps", "RColorBrewer", "scales", "XML")
packages <- lapply(packages, FUN = function(x) {
    if (!require(x, character.only = TRUE)) {
        install.packages(x)
        library(x, character.only = TRUE)
    }
})

Two-party vote shares in the United States

This example is based on a map by David Sparks. It shows how to map U.S. state-level electoral data collected by David Wasserman. The next code block starts by downloading the data and importing a selection of rows and columns from the CSV file. The only noteworthy step is that we remove the District of Columbia, an outlier that would otherwise skew later distributions.

# David Wasserman's data.
link = "https://docs.google.com/spreadsheet/pub?key=0AjYj9mXElO_QdHpla01oWE1jOFZRbnhJZkZpVFNKeVE&gid=0&output=csv"
# Download the spreadsheet.
if (!file.exists(file <- "data/wasserman.votes.0812.csv")) {
    download(link, file, mode = "wb")
}
# Import selected rows.
dw <- read.csv(file, stringsAsFactors = FALSE, skip = 4)[-c(1:6, 19:20, 28), 
    -c(4, 7:8)]
# Check result.
str(dw)
'data.frame':   50 obs. of  5 variables:
 $ State     : chr  "Colorado*" "Florida*" "Iowa*" "Michigan*" ...
 $ Obama..08 : chr  "1,288,633" "4,282,074" "828,940" "2,872,579" ...
 $ McCain..08: chr  "1,073,629" "4,045,624" "682,379" "2,048,639" ...
 $ Obama..12 : chr  "1,323,101" "4,237,756" "822,544" "2,564,569" ...
 $ Romney..12: chr  "1,185,243" "4,163,447" "730,617" "2,115,256" ...
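The import step combines two things: `skip` drops leading lines before the header, and negative indices drop unwanted rows and columns afterwards. Here is a minimal sketch of that pattern on a made-up CSV string (the `toy` data and its column names are hypothetical and unrelated to the Wasserman file):

```r
# Hypothetical CSV with two junk lines above the header.
csv <- "junk line 1
junk line 2
State,A,B,C
X,1,2,3
Y,4,5,6
Z,7,8,9"
# Skip the two junk lines, then drop row 2 and column 3 by negative indexing.
toy <- read.csv(text = csv, stringsAsFactors = FALSE, skip = 2)[-2, -3]
toy
```

Negative indices count positions after the import, so they have to be checked again whenever the layout of the source file changes.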

The next code block cleans the dataset by removing punctuation and other text symbols from the variable names and values. It then adds a logical marker to discriminate the first twelve states in the data, which are treated as swing (“battleground”) states in the plots. The list of 'swing states' used is slightly different from the one used by Simon Jackman.

# Fix dots in names.
names(dw) <- gsub("\\.\\.", "", names(dw))
# Remove characters.
dw <- data.frame(gsub("\\*|,|%", "", as.matrix(dw)), stringsAsFactors = FALSE)
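# Make data numeric (as.numeric parses the digits; in R >= 4.0.0,
# data.matrix would turn character columns into factor codes instead).
dw[, -1] <- lapply(dw[, -1], as.numeric)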
# Create marker for swing states.
dw$Swing <- FALSE
# Mark first twelve states.
dw$Swing[1:12] <- TRUE
# Check result.
dw[1:15, ]
            State Obama08 McCain08 Obama12 Romney12 Swing
7        Colorado 1288633  1073629 1323101  1185243  TRUE
8         Florida 4282074  4045624 4237756  4163447  TRUE
9            Iowa  828940   682379  822544   730617  TRUE
10       Michigan 2872579  2048639 2564569  2115256  TRUE
11      Minnesota 1573354  1275409 1546167  1320225  TRUE
12         Nevada  533736   412827  531373   463567  TRUE
13  New Hampshire  384826   316534  369561   329918  TRUE
14 North Carolina 2142651  2128474 2178391  2270395  TRUE
15           Ohio 2940044  2677820 2827710  2661433  TRUE
16   Pennsylvania 3276363  2655885 2990274  2680434  TRUE
17       Virginia 1959532  1725005 1971820  1822522  TRUE
18      Wisconsin 1677211  1262393 1620985  1407966  TRUE
21        Alabama  813479  1266546  795696  1255925 FALSE
22         Alaska  123594   193841  122640   164676 FALSE
23        Arizona 1034707  1230111 1025232  1233654 FALSE
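The cleaning logic can be tried on a two-row toy data frame. This is only a sketch: the `raw` object is made up for illustration, and the thousands separators in the Alabama value are assumed rather than copied from the raw file.

```r
# Toy data mimicking the raw import: starred state names, comma separators.
raw <- data.frame(State = c("Colorado*", "Alabama"),
                  Obama..08 = c("1,288,633", "813,479"),
                  stringsAsFactors = FALSE)
# Remove the double dots that read.csv put into the column names.
names(raw) <- gsub("\\.\\.", "", names(raw))
# Strip the asterisk from state names.
raw$State <- gsub("\\*", "", raw$State)
# Strip thousands separators, then convert the text to numbers.
raw$Obama08 <- as.numeric(gsub(",", "", raw$Obama08))
str(raw)
```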

We imported the data without the precalculated two-party vote shares (VS) in order to compute them ourselves. The first variable created in the code, State_Color, codes a state “blue” or “red” based on the winning party in 2008. The last one, Rep_Wins, is Romney's two-party margin in 2012 in percentage points, which we use to measure the size of the swing that would have been required for a given state to change hands (for Romney to win).

# Obama victory margins, using two-party vote.
dw <- within(dw, {
    State_Color <- ifelse(Obama08 > McCain08, "Blue", "Red")
    # Margin in 2008.
    Total_VS_08 <- Obama08 + McCain08
    Obama_VS_08 <- 100 * Obama08/Total_VS_08
    # Margin in 2012.
    Total_VS_12 <- Obama12 + Romney12
    Obama_VS_12 <- 100 * Obama12/Total_VS_12
    # Obama swing in two-party vote share.
    Obama_Swing <- Obama_VS_12 - Obama_VS_08
    # Swing required for state to change hands.
    Rep_Wins <- 100 * (Romney12 - Obama12)/Total_VS_12
})
# Check results.
str(dw)
'data.frame':   50 obs. of  13 variables:
 $ State      : chr  "Colorado" "Florida" "Iowa" "Michigan" ...
 $ Obama08    : num  1288633 4282074 828940 2872579 1573354 ...
 $ McCain08   : num  1073629 4045624 682379 2048639 1275409 ...
 $ Obama12    : num  1323101 4237756 822544 2564569 1546167 ...
 $ Romney12   : num  1185243 4163447 730617 2115256 1320225 ...
 $ Swing      : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ Rep_Wins   : num  -5.496 -0.885 -5.919 -9.601 -7.882 ...
 $ Obama_Swing: num  -1.803 -0.977 -1.889 -3.571 -1.288 ...
 $ Obama_VS_12: num  52.7 50.4 53 54.8 53.9 ...
 $ Total_VS_12: num  2508344 8401203 1553161 4679825 2866392 ...
 $ Obama_VS_08: num  54.6 51.4 54.8 58.4 55.2 ...
 $ Total_VS_08: num  2362262 8327698 1511319 4921218 2848763 ...
 $ State_Color: chr  "Blue" "Blue" "Blue" "Blue" ...
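As a check on the formulas, the Colorado figures can be reproduced by hand from the vote counts printed earlier. The variable names below are local to this sketch:

```r
# Colorado vote counts, as printed in the cleaned data above.
obama08 <- 1288633; mccain08 <- 1073629
obama12 <- 1323101; romney12 <- 1185243
# Two-party vote shares, in percent.
vs08 <- 100 * obama08 / (obama08 + mccain08)
vs12 <- 100 * obama12 / (obama12 + romney12)
# Swing in Obama's two-party share from 2008 to 2012.
swing <- vs12 - vs08
# Romney's two-party margin in 2012 (negative: Obama carried the state).
rep_wins <- 100 * (romney12 - obama12) / (obama12 + romney12)
round(c(vs08 = vs08, vs12 = vs12, swing = swing, rep_wins = rep_wins), 3)
```

The rounded results agree with the first entries of Obama_VS_08, Obama_VS_12, Obama_Swing and Rep_Wins in the output above. Since the two shares sum to 100, the margin is also equal to 100 minus twice Obama's 2012 share.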

We now plot a first overview of the data, based on Simon Jackman's analysis of similar figures. The list of swing states and the actual swings are slightly different in his visualization, and only one \(x\)-axis is scaled in our rendering because ggplot2 enforces the Stephen Few recommendation against dual scales.

# Order plot by states.
dw$State <- with(dw, reorder(State, Obama_Swing))
# Dot plot.
ggplot(dw, aes(y = State, x = Obama_Swing)) +
  geom_vline(xintercept = c(0, mean(dw$Obama_Swing)), size = 4, color = "grey95") +
  geom_point(aes(colour = ifelse(Obama08 > McCain08, "Dem", "Rep")), size = 5) +
  geom_point(data = subset(dw, Swing), aes(x = Rep_Wins), size = 5, shape = 1) +
  scale_x_continuous(breaks = -10:4) +
  scale_colour_manual("2008", values = brewer.pal(3, "Set1")[c(2, 1)]) +
  labs(y = NULL, x = NULL, title = "Obama Swing in Two Party Vote Share\n")

[Plot: Obama Swing in Two Party Vote Share, dot plot by state]

In the plot above, the first grey line is the average swing in vote share in 2012, and the second one marks the zero-swing point. The hollow black circles are the theoretical swing points at which the battleground states would have gone Republican. Read Simon Jackman's analysis for more on the topic.
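The vertical ordering of the states comes from `reorder()`, which turns a vector into a factor whose levels are sorted by a second variable. A minimal sketch with three of the swing values shown earlier:

```r
# Three states and their swings, taken from the output above.
state <- c("Colorado", "Florida", "Michigan")
swing <- c(-1.803, -0.977, -3.571)
# Levels are sorted by increasing swing; ggplot2 draws them bottom to top.
f <- reorder(state, swing)
levels(f)
```

Because ggplot2 places discrete values in level order, the state with the largest drop ends up at the bottom of the dot plot.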

Let's also replicate Simon Jackman's second plot, showing the swing from 2008 to 2012 against Obama's vote share in 2008, weighted by electoral college votes. The first step for this plot is to get the number of electors per state as well as state abbreviations, both from Wikipedia. The data are merged into the principal data frame.

# Electoral college votes, 2012.
url = "http://en.wikipedia.org/wiki/Electoral_College_(United_States)"
# Extract fourth table.
college <- readHTMLTable(url, which = 4, stringsAsFactors = FALSE)
# Keep state names (column 1) and elector counts (column 35);
# the row of total electors has no match and drops out in the merge.
college <- data.frame(State = college[, 1], College = as.numeric(college[, 35]))
# Merge to main dataset.
dw <- merge(dw, college, by = "State")
# U.S. states codes.
url = "http://en.wikipedia.org/wiki/List_of_U.S._states"
# Extract first table.
uscodes <- readHTMLTable(url, which = 1, stringsAsFactors = FALSE)
# Keep state names (minus footnote marks) and the fourth column.
# Column positions depend on the page layout, so check the result.
uscodes <- data.frame(State = gsub("\\[[A-Z]+\\]", "", uscodes[, 1]),
                      Abbreviation = uscodes[, 4])
# Merge to main dataset.
dw <- merge(dw, uscodes, by = "State")
# Check result.
str(dw)
'data.frame':   50 obs. of  15 variables:
 $ State       : Factor w/ 50 levels "Utah","West Virginia",..: 42 50 43 25 41 27 19 16 36 29 ...
 $ Obama08     : num  813479 123594 1034707 422310 8274473 ...
 $ McCain08    : num  1266546 193841 1230111 638017 5011781 ...
 $ Obama12     : num  795696 122640 1025232 394409 7854285 ...
 $ Romney12    : num  1255925 164676 1233654 647744 4839958 ...
 $ Swing       : logi  FALSE FALSE FALSE FALSE FALSE TRUE ...
 $ Rep_Wins    : num  22.43 14.63 9.23 24.31 -23.75 ...
 $ Obama_Swing : num  -0.325 3.749 -0.299 -1.983 -0.406 ...
 $ Obama_VS_12 : num  38.8 42.7 45.4 37.8 61.9 ...
 $ Total_VS_12 : num  2051621 287316 2258886 1042153 12694243 ...
 $ Obama_VS_08 : num  39.1 38.9 45.7 39.8 62.3 ...
 $ Total_VS_08 : num  2080025 317435 2264818 1060327 13286254 ...
 $ State_Color : chr  "Red" "Red" "Red" "Red" ...
 $ College     : num  9 3 11 6 55 9 7 3 29 16 ...
 $ Abbreviation: Factor w/ 49 levels "Albuquerque",..: 6 2 41 28 29 17 9 49 25 3 ...
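Scraped columns shift whenever the source page is edited: in the output above, the first Abbreviation level reads "Albuquerque", which suggests that the fourth column did not hold postal codes when this page was rendered. As an alternative sketch, and not the method used in this section, base R ships the `state.name` and `state.abb` vectors, which can be combined into a lookup:

```r
# Built-in datasets: 50 state names and their two-letter postal codes.
abbr <- setNames(state.abb, state.name)
# Look up abbreviations by full state name.
abbr[c("Colorado", "New Hampshire")]
```

With the main data frame, the same lookup could be written `abbr[as.character(dw$State)]`, assuming the state names match the spelling in `state.name`.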

The final plot follows. It confirms that Obama won by protecting the battleground states, losing only two states in the overall swing. Also, if you have not started reading Nate Silver's blog yet, now might be the time, starting with swing voters versus elastic states.

# Swing vs. Vote Share, weighted by Electoral College Votes.
ggplot(dw, aes(y = Obama_Swing, x = Obama_VS_08)) +
  geom_rect(xmin = 50, xmax = Inf, ymin = -Inf, ymax = Inf,
            alpha = 0.3, fill = "grey95") +
  geom_point(aes(color = Romney12 > Obama12, size = College), alpha = 0.6) +
  geom_text(colour = "white",
            label = ifelse(dw$Swing, as.character(dw$Abbreviation), NA)) +
  scale_colour_manual("2008", values = brewer.pal(3, "Set1")[c(2, 1)]) +
  scale_size_area(max_size = 42) +
  labs(y = "Obama Swing in Two Party Vote Share",
       x = "Obama 2008 Vote Share") +
  theme(legend.position = "none")

[Plot: Obama swing in two-party vote share against Obama 2008 vote share, points sized by Electoral College votes]

Mapping the electoral swing

To map the swing from 2008 to 2012, we load U.S. geographical data and extract its state names. The method corresponds to the code provided by David Sparks: it takes a map object provided in the maps package and adds the variables of interest to it, using the region variable as the unique identifier for U.S. states.

# Load state shapefile from maps.
states.data <- map("state", plot = FALSE, fill = TRUE)
# Convert shapes to a data frame.
states.data <- fortify(states.data)
# Extract states from data frame.
states.list <- sort(unique(states.data$region))
# Exclude Washington D.C. (sorry).
states.list = states.list[!grepl("columbia", states.list)]
# Subset to map states (sorry Alaska and Hawaii).
dw = subset(dw, tolower(State) %in% states.list)
# Transfer data to map dataset.
states.data$SwingBO <- by(dw$Obama_Swing, states.list, mean)[states.data$region]
states.data$Obama08 <- by(dw$Obama_VS_08, states.list, mean)[states.data$region]
states.data$Obama12 <- by(dw$Obama_VS_12, states.list, mean)[states.data$region]
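The transfer above works because the rows of dw and the sorted state list are in the same alphabetical order. A way to avoid relying on row order is a named-vector lookup, sketched here on toy objects (`votes` and `shapes` are hypothetical stand-ins for dw and states.data):

```r
# Toy stand-ins: one row per state, and several polygon points per region.
votes <- data.frame(State = c("Iowa", "Colorado"),
                    Obama_Swing = c(-1.889, -1.803),
                    stringsAsFactors = FALSE)
shapes <- data.frame(region = c("iowa", "iowa", "colorado",
                                "district of columbia"),
                     stringsAsFactors = FALSE)
# Name each value by its lowercased state, then index by region name.
lookup <- setNames(votes$Obama_Swing, tolower(votes$State))
shapes$SwingBO <- unname(lookup[shapes$region])
shapes
```

Regions without a match, such as the District of Columbia here, receive NA, and the row order of the vote data no longer matters.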

The plot is going to show quintiles of the Obama swing from 2008 to 2012. To make the code shorter, the quintiles are calculated by a short function, and the plots use a common ggplot2 structure. Most of the graph options are set to make the plot blank. The coordinates of the plot are made conic to curve the map correctly.

# Choropleth map function.
ggchoro <- function(x, q = 5, title = NULL) {
  x = states.data[, x]
  states.data$q = cut(x, breaks = quantile(round(x), 
                                           probs = 0:q/q, 
                                           na.rm = TRUE),
                      include.lowest = TRUE)
  ggplot(states.data, 
         aes(x = long, 
             y = lat, 
             group = group, 
             fill = q)) +
    geom_polygon(colour = "white") + 
    coord_map(projection = "conic", lat0 = 30) +
    scale_fill_brewer("", palette = "RdYlBu") +
    labs(y = NULL, x = NULL, title = title) +
    theme(panel.border = element_rect(color = "white", fill = NA), 
          axis.text = element_blank(),
          axis.ticks = element_blank()) 
}
# Choropleth maps.
ggchoro("SwingBO", title = "Swing in the Obama vote share")

[Map: Swing in the Obama vote share]

ggchoro("Obama08", title = "Obama vote share, 2008")

[Map: Obama vote share, 2008]

ggchoro("Obama12", title = "Obama vote share, 2012")

[Map: Obama vote share, 2012]
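The binning inside ggchoro combines `quantile()` and `cut()`. The sketch below runs the same idea on made-up numbers (`x` and `y` are toy vectors, not vote shares). The second half illustrates a caveat: when the breaks are computed on rounded values but the unrounded values are cut, as in ggchoro, extremes that fall outside the rounded range come back as NA.

```r
# Toy values standing in for a vote-share column.
x <- 1:10
# Quintile breaks: the 0%, 20%, ..., 100% quantiles of x.
breaks <- quantile(x, probs = 0:5/5, na.rm = TRUE)
# include.lowest = TRUE keeps the minimum inside the first bin.
q <- cut(x, breaks = breaks, include.lowest = TRUE)
table(q)

# Breaks from rounded values: 36.6 and 61.4 fall outside [37, 61].
y <- c(36.6, 41.2, 48.9, 52.3, 55.1, 61.4)
lost <- cut(y, breaks = quantile(round(y), probs = 0:5/5),
            include.lowest = TRUE)
sum(is.na(lost))
```

On a map, such NA values would show up as states drawn in the default missing-value colour rather than in one of the five quintile colours.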

David Sparks has also posted code for simpler maps with less data wrangling, and choropleth maps with more precise data. In the future, there's a chance that he will also post his code for isarithmic maps that look like this:

Isarithmic map, by David Sparks: http://dsparks.wordpress.com/2011/10/24/isarithmic-maps-of-public-opinion-data/

Next week: Networks.