Introduction to Data Analysis

4.1. Imports and exports

This section introduces data import and export (or “I/O”, for “input/output”) with R.

# Load packages, installing them first if needed.
packages <- c("countrycode", "downloader", "foreign", "ggplot2", "plyr", "RCurl", 
    "XML")
packages <- lapply(packages, FUN = function(x) {
    if (!require(x, character.only = TRUE)) {
        install.packages(x)
        library(x, character.only = TRUE)
    }
})

Downloading a Google Spreadsheet

We are going to use the Daily Kos Elections' presidential results by congressional district for the 2012 and 2008 elections, for which the data is accessible as a Google spreadsheet. The Google Docs API documentation tells us that we can get the data in CSV format through a URL request that includes the identifier key of the spreadsheet and the format specification output=csv.

# Create a filename for the dataset.
file = "data/dailykos.votes.0812.csv"
# Store the address of the spreadsheet.
link = "https://docs.google.com/spreadsheet/pub?key=0Av8O-dN2giY6dEFCOFZ4ZnlKS0x3M3Y0WHd5aWFDWkE&output=csv"

We now need to get the data from that address, using the getURL() command from the RCurl package to fetch the online spreadsheet from the link object in which we stored the address. The ssl.verifypeer option is required to avoid an issue with the SSL certificate used by HTTPS links.

Note that the getURL() command is wrapped in a conditional statement that avoids downloading the file again if you already have it.

When the file is fetched from online, we convert the result, which is a large block of text, to a proper CSV (comma-separated values) file. When reading it back, we specify that we do not want strings converted to factors, i.e. that text variables should stay as plain character strings rather than being stored as integer codes with labels.

# Download dataset.
if (!file.exists(file)) {
    message("Dowloading the data...")
    # Download and read HTML spreadsheet.
    html <- textConnection(getURL(link, ssl.verifypeer = FALSE))
    # Convert and export CSV spreadsheet.
    write.csv(read.csv(html), file)
}
# Open file.
dkos <- read.csv(file, stringsAsFactors = FALSE)
# Check result.
str(dkos)
'data.frame':   435 obs. of  8 variables:
 $ X          : int  1 2 3 4 5 6 7 8 9 10 ...
 $ CD         : chr  "AK-AL" "AL-01" "AL-02" "AL-03" ...
 $ Incumbent  : chr  "Young, Don" "Bonner, Jo" "Roby, Martha" "Rogers, Mike D." ...
 $ Party      : chr  "(R)" "(R)" "(R)" "(R)" ...
 $ Obama.2012 : num  41.2 37.4 36.4 36.8 24 34.9 24.7 72.4 36.3 42.9 ...
 $ Romney.2012: num  55.3 61.8 62.9 62.3 74.8 63.9 74.3 27.1 61 54.7 ...
 $ Obama.2008 : num  38.1 38.5 35 36.6 25.5 36.3 25 71.5 39.2 44.3 ...
 $ McCain.2008: num  59.7 60.9 64.5 62.6 73.3 62.6 74.1 28.1 58 53.8 ...

We finally inspect the result by looking at the structure of the dataset with str() and at the first few rows of data with head(). The plot below shows the y = x diagonal line and two-dimensional density curves of the Obama vote share in congressional districts held by Democratic and Republican incumbents in 2008 and 2012.

# List first data rows.
head(dkos)
  X    CD       Incumbent Party Obama.2012 Romney.2012 Obama.2008
1 1 AK-AL      Young, Don   (R)       41.2        55.3       38.1
2 2 AL-01      Bonner, Jo   (R)       37.4        61.8       38.5
3 3 AL-02    Roby, Martha   (R)       36.4        62.9       35.0
4 4 AL-03 Rogers, Mike D.   (R)       36.8        62.3       36.6
5 5 AL-04   Aderholt, Rob   (R)       24.0        74.8       25.5
6 6 AL-05      Brooks, Mo   (R)       34.9        63.9       36.3
  McCain.2008
1        59.7
2        60.9
3        64.5
4        62.6
5        73.3
6        62.6
# Plot data points with the diagonal line and density curves.
qplot(data = dkos, y = Obama.2012, x = Obama.2008, 
      colour = Party, size = I(2), geom = "point") + 
  geom_abline(alpha = .5) + 
  geom_density2d() +
  scale_x_continuous(limits = c(0, 100)) + 
  scale_y_continuous(limits = c(0, 100))

[Figure dk-plot: Obama vote share by congressional district, 2012 against 2008, colored by incumbent party.]

The overall trend is clearly interpretable, and you can easily spot some “safe seats” for each party, as well as some “swing districts”. There are a few exceptions, like Joseph Cao, the sole Republican to vote with the Democrats on Barack Obama's health care reform (although he opposed the final version of the bill over concerns about abortion).
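You can list such exceptions yourself with a quick subset of the data. A minimal check, using the column names shown by str() above and an arbitrary 50% cutoff:

# Republican-held districts where Obama won a majority in 2008.
subset(dkos, Party == "(R)" & Obama.2008 > 50, select = c(CD, Incumbent, Obama.2008))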

Downloading and converting files

This section shows how to download the Quality of Government (QOG) dataset, which we will use in future sessions. QOG datasets feature a wide range of variables from various sources, including UN data, World Development Indicators, and several datasets compiled by academic researchers. We are going to use the QOG Standard dataset.

The first code block downloads the QOG codebook if it is not found in the data folder. Install the downloader package prior to running the code (by now, you should know how to install a package). Note that we first store the two target locations: the address of the file online, and its desired download location on disk.

# Target locations.
link = "http://www.qogdata.pol.gu.se/data/Codebook_QoG_Std15May13.pdf"
file = "data/qog.codebook.pdf"
# Download Quality of Government Standard codebook.
if (!file.exists(file)) download(link, file, mode = "wb")

The next code block checks whether you have the comma-separated values version of the data. If not, it downloads the QOG Standard cross-sectional dataset in Stata format, opens it with the read.dta() function from the foreign package, and converts it to CSV. The Stata format has the advantage of holding variable and value labels, so we keep it as a backup inside the ZIP archive.

# Extract Quality of Government Standard cross-sectional data from a ZIP
# archive.
zip = "data/qog.cs.zip"
qog = "data/qog.cs.csv"
if (!file.exists(zip)) {
    dta = "data/qog.cs.dta"
    download("http://www.qogdata.pol.gu.se/data/qog_std_cs.dta", dta, mode = "wb")
    write.csv(read.dta(dta, warn.missing.labels = FALSE), qog)
    zip(zip, file = c(dta, qog))
    file.remove(dta, qog)
}
qog = read.csv(unz(zip, qog), stringsAsFactors = FALSE)
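If you want to look at those Stata variable labels, read.dta() stores them in the var.labels attribute of the data frame it returns. A minimal sketch, assuming the .dta workfile is still on disk (run it before the block above deletes it, or extract the file from the ZIP archive first):

# Show the descriptive labels attached to the first few variables.
qog.dta <- read.dta("data/qog.cs.dta", warn.missing.labels = FALSE)
head(attr(qog.dta, "var.labels"))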

'Cross-sectional' means that the data has only one observation per unit of analysis, which are countries in this case. 'Cross-sectional time series' would hold the same data, but for repeated observations over time, such as the fertility rate of each country in 1970, 1971, and so on. Consult the QOG codebook for more details on the data.
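A quick way to verify that cross-sectional structure is to check that no country code appears twice:

# One row per country, so this should return FALSE.
any(duplicated(qog$ccodealp))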

The concluding example below uses data from the economists Barro and Lee and from the economic historian Angus Maddison. Another advantage of the Quality of Government data is that it comes with several country identifiers that make it easy to match external data to it, as we do below by adding continental origin based on ISO-3C country codes with the countrycode package.

# Add geographic continents using ISO-3C country codes.
qog$continent = factor(countrycode(qog$ccodealp, "iso3c", "continent"))
# Plot log-GDP/capita and female education, weighted by population (2002).
qplot(data = qog, y = log(wdi_gdpc), x = bl_asy25f, 
      colour = continent, size = mad_pop / 10^3, geom = "point") +
  scale_colour_brewer("Continent\n", palette = "Set1") +
  scale_size_area("Population\n", max_size = 24) + 
  labs(y = "log GDP/capita", x = "Female schooling years")

[Figure qog-plot: log GDP/capita against female schooling years, colored by continent and sized by population (2002).]

This plot would be much more informative as an animated graph; we will come back to this.

Scraping

For our purposes, scraping is the practice of taking information from online sources and processing it into a dataset. Zarino Zappia, the creator of the ScraperWiki website, has written great examples of such scrapers in the Python programming language, as with his scraper and chronological map of Occupy protest locations, using Wikipedia sources.

R can handle scraping of XML and HTML content with the XML package. Once you know how to do that, you can scrape many different kinds of content, such as Craigslist listings, and experiment with what the authors of that example have termed data sociology (note: the linked articles, written by two friends, are in French).
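Before scraping a live website, here is a minimal sketch of how these XML functions work, using a toy HTML string rather than one of the sources above:

# Parse a small HTML document and extract the text of every <p> node.
doc <- htmlParse("<html><body><p>Hello</p><p>World</p></body></html>", asText = TRUE)
xpathSApply(doc, "//p", xmlValue)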

Now for a real example: go to the World Chess Federation's website, which holds ratings for several thousand registered chess players. Search for Grandmaster players, and familiarize yourself with the table that the search returns (the variables are explained at the bottom). We are going to save these tables into a dedicated folder.

files = "data/fide"
if (!file.exists(files)) dir.create(files)

Notice how the table is split over several pages that share almost the same address (URL), which we store in the url object below. The only parameter that changes from one page to the next is the final offset value, which starts at 0 and ends at 1400 by increments of 100 (the number of players shown on each page). We therefore create a sequence to match that pagination.

# Link to each table.
url = "http://ratings.fide.com/advaction.phtml?title=g&offset="
# Link parameter.
i <- seq(0, 1400, 100)
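You can preview the page addresses that this sequence produces before scraping them:

# First few page URLs, built from the base address and the offset values.
head(paste0(url, i), 3)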

We are now going to scrape that table from the HTML code of the web pages. The next code block retrieves that code and applies an XPath expression to select the second table of class contentpaneopen on each page, which corresponds to the players table in the HTML source code. We subset the table by dropping a few rows and one column, and then save it to CSV format.

Note that the next code block can take a little while to run, depending on your processor speed and bandwidth capacity. Our own tests on modern equipment took a maximum of one minute.

# Scraper function.
fide <- sapply(i, FUN = function(x) {
    # Define filename.
    file = paste0(files, "/fide.table.", x, ".csv")
    # Scrape if needed.
    if (!file.exists(file)) {
        message("Downloading data to ", file)
        # Parse HTML.
        html <- htmlParse(paste0(url, x))
        # Select second table.
        html <- xpathApply(html, "//table[@class='contentpaneopen']")[[2]]
        # Import table.
        data <- readHTMLTable(html, skip.rows = 1:3, header = TRUE)[, -1]
        # Save as CSV.
        write.csv(data, file, row.names = FALSE)
    } else {
        message("Skipping table #", x)
    }
    return(file)
})
# Zip archive.
zip("data/fide.zip", fide)
# Delete workfiles.
file.remove(fide, files)
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[15] TRUE TRUE

The code above is verbose: it will tell you when it skips existing files, and it will tell you when it starts downloading one. If you already have all data tables in the fide folder of your data folder, try deleting one file from it and run the code again to see how it works. When you are done running the code, you will need to assemble all tables together, which is what we do below by reading them from the ZIP archive.

# Import tables into a list.
fide <- lapply(fide, function(x) {
    read.csv(unz("data/fide.zip", x))
})
# Convert list to data frame.
fide <- rbind.fill(fide)
# Remove rows with no player.
fide <- fide[!is.na(fide$Name), ]
# Check result.
tail(fide)
                       Name  T WT Oth..T. Fed  Rtg  Rpd  Blz B.Year S  F
1453 Zlochevskij, Alexander GM            ITA 2489 2485 2368   1963 M NA
1454             Zoler, Dan GM            ISR 2509 2548 2536   1974 M NA
1455        Zontakh, Andrey GM            UKR 2533   NA   NA   1970 M NA
1456     Zubarev, Alexander GM            UKR 2553 2524 2573   1979 M NA
1457       Zubov, Alexander GM            UKR 2582 2630 2632   1983 M NA
1458      Zvjaginsev, Vadim GM            RUS 2653 2630 2604   1976 M NA

Note the use of the rbind.fill() function to efficiently collapse the list of data frames produced by the lapply() call, which reads each table back from the ZIP archive using the filenames stored in fide.
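Unlike base rbind(), rbind.fill() accepts data frames with mismatched columns and pads the missing cells with NA. A minimal toy illustration (the data frames a and b are hypothetical, not part of the FIDE data):

a <- data.frame(x = 1, y = 2)
b <- data.frame(x = 3, z = 4)
# Shared column x is aligned; y and z are filled with NA where absent.
rbind.fill(a, b)

Finally, here's the distribution of FIDE ratings by sex and birth cohort, based on Grandmaster players.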

# Determine birth cohort (decades).
fide$cohort <- cut(fide$B.Year, 
              breaks = seq(1919, 1999, 10), 
              labels = seq(1920, 1999, 10))
# Extract numeric rating substring.
fide$Rtg <- as.numeric(substr(fide$Rtg, 1, 4))
# Plot ratings over age and sex.
qplot(data = subset(fide, !is.na(cohort)), 
      x = cohort, y = Rtg, fill = S, alpha = I(.5),
      geom = "boxplot") +
    scale_fill_brewer("Sex\n", palette = "Set1") +
    labs(x = "Birth cohort", y = "FIDE standard rating")

[Figure fide-plot: boxplots of FIDE standard ratings by birth cohort and sex.]

You should now be able to determine how far chess has come in its demographic transition towards “player parity”, and whether that shift is affecting the overall standard ratings of FIDE members. A student once had the great idea of combining these ratings with economic performance variables: check out the results!
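If you would like to try something similar, note that FIDE federation codes mostly follow IOC conventions, so the countrycode package can map them (approximately) to ISO-3C codes for matching with the QOG data. A rough sketch, assuming both fide and qog are still in memory (unmatched federations become NA):

# Map FIDE federation codes (IOC-style) to ISO-3C country codes.
fide$ccodealp <- countrycode(fide$Fed, "ioc", "iso3c")
# Merge player ratings with GDP per capita from the QOG dataset.
chess <- merge(fide, qog[, c("ccodealp", "wdi_gdpc")], by = "ccodealp")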

Next: Reshapes and aggregates.