This section introduces data import and export (or “I/O”, for “input/output”) with R.
```r
# Load packages.
packages <- c("countrycode", "downloader", "foreign", "ggplot2", "plyr",
              "RCurl", "XML")
packages <- lapply(packages, FUN = function(x) {
  if (!require(x, character.only = TRUE)) {
    install.packages(x)
    library(x, character.only = TRUE)
  }
})
```
We are going to use the Daily Kos Elections' presidential results by congressional district for the 2012 and 2008 elections, for which the data is accessible as a Google spreadsheet. The Google Docs API documentation tells us that we can get the data in CSV format through a URL request that includes the identifier key of the spreadsheet and the format specification `output=csv`.
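Since the only moving parts of such an address are the spreadsheet key and the output format, you can also compose it programmatically. Here is a minimal sketch, using the key of the spreadsheet we are about to download (the `key` object is introduced here for illustration only):

```r
# Build the CSV export address from the spreadsheet identifier key.
key <- "0Av8O-dN2giY6dEFCOFZ4ZnlKS0x3M3Y0WHd5aWFDWkE"
paste0("https://docs.google.com/spreadsheet/pub?key=", key, "&output=csv")
```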
```r
# Create a filename for the dataset.
file = "data/dailykos.votes.0812.csv"
# Store the address of the spreadsheet.
link = "https://docs.google.com/spreadsheet/pub?key=0Av8O-dN2giY6dEFCOFZ4ZnlKS0x3M3Y0WHd5aWFDWkE&output=csv"
```
We now need to get the data from that address, using the `getURL()` command from the `RCurl` package to fetch the online spreadsheet from the `link` object in which we stored the address. The `ssl.verifypeer` option is required to avoid an issue with the SSL certificate used by HTTPS links.

Note that the `getURL()` command sits inside a conditional statement that avoids downloading the same file again and again if you already have it.

When the file is fetched from online, we convert the result, which is one large character string, to a proper CSV (comma-separated values) file. We also specify that we do not want strings converted to factors, i.e. that we do not want the text variables stored as numerically coded categories.
```r
# Download dataset.
if (!file.exists(file)) {
  message("Downloading the data...")
  # Download and read HTML spreadsheet.
  html <- textConnection(getURL(link, ssl.verifypeer = FALSE))
  # Convert and export CSV spreadsheet.
  write.csv(read.csv(html), file)
}
```
```r
# Open file.
dkos <- read.csv(file, stringsAsFactors = FALSE)
# Check result.
str(dkos)
```
```
'data.frame':  435 obs. of  8 variables:
 $ X          : int  1 2 3 4 5 6 7 8 9 10 ...
 $ CD         : chr  "AK-AL" "AL-01" "AL-02" "AL-03" ...
 $ Incumbent  : chr  "Young, Don" "Bonner, Jo" "Roby, Martha" "Rogers, Mike D." ...
 $ Party      : chr  "(R)" "(R)" "(R)" "(R)" ...
 $ Obama.2012 : num  41.2 37.4 36.4 36.8 24 34.9 24.7 72.4 36.3 42.9 ...
 $ Romney.2012: num  55.3 61.8 62.9 62.3 74.8 63.9 74.3 27.1 61 54.7 ...
 $ Obama.2008 : num  38.1 38.5 35 36.6 25.5 36.3 25 71.5 39.2 44.3 ...
 $ McCain.2008: num  59.7 60.9 64.5 62.6 73.3 62.6 74.1 28.1 58 53.8 ...
```
We finally inspect the result by looking at the structure of the dataset with `str()` and at the first few rows of data with `head()`. The plot shows the [regression line][ggplot2-abline] and [two-dimensional density curves][ggplot2-density2d] of the Obama vote share in congressional districts held by Democratic and Republican congressmen in 2008 and 2012.
```r
# List first data rows.
head(dkos)
```
```
  X    CD       Incumbent Party Obama.2012 Romney.2012 Obama.2008
1 1 AK-AL      Young, Don   (R)       41.2        55.3       38.1
2 2 AL-01      Bonner, Jo   (R)       37.4        61.8       38.5
3 3 AL-02    Roby, Martha   (R)       36.4        62.9       35.0
4 4 AL-03 Rogers, Mike D.   (R)       36.8        62.3       36.6
5 5 AL-04   Aderholt, Rob   (R)       24.0        74.8       25.5
6 6 AL-05      Brooks, Mo   (R)       34.9        63.9       36.3
  McCain.2008
1        59.7
2        60.9
3        64.5
4        62.6
5        73.3
6        62.6
```
```r
# Plot data points with regression line and density curves.
qplot(data = dkos, y = Obama.2012, x = Obama.2008,
      colour = Party, size = I(2), geom = "point") +
  geom_abline(alpha = .5) +
  geom_density2d() +
  scale_x_continuous(limits = c(0, 100)) +
  scale_y_continuous(limits = c(0, 100))
```
The overall trend is clearly interpretable, and you can easily spot some “safe seats” for each party, as well as some “swing districts”. There are a few exceptions, like Joseph Cao, the sole Republican to vote with the Democrats on Barack Obama's health care reform (although he opposed the final version of the bill over concerns about abortion).
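To spot those exceptions in the data rather than on the plot, you can subset the crossover districts; the majority cutoff used below is just an illustrative choice:

```r
# List Republican-held districts where Obama won a majority of the 2012 vote.
subset(dkos, Party == "(R)" & Obama.2012 > 50,
       select = c(CD, Incumbent, Obama.2012, Obama.2008))
```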
This section shows how to download the Quality of Government (QOG) dataset, which we will use in future sessions. QOG datasets feature a wide range of variables from various sources, including UN data, World Development Indicators and several datasets compiled by academic researchers. We are going to use the QOG Standard dataset.
The first code block will start by downloading the QOG codebook if it is not found in the `data` folder. Install the `downloader` package prior to running the code (by now, you should know how to install a package). Note that we first store the target locations for the data: the link to its online location, and its desired download location on disk.
```r
# Target locations.
link = "http://www.qogdata.pol.gu.se/data/Codebook_QoG_Std15May13.pdf"
file = "data/qog.codebook.pdf"
# Download Quality of Government Standard codebook.
if (!file.exists(file)) download(link, file, mode = "wb")
```
The next code block checks whether you have the comma-separated values version of the data. If not, it downloads the QOG Standard cross-sectional dataset in Stata format, opens it with the `read.dta()` function from the `foreign` package, and converts it from that source. The Stata format has the advantage of holding variable and value labels, so we keep it as a backup.
```r
# Extract Quality of Government Standard cross-sectional data from a ZIP
# archive.
zip = "data/qog.cs.zip"
qog = "data/qog.cs.csv"
if (!file.exists(zip)) {
  dta = "data/qog.cs.dta"
  download("http://www.qogdata.pol.gu.se/data/qog_std_cs.dta", dta, mode = "wb")
  write.csv(read.dta(dta, warn.missing.labels = FALSE), qog)
  zip(zip, files = c(dta, qog))
  file.remove(dta, qog)
}
qog = read.csv(unz(zip, qog), stringsAsFactors = FALSE)
```
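The variable labels from the Stata file are not lost: `read.dta()` stores them as an attribute of the data frame it returns. Here is a minimal sketch, assuming that you first extract the `.dta` workfile back from the ZIP archive created above:

```r
# Extract the Stata workfile from the archive, then inspect its labels.
unzip("data/qog.cs.zip", files = "data/qog.cs.dta", exdir = ".")
qog_dta <- read.dta("data/qog.cs.dta", warn.missing.labels = FALSE)
# read.dta() keeps Stata variable labels in the 'var.labels' attribute.
head(attr(qog_dta, "var.labels"))
# Remove the extracted workfile when done.
file.remove("data/qog.cs.dta")
```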
“Cross-sectional” means that the data holds only one observation per unit of analysis, which are countries in this case. “Cross-sectional time series” would hold the same data for repeated observations over time, such as the fertility rate of each country in 1970, 1971, and so on. Consult the QOG codebook for more details on the data.
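As a quick sanity check of that cross-sectional structure, you can verify that the data holds exactly one row per country code (the `ccodealp` identifier is also used further below):

```r
# One observation per country: this should return TRUE.
nrow(qog) == length(unique(qog$ccodealp))
```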
The concluding example below uses data from economic historians Barro and Lee, and from Maddison. Another advantage of Quality of Government data is that it comes with several country identifiers that allow you to match external data to it, as we do below by adding continental origin based on ISO-3C country codes with the `countrycode` package.
```r
# Add geographic continents using ISO-3C country codes.
qog$continent = factor(countrycode(qog$ccodealp, "iso3c", "continent"))
# Plot log-GDP/capita and female education, weighted by population (2002).
qplot(data = qog, y = log(wdi_gdpc), x = bl_asy25f,
      colour = continent, size = mad_pop / 10^3, geom = "point") +
  scale_colour_brewer("Continent\n", palette = "Set1") +
  scale_size_area("Population\n", max_size = 24) +
  labs(y = "log GDP/capita", x = "Female schooling years")
```
This plot would be much more informative as an animated graph; we will come back to this.
For our purposes, scraping is the practice of taking information from online sources and processing it into a dataset. Zarino Zappia, the creator of the ScraperWiki website, has written great examples of such scrapers in the Python programming language, as with his scraper and chronological map of Occupy protest locations, which uses Wikipedia sources.
R can handle scraping of XML and HTML content with the `XML` package. Once you know how to do that, you can scrape many different kinds of content, like Craigslist ads, and experiment with what the authors of that example have termed data sociology (note: the linked posts, written by two friends, are in French).
Here's a quick example: go to the World Chess Federation's website, which holds ratings for several thousand registered chess players. Search for Grandmaster players, and familiarize yourself with the table that the search returns (the variables are explained at the bottom). We are going to save these tables into a dedicated folder.
```r
# Create a folder for the FIDE tables.
files = "data/fide"
if (!file.exists(files)) dir.create(files)
```
Notice how the table is split over several pages that share almost the same address (URL), which can be defined as the `url` object below. The only parameter that changes from one page to the next is the final `offset` value, which starts at `0` and ends at `1400` by increments of 100 (the number of players shown on each page). We will therefore create a sequence to fit that pagination.
```r
# Link to each table.
url = "http://ratings.fide.com/advaction.phtml?title=g&offset="
# Link parameter.
i <- seq(0, 1400, 100)
```
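As a quick check, you can paste the first offsets onto the base URL to see what the paginated addresses look like:

```r
# Inspect the first two page addresses built from the offset sequence.
paste0(url, head(i, 2))
```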
We are now going to scrape that table from the HTML code of the web pages. The next code block retrieves that code and applies an XPath expression to select the second table of class `contentpaneopen` on each page, which corresponds to the players table in the HTML source code. We subset the table by dropping a few rows and one column, and then save it in CSV format.
Note that the next code block can take a little while to run, depending on your processor speed and bandwidth capacity. Our own tests on modern equipment took a maximum of one minute.
```r
# Scraper function.
fide <- sapply(i, FUN = function(x) {
  # Define filename.
  file = paste0(files, "/fide.table.", x, ".csv")
  # Scrape if needed.
  if (!file.exists(file)) {
    message("Downloading data to ", file)
    # Parse HTML.
    html <- htmlParse(paste0(url, x))
    # Select second table.
    html <- xpathApply(html, "//table[@class='contentpaneopen']")[[2]]
    # Import table.
    data <- readHTMLTable(html, skip.rows = 1:3, header = TRUE)[, -1]
    # Save as CSV.
    write.csv(data, file, row.names = FALSE)
  } else {
    message("Skipping table #", x)
  }
  return(file)
})
```
```r
# Zip archive.
zip("data/fide.zip", fide)
# Delete workfiles.
message(fide)
message(files)
file.remove(fide, files)
```
```
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[15] TRUE TRUE
```
The code above is verbose: it will tell you when it skips existing files, and it will tell you when it starts downloading one. If you already have all the data tables in the `fide` folder of your `data` folder, try deleting one file from it and running the code again to see how it works. When you are done running the code, you will need to assemble all tables together, which is what we do below by reading them from the ZIP archive.
```r
# Import tables into a list.
fide <- lapply(fide, function(x) {
  read.csv(unz("data/fide.zip", x))
})
# Convert list to data frame.
fide <- rbind.fill(fide)
# Remove rows with no player.
fide <- fide[!is.na(fide$Name), ]
# Check result.
tail(fide)
```
```
                       Name  T WT Oth..T. Fed  Rtg  Rpd  Blz B.Year S  F
1453 Zlochevskij, Alexander GM            ITA 2489 2485 2368   1963 M NA
1454             Zoler, Dan GM            ISR 2509 2548 2536   1974 M NA
1455        Zontakh, Andrey GM            UKR 2533   NA   NA   1970 M NA
1456     Zubarev, Alexander GM            UKR 2553 2524 2573   1979 M NA
1457       Zubov, Alexander GM            UKR 2582 2630 2632   1983 M NA
1458      Zvjaginsev, Vadim GM            RUS 2653 2630 2604   1976 M NA
```
Note the use of the `rbind.fill()` function to efficiently collapse the list of data frames produced by the `lapply()` function, which itself reads each table listed in the `fide` filenames back from the ZIP archive. Finally, here's the distribution of FIDE ratings by sex and birth cohort, based on Grandmaster players.
```r
# Determine birth cohort (decades).
fide$cohort <- cut(fide$B.Year,
                   breaks = seq(1919, 1999, 10),
                   labels = seq(1920, 1999, 10))
# Extract numeric rating substring.
fide$Rtg <- as.numeric(substr(fide$Rtg, 0, 4))
# Plot ratings over age and sex.
qplot(data = subset(fide, !is.na(cohort)),
      x = cohort, y = Rtg, fill = S, alpha = I(.5),
      geom = "boxplot") +
  scale_fill_brewer("Sex\n", palette = "Set1") +
  labs(x = "Birth cohort", y = "FIDE standard rating")
```
You should now be able to determine how far chess is in its demographic transition towards “player parity”, and whether that trend is affecting the overall standard ratings of FIDE members. A student once had the great idea of throwing these ratings in with economic performance variables: check out the results!
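As a starting point, a one-liner can show the share of each sex within every birth cohort, using the `cohort` and `S` variables created above:

```r
# Share of each sex among Grandmasters, by birth cohort (rows sum to 1).
prop.table(table(fide$cohort, fide$S), margin = 1)
```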
Next: Reshapes and aggregates.