Introduction to Data Analysis

4. Basic data operators

Today's session will show how to open a variety of datasets from different online sources, and to make sure that you know how to convert and import data to prepare it for analysis. The next code block introduces a standard way to check that you have the right packages installed for this page's code: you will find identical blocks on every other course page.

# Load packages.
packages <- c("ggplot2", "WDI")
packages <- lapply(packages, FUN = function(x) {
    if (!require(x, character.only = TRUE)) {
        install.packages(x)
        library(x, character.only = TRUE)
    }
})

Getting data from within R

A few years ago, the Gapminder initiative used elegant motion charts to call for the liberation of UN data. Such calls for open data have been met by limited but tangible initiatives to put data online, with specific attention to open access formats and programming facilities (API).

Today, there is a growing ecology of online data repositories and data APIs for R: have a look, for instance, at CKAN, at Data Market and its API, or at Quandl and its R package, at FAOSTAT, at the onlineData tag at CRANtastic

As a means of introduction, let's take a look at the World Bank Indicators, using the [dedicated WDI package][gh-wdi]. The package comes with good documentation and the WDIsearch() function, which you can use to look for indicators straight from R. The example below will download central government debt in percent of gross domestic product for a few high income countries.

# Get WDI data.
wdi <- WDI(country = c("US", "GB", "DE", "FR", "GR"), indicator = "GC.DOD.TOTL.GD.ZS", 
    start = 2005, end = 2011, extra = TRUE, cache = NULL)
# Check result.
str(wdi)
'data.frame':   35 obs. of  11 variables:
 $ iso2c            : chr  "DE" "DE" "DE" "DE" ...
 $ country          : chr  "Germany" "Germany" "Germany" "Germany" ...
 $ GC.DOD.TOTL.GD.ZS: num  55.3 55.6 47.6 43.1 40.8 ...
 $ year             : num  2011 2010 2009 2008 2007 ...
 $ iso3c            : Factor w/ 248 levels "ABW","AFG","AGO",..: 53 53 53 53 53 53 53 73 73 73 ...
 $ region           : Factor w/ 8 levels "Aggregates","East Asia & Pacific (all income levels)",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ capital          : Factor w/ 211 levels "","Abu Dhabi",..: 31 31 31 31 31 31 31 144 144 144 ...
 $ longitude        : Factor w/ 211 levels "","-0.126236",..: 94 94 94 94 94 94 94 126 126 126 ...
 $ latitude         : Factor w/ 211 levels "","-0.229498",..: 182 182 182 182 182 182 182 169 169 169 ...
 $ income           : Factor w/ 7 levels "Aggregates","High income: nonOECD",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ lending          : Factor w/ 5 levels "Aggregates","Blend",..: 5 5 5 5 5 5 5 5 5 5 ...

The data can be used as such to draw a plot of central government debt in a few countries over the past few years. The ggplot2 package provides a wealth of options to build, color and annotate such plots: the example below shows how to plot smoothed trends of governmental central debt over time for each country in the data, using a custom ColorBrewer scheme to color the lines.

# Smoothed time series plot.
g = qplot(data = wdi, x = year, y = GC.DOD.TOTL.GD.ZS,
          colour = country, se = FALSE, geom = c("smooth", "point")) +
  scale_colour_brewer("Country", palette = "Set1") +
  labs(title = "Central government debt, total (% GDP)\n", y = NULL, x = NULL)
# View result.
g

plot of chunk WDI-plot-1-auto

The ggplot2 syntax used in the example above is easily adaptable to create other plots, as you can just add new graphical elements to it. The next code block picks up the g graph object and adds countries as text labels at the end of the time series, rather than as a separate legend. More tweaking is needed in the scales and theme options.

g + geom_text(data = subset(wdi, year == 2011),
              aes(x = 2011.25, y = GC.DOD.TOTL.GD.ZS, label = country), 
              hjust = 0) +
  scale_x_continuous(lim = c(2005, 2012.5)) +
  theme(legend.position = "none", panel.grid.minor = element_blank())

plot of chunk wdi-plot-2-auto

Reading and saving plain text

To replicate these figures, you will need to save its data into a data table. The standard, universally readable format is comma-separated values (CSV), which can be saved with a plain text (TXT) file extension. Our steps to downloading data will often involve saving both the original data source and a “local” plain text copy in this format, to minimize reliance on proprietary formats.

# Target file location.
file = "data/wdi.govdebt.0511.csv"
# Export CSV file.
write.csv(wdi, file, row.names = FALSE)
# Read CSV file again.
wdi <- read.csv(file)

Comma-separated values will be our standard, so that we can always use the read.csv() function to read our data. Note that the row.names = FALSE option will avoid saving the generally useless row numbers into the first column of the file, and that we could also use the read.table() function with the sep = "," and header = TRUE options to import CSV data with variable names on top:

# Alternative CSV import.
wdi <- read.table(file, sep = ",", header = TRUE)
# Check result.
str(wdi)
'data.frame':   35 obs. of  11 variables:
 $ iso2c            : Factor w/ 5 levels "DE","FR","GB",..: 1 1 1 1 1 1 1 2 2 2 ...
 $ country          : Factor w/ 5 levels "France","Germany",..: 2 2 2 2 2 2 2 1 1 1 ...
 $ GC.DOD.TOTL.GD.ZS: num  55.3 55.6 47.6 43.1 40.8 ...
 $ year             : int  2011 2010 2009 2008 2007 2006 2005 2011 2010 2009 ...
 $ iso3c            : Factor w/ 5 levels "DEU","FRA","GBR",..: 1 1 1 1 1 1 1 2 2 2 ...
 $ region           : Factor w/ 2 levels "Europe & Central Asia (all income levels)",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ capital          : Factor w/ 5 levels "Athens","Berlin",..: 2 2 2 2 2 2 2 4 4 4 ...
 $ longitude        : num  13.4 13.4 13.4 13.4 13.4 ...
 $ latitude         : num  52.5 52.5 52.5 52.5 52.5 ...
 $ income           : Factor w/ 1 level "High income: OECD": 1 1 1 1 1 1 1 1 1 1 ...
 $ lending          : Factor w/ 1 level "Not classified": 1 1 1 1 1 1 1 1 1 1 ...

Note, finally, that by default, R will apply factors to columns with character data on import, which means that we will often use read.csv() with the stringsAsFactors = FALSE option when we want to import raw text data. There are many more options to explore, like UTF-8 encoding or using quotes to enclose the values, but our data I/O routine handles these by default.

In the case of the WDI data that we fetched here, saving was pretty straightforward because the data came through a specific API. In real life, importing, converting and preparing data for analysis can be much more messy, so we will now take the time to see many different ways to get data in and out of R, from plain text tables to Microsoft Excel spreadsheets and other common tabular formats.

Next: Import/Export.