Introduction to Data Analysis

9. Visualization in time: Time series

Time series are made of observations repeated over time, such as the daily stock value of a company or the annual GDP growth rate of a country. They are a very common data format in demography, economics and the social sciences in general, and large amounts of data are available in that form. Try services like Quandl to find and plot series from many different sources.

Let's start our session with some revision about opening and preparing data for analysis, today in time series format. The first example shows how to plot party support in the United Kingdom since 1984 using ICM polling data, as The Guardian did a few years back. It uses a mix of old and new packages.

# Load packages.
packages <- c("ggplot2", "lubridate", "plyr", "reshape2", "RColorBrewer", "RCurl")
packages <- lapply(packages, FUN = function(x) {
    if (!require(x, character.only = TRUE)) {
        install.packages(x)
        library(x, character.only = TRUE)
    }
})

Data preparation

The data are imported from The Guardian Data Blog as a CSV spreadsheet. The download and import technique below was previously introduced in Session 4.1. The first step consists of setting two targets: the link from which we import the data, and the file to which we save it.

# Target data link.
link = "https://docs.google.com/spreadsheet/pub?key=0AonYZs4MzlZbcGhOdG0zTG1EWkVPOEY3OXRmOEIwZmc&output=csv"
# Target data file.
file = "data/icm.polls.8413.csv"

The next code block checks whether the file exists on disk. If it does not, the block downloads the data and saves it in CSV format. The file is then opened with strings (text characters) left as such. The download is handled by the RCurl package over an SSL connection with certificate verification switched off (ssl.verifypeer = FALSE). Its result has to be converted from raw text to a text connection, and then to comma-separated values.

# Download dataset.
if (!file.exists(file)) {
    message("Downloading the data...")
    # Download and read HTML spreadsheet.
    html <- textConnection(getURL(link, ssl.verifypeer = FALSE))
    # Convert and export CSV spreadsheet.
    write.csv(read.csv(html), file)
}
# Open file.
icm <- read.csv(file, stringsAsFactors = FALSE)
# Check result.
str(icm)
'data.frame':   380 obs. of  9 variables:
 $ X                              : int  1 2 3 4 5 6 7 8 9 10 ...
 $ End.of.fieldwork..election.date: chr  "15-06-1984" "15-07-1984" "15-08-1984" "15-09-1984" ...
 $ CON                            : chr  "37%" "34%" "36%" "39%" ...
 $ LAB                            : chr  "38%" "39%" "39%" "38%" ...
 $ LIB.DEM                        : chr  "23%" "26%" "24%" "21%" ...
 $ OTHER                          : chr  "2%" "1%" "1%" "2%" ...
 $ CON.LEAD.OVER.LABOUR           : chr  "-1%" "-5%" "-3%" "1%" ...
 $ Sample                         : chr  "n/a" "n/a" "n/a" "n/a" ...
 $ Fieldwork.dates                : chr  "June, 1984" "July, 1984" "Aug, 1984" "Sep 1984" ...
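
The X column in the output above is a side effect of saving the data: write.csv() writes row names by default, and read.csv() reads them back as a column named X. A minimal sketch of that round trip, using a made-up two-row excerpt rather than the real download (the object names raw, polls, tmp and back are illustrative only):

```r
# Toy CSV text standing in for the downloaded spreadsheet (made-up excerpt).
raw <- "Date,CON,LAB\n15-06-1984,37%,38%\n15-07-1984,34%,39%"

# Parse the text directly; read.csv() accepts a string through `text`.
polls <- read.csv(text = raw, stringsAsFactors = FALSE)

# Save to a temporary file, then read it back, as the block above does.
tmp <- tempfile(fileext = ".csv")
write.csv(polls, tmp)  # row names are written by default
back <- read.csv(tmp, stringsAsFactors = FALSE)
names(back)            # the row index comes back as a column

# Writing without row names avoids the extra column.
write.csv(polls, tmp, row.names = FALSE)
clean <- read.csv(tmp, stringsAsFactors = FALSE)
names(clean)
```

Note that the column positions used in the rest of this session (3-7) count the X column, so dropping it would shift them by one.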

As usual, the data structure shows a few issues. The first is that percentages are stored as text. We solve it by extracting the matrix of voting intentions that forms columns 3-7 of the data frame, removing the percentage symbols from it with the gsub() search-and-replace function, and returning the numeric results to the dataset. The columns are thereby converted from character to numeric class.

# Clean percentages.
icm[, 3:7] <- as.numeric(gsub("%", "", as.matrix(icm[, 3:7])))
# Check result.
str(icm)
'data.frame':   380 obs. of  9 variables:
 $ X                              : int  1 2 3 4 5 6 7 8 9 10 ...
 $ End.of.fieldwork..election.date: chr  "15-06-1984" "15-07-1984" "15-08-1984" "15-09-1984" ...
 $ CON                            : num  37 34 36 39 38 42 41 41 38 36 ...
 $ LAB                            : num  38 39 39 38 36 33 32 33 36 36 ...
 $ LIB.DEM                        : num  23 26 24 21 24 24 26 25 25 27 ...
 $ OTHER                          : num  2 1 1 2 2 1 1 1 1 1 ...
 $ CON.LEAD.OVER.LABOUR           : num  -1 -5 -3 1 2 9 9 8 2 0 ...
 $ Sample                         : chr  "n/a" "n/a" "n/a" "n/a" ...
 $ Fieldwork.dates                : chr  "June, 1984" "July, 1984" "Aug, 1984" "Sep 1984" ...
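
The same cleaning step can also be written column by column, which some readers find easier to follow than the matrix conversion. This is only a sketch on a made-up three-column data frame (polls is an illustrative name), not a replacement for the code above:

```r
# Toy data frame with percentages stored as text.
polls <- data.frame(
    CON  = c("37%", "34%"),
    LAB  = c("38%", "39%"),
    LEAD = c("-1%", "-5%"),
    stringsAsFactors = FALSE
)

# Strip the literal "%" sign and convert, one column at a time.
# Assigning to polls[] keeps the data frame structure and column names.
polls[] <- lapply(polls, function(x) as.numeric(sub("%", "", x, fixed = TRUE)))

str(polls)
```

Using fixed = TRUE treats "%" as a plain character rather than a regular expression, which makes no difference here but is a safe habit for literal symbols.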

A particularity of the data is that general election results have been marked as “GENERAL ELECTION RESULT” in the Sample column (take a look at the original spreadsheet to verify that). We derive a logical test from that information to create a TRUE/FALSE marker called GE for general elections. The result can be checked by showing the values recorded on these precise dates.

# Mark general elections.
icm$GE <- grepl("RESULT", icm$Sample)
# Check result.
icm[icm$GE, 2:6]
    End.of.fieldwork..election.date   CON   LAB LIB.DEM OTHER
42                       11-06-1987 43.00 32.00   23.00  2.00
100                      09-04-1992 43.00 35.00   18.00  4.00
165                      01-05-1997 31.40 44.40   17.20  7.00
222                      07-06-2001 32.70 42.00   18.80  6.50
273                      05-05-2005 33.20 36.20   22.60  8.00
339                      06-05-2010 36.45 29.01   23.03 11.95
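
To see what grepl() does on its own, here is a small sketch on a made-up vector. Only the "GENERAL ELECTION RESULT" marker comes from the dataset; the other values and the names sample_col and ge are assumptions for illustration:

```r
# Made-up excerpt of a Sample column.
sample_col <- c("n/a", "GENERAL ELECTION RESULT", "1,001", "n/a")

# grepl() returns TRUE wherever the pattern is found in the string.
ge <- grepl("RESULT", sample_col, fixed = TRUE)

ge        # logical marker, one value per row
sum(ge)   # number of rows flagged as general elections
which(ge) # their positions
```

Because the result is a logical vector, it can be used directly as a row index, as in icm[icm$GE, ].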

Another particular aspect of the Guardian/ICM dataset is the presence of dates, which are stored as text and can be converted to be recognized as dates. The next code block creates a Date variable in the icm dataset that will be used in later plots. The dmy() function of the lubridate package is used to convert the text strings from dd-mm-yyyy (day-month-year) format into proper dates, called POSIXct objects. The warning indicates that three entries could not be parsed; they become missing values.

# Convert dates.
icm$Date <- dmy(icm$End.of.fieldwork..election.date)
Warning: 3 failed to parse.
# Check result.
str(icm)
'data.frame':   380 obs. of  11 variables:
 $ X                              : int  1 2 3 4 5 6 7 8 9 10 ...
 $ End.of.fieldwork..election.date: chr  "15-06-1984" "15-07-1984" "15-08-1984" "15-09-1984" ...
 $ CON                            : num  37 34 36 39 38 42 41 41 38 36 ...
 $ LAB                            : num  38 39 39 38 36 33 32 33 36 36 ...
 $ LIB.DEM                        : num  23 26 24 21 24 24 26 25 25 27 ...
 $ OTHER                          : num  2 1 1 2 2 1 1 1 1 1 ...
 $ CON.LEAD.OVER.LABOUR           : num  -1 -5 -3 1 2 9 9 8 2 0 ...
 $ Sample                         : chr  "n/a" "n/a" "n/a" "n/a" ...
 $ Fieldwork.dates                : chr  "June, 1984" "July, 1984" "Aug, 1984" "Sep 1984" ...
 $ GE                             : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ Date                           : POSIXct, format: "1984-06-15" "1984-07-15" ...
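
When a date conversion reports parsing failures, it helps to look at the strings that failed. The sketch below uses the base as.Date() function on made-up input, since the three entries that failed above are not shown here; the empty string and "n/a" are assumptions about what such entries might look like:

```r
# Two valid dates and two made-up unparseable entries.
dates_txt <- c("15-06-1984", "15-07-1984", "", "n/a")

# With an explicit format, as.Date() returns NA for strings it cannot parse.
parsed <- as.Date(dates_txt, format = "%d-%m-%Y")

parsed                     # Date objects, with NA for failures
dates_txt[is.na(parsed)]   # the original strings that failed
```

Note that as.Date() returns objects of class Date rather than POSIXct. Both classes work as a time axis in the plots that follow.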

This process works like the as.Date() base function shown in Session 4.2, for which Teetor, ch. 7.9-7.11, is a good starting point. The lubridate package is a convenience tool that deals with date formats in a more flexible way. Extracting the year from the Date variable created above, for instance, can be done effortlessly, just like any other form of date extraction:

# List polling years.
table(year(icm$Date))

1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 
   7   12   12   13   12   12   12   12   16   12   12   12   12   16   12 
1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 
  12   12   20   12   12   12   16   12   13   12   11   19   12   12    6 
# List general election years.
table(year(icm$Date[icm$GE]))

1987 1992 1997 2001 2005 2010 
   1    1    1    1    1    1 
# List general election months.
table(month(icm$Date[icm$GE]))

4 5 6 
1 3 2 
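
For comparison, the same extractions can be sketched in base R with format() and quarters(). The three dates are the first general election dates listed earlier; the object names are illustrative:

```r
# First three general election dates from the dataset.
d <- as.Date(c("1987-06-11", "1992-04-09", "1997-05-01"))

# format() returns text, so convert to integer for numeric years and months.
yr <- as.integer(format(d, "%Y"))
mo <- as.integer(format(d, "%m"))

# quarters() labels each date with its calendar quarter.
qt <- quarters(d)

data.frame(date = d, year = yr, month = mo, quarter = qt)
```

The lubridate functions year() and month() save the format() and as.integer() steps, which is the convenience referred to above.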

Our final step is to remove unused information from the data by selecting the Date, GE and voting intentions columns to form the finalized icm dataset, which is then reshaped to long format in order to write one row of data for each political party. A few missing values that correspond to undated information at the end of the series are removed from the data.

# Subset data.
icm <- icm[, c("Date", "GE", "CON", "LAB", "LIB.DEM", "OTHER")]
# Drop missing data.
icm <- na.omit(icm)
# Reshape dataset.
icm <- melt(icm, id.vars = c("Date", "GE"), variable.name = "Party")
# Check result.
head(icm)
        Date    GE Party value
1 1984-06-15 FALSE   CON    37
2 1984-07-15 FALSE   CON    34
3 1984-08-15 FALSE   CON    36
4 1984-09-15 FALSE   CON    39
5 1984-10-15 FALSE   CON    38
6 1984-11-15 FALSE   CON    42
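
To make the reshape less of a black box, here is a base R sketch of what going from wide to long format amounts to, on a made-up two-row data frame. It illustrates the idea and is not how melt() is implemented; wide, long and parties are illustrative names:

```r
# Wide format: one row per date, one column per party.
wide <- data.frame(
    Date = as.Date(c("1984-06-15", "1984-07-15")),
    CON  = c(37, 34),
    LAB  = c(38, 39)
)
parties <- c("CON", "LAB")

# Long format: one row per date and party.
long <- data.frame(
    Date  = rep(wide$Date, times = length(parties)),          # dates repeated per party
    Party = factor(rep(parties, each = nrow(wide)), levels = parties),
    value = unlist(wide[parties], use.names = FALSE)          # columns stacked end to end
)

long
```

The long data frame has as many rows as dates times parties, which is what lets ggplot2 map Party to a color.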

The operations above leave us with correctly 'timed' data. Knowing how to work with dates becomes very important if you frequently analyze quarterly data (as with college enrollments) or even more fine-grained time series, like stock values (see, e.g., Moritz Marbach's analysis of the relationship between the Frankfurt DAX index and German federal elections).

Plotting time series

The reshape above gathered the parties, which were previously split over columns, into a single Party variable. Its levels follow the order of the original columns.

# Check party name order.
levels(icm$Party)
[1] "CON"     "LAB"     "LIB.DEM" "OTHER"  

The previous block checks the levels of the Party variable in order to assign each of them a specific color, using a vector of color values of the same length and order. The next code block uses Cynthia Brewer's ColorBrewer palettes of discrete colors to find tints of blue, red, orange and purple that fit each British party (purple for “Others” is arbitrary).

# View Set1 from ColorBrewer.
display.brewer.pal(7, "Set1")

plot of chunk icm-palette-1

# View selected color classes.
brewer.pal(7, "Set1")[c(2, 1, 5, 4)]
[1] "#377EB8" "#E41A1C" "#FF7F00" "#984EA3"

The selection of colors can be passed to a ggplot2 graph object through its option for manual scales. That option can itself be stored in a one-word object that we can quickly add to any number of graphs in what follows. We create similar objects to color fills as well as lines, and to drop the axis titles while giving a title to the overall graph.

# ggplot2 manual color palette.
colors <- scale_colour_manual(values = brewer.pal(7, "Set1")[c(2, 1, 5, 4)])
# ggplot2 manual fill color palette.
fcolors <- scale_fill_manual(values = brewer.pal(7, "Set1")[c(2, 1, 5, 4)])
# ggplot2 option to set titles.
titles <- labs(title = "Guardian/ICM voting intentions polls\n", y = NULL, x = NULL)
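
The scales above rely on the colors being listed in the same order as the factor levels. One possible safeguard, sketched below, is to name the color vector so that each color is tied to a party explicitly. The hex codes are those printed earlier; party_colors, party_levels and named_colors are illustrative names, and the ggplot2 call only runs if the package is installed:

```r
# Colors named after the levels of the Party variable.
party_colors <- c(
    CON     = "#377EB8",  # blue
    LAB     = "#E41A1C",  # red
    LIB.DEM = "#FF7F00",  # orange
    OTHER   = "#984EA3"   # purple (arbitrary)
)

# Check that every party level has a color before plotting.
party_levels <- c("CON", "LAB", "LIB.DEM", "OTHER")
stopifnot(setequal(names(party_colors), party_levels))

# A named vector is matched to factor levels by name rather than by position.
if (requireNamespace("ggplot2", quietly = TRUE)) {
    named_colors <- ggplot2::scale_colour_manual(values = party_colors)
}
```

With named values, reordering the factor levels later would not silently swap the party colors.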

The ICM polling data can now be plotted as a time series of estimated party support, using the Party variable and the colors object to determine the color of each series. We start with a ggplot2 object that uses a line geometry to connect observations over time, passing the graph options defined previously. The line break \n at the end of the title adds a margin.

# Time series lines.
qplot(data = icm, y = value, x = Date, color = Party, geom = "line") + colors + 
    titles

plot of chunk icm-lines-auto

A more detailed visualization shows the actual data points at a reduced size, set as a constant with I(0.75), along with a smoothed trend of each series. The smooth geometry will select its own method for smoothing the series, namely a LOESS estimator. The se option that shows its confidence interval is switched off, and the same graph options as before are passed for consistency.

# Scatterplot.
qplot(data = icm, y = value, x = Date, color = Party, size = I(0.75), geom = "point") + 
    geom_smooth(se = FALSE) + colors + titles

plot of chunk icm-points-auto
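
To get a feel for what a LOESS trend is, the sketch below fits one with the base loess() function on simulated monthly data. The series is made up, and the settings are loess()'s own defaults written out explicitly; they are not necessarily those chosen by the smooth geometry above:

```r
set.seed(42)

# Simulated monthly series: a slow wave plus random noise (made-up data).
dates <- seq(as.Date("1984-06-15"), by = "month", length.out = 60)
x <- as.numeric(dates)  # loess() expects numeric predictors
y <- 35 + 5 * sin(seq_along(x) / 10) + rnorm(length(x), sd = 2)

# Local regression: each fitted value uses a weighted neighborhood of points.
fit <- loess(y ~ x, span = 0.75, degree = 2)
trend <- predict(fit)

# Compare the first few observations with their smoothed values.
head(data.frame(date = dates, observed = round(y, 1), smoothed = round(trend, 1)))
```

A smaller span uses fewer neighboring points for each fitted value and produces a wigglier trend.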

Another visualization consists of stacking the entire series and showing it as an area of a common space representing 100% of voting intentions. Be careful with interpretation here: the electorate changes from one election to the next, and voting intentions are not votes. Slight errors appear at the top of the graph due to rounding approximations in the polling data.

# Stacked area plot.
qplot(data = icm, y = value, x = Date, color = Party, fill = Party, stat = "identity", 
    geom = "area") + colors + fcolors + titles

plot of chunk icm-stacked-auto
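
A quick way to locate rows that do not add up to 100% is to sum the party shares per date. The sketch below does this for the six general election rows printed earlier; the name ge and the 0.01 tolerance are arbitrary choices for illustration:

```r
# General election rows, copied from the output shown earlier.
ge <- data.frame(
    Year    = c(1987, 1992, 1997, 2001, 2005, 2010),
    CON     = c(43, 43, 31.4, 32.7, 33.2, 36.45),
    LAB     = c(32, 35, 44.4, 42, 36.2, 29.01),
    LIB.DEM = c(23, 18, 17.2, 18.8, 22.6, 23.03),
    OTHER   = c(2, 4, 7, 6.5, 8, 11.95)
)

# Total share per row.
ge$total <- rowSums(ge[, c("CON", "LAB", "LIB.DEM", "OTHER")])

# Rows whose total departs from 100 by more than a small tolerance.
ge[abs(ge$total - 100) > 0.01, c("Year", "total")]
```

The same check can be applied to the full series before stacking it, to see how far the top of the area plot departs from 100.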

The graph possibly contains excess information by assuming that voting intentions are volatile enough to express meaningful monthly variations. To show the same pattern with less data, we aggregate the data by averaging over years. The year() function from the lubridate package and the ddply() function from the plyr package show one possible way to achieve this result.

# Stacked bar plot.
qplot(data = ddply(icm, .(Year = year(Date), Party), summarise, value = mean(value)), 
    fill = Party, color = Party, x = Year, y = value, stat = "identity", geom = "bar") + 
    colors + fcolors + titles

plot of chunk icm-bars-auto
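
The yearly averaging can also be sketched with the base aggregate() function, which makes the intermediate table visible. The data below are made up and the names polls and yearly are illustrative; this is an alternative to the ddply() call above, not a description of it:

```r
# Made-up long-format polling data: three polls for each of two parties.
polls <- data.frame(
    Date  = as.Date(rep(c("1984-06-15", "1984-07-15", "1985-01-15"), times = 2)),
    Party = rep(c("CON", "LAB"), each = 3),
    value = c(37, 34, 40, 38, 39, 36)
)

# Extract the year, then average the values by year and party.
polls$Year <- as.integer(format(polls$Date, "%Y"))
yearly <- aggregate(value ~ Year + Party, data = polls, FUN = mean)

yearly
```

The result has one row per year and party, which is the structure that the stacked bar plot expects.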

Adding annotations

You might also want to use the general election results alone to plot actual vote shares. The slightly more complex line plot below shows them by subsetting the data to GE observations, by extracting their year with the year() function of the lubridate package, by plotting a white dot where they are located, and by superimposing the year in small text on that white space.

# Plotting only general elections.
qplot(data = icm[icm$GE, ], y = value, x = Date, color = Party, size = I(0.75), 
    geom = "line") + geom_point(size = 12, color = "white") + geom_text(aes(label = year(Date)), 
    size = 4) + colors + titles

plot of chunk icm-ges-auto

Now that we have some idea of how to represent time series visually, let's turn to the properties of time series that can be put into perspective with a bit of statistical analysis. The first one deals with temporal dependence in time series. The second one returns to graphs to explain how smoothed trends are produced. The final practice exercise shows how to model panel data.

Next: Autocorrelation.