Time series are made of observations that are repeated over time, such as the daily stock value of a company or the annual GDP growth rate of a country. They are a very common data format in demography, economics and social science in general, and large amounts of data are available as time series. Try services like Quandl to find and plot series from many different sources.
Let's start our session with some revision about opening and preparing data for analysis, today in time series format. The first example shows how to plot party support in the United Kingdom since 1984 using ICM polling data, as The Guardian did a few years back. It uses a mix of old and new packages.
# Load packages.
packages <- c("ggplot2", "lubridate", "plyr", "reshape2", "RColorBrewer", "RCurl")
packages <- lapply(packages, FUN = function(x) {
  if (!require(x, character.only = TRUE)) {
    install.packages(x)
    library(x, character.only = TRUE)
  }
})
The data source is imported from The Guardian Data Blog as a CSV spreadsheet. The download and import technique below was previously introduced in Session 4.1. The first step consists in assigning data targets: the one from which we import, and the one to which we save the data.
# Target data link.
link = "https://docs.google.com/spreadsheet/pub?key=0AonYZs4MzlZbcGhOdG0zTG1EWkVPOEY3OXRmOEIwZmc&output=csv"
# Target data file.
file = "data/icm.polls.8413.csv"
The next code block checks whether the file exists on disk; if it is not found, it downloads the data and converts it to CSV format. The file is then opened with strings (text characters) left as such. The download is handled by the RCurl package over an SSL connection without certificate verification, and its result has to be converted from raw text to structured text, and then to comma-separated values.
# Download dataset.
if (!file.exists(file)) {
  message("Downloading the data...")
  # Download and read HTML spreadsheet.
  html <- textConnection(getURL(link, ssl.verifypeer = FALSE))
  # Convert and export CSV spreadsheet.
  write.csv(read.csv(html), file)
}
# Open file.
icm <- read.csv(file, stringsAsFactors = FALSE)
# Check result.
str(icm)
'data.frame': 380 obs. of 9 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ End.of.fieldwork..election.date: chr "15-06-1984" "15-07-1984" "15-08-1984" "15-09-1984" ...
$ CON : chr "37%" "34%" "36%" "39%" ...
$ LAB : chr "38%" "39%" "39%" "38%" ...
$ LIB.DEM : chr "23%" "26%" "24%" "21%" ...
$ OTHER : chr "2%" "1%" "1%" "2%" ...
$ CON.LEAD.OVER.LABOUR : chr "-1%" "-5%" "-3%" "1%" ...
$ Sample : chr "n/a" "n/a" "n/a" "n/a" ...
$ Fieldwork.dates : chr "June, 1984" "July, 1984" "Aug, 1984" "Sep 1984" ...
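The text-to-CSV step described above can be tried on a small scale without any download. The sketch below assumes that the raw result is a single character string of comma-separated text; the `raw` string is a stand-in built from the first two rows shown above, not the actual server response.

```r
# Raw CSV text as one character string; this stands in for what the
# download step returns (an assumption made for illustration).
raw <- "Date,CON,LAB\n15-06-1984,37%,38%\n15-07-1984,34%,39%"

# Parse the string into a data frame, keeping text columns as character.
polls <- read.csv(text = raw, stringsAsFactors = FALSE)

# All three columns are read as text at this stage.
str(polls)
```

The `text` argument of `read.csv()` plays the same role as wrapping the string in `textConnection()`.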
As usual, the data structure shows a few issues. The first is solved by extracting the voting intentions matrix that forms columns 3-7 of the data frame, removing percentage symbols from it with the gsub() search-and-replace function, and returning the numeric results to the dataset. The resulting columns are converted from character to numeric class.
# Clean percentages.
icm[, 3:7] <- as.numeric(gsub("%", "", as.matrix(icm[, 3:7])))
# Check result.
str(icm)
'data.frame': 380 obs. of 9 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ End.of.fieldwork..election.date: chr "15-06-1984" "15-07-1984" "15-08-1984" "15-09-1984" ...
$ CON : num 37 34 36 39 38 42 41 41 38 36 ...
$ LAB : num 38 39 39 38 36 33 32 33 36 36 ...
$ LIB.DEM : num 23 26 24 21 24 24 26 25 25 27 ...
$ OTHER : num 2 1 1 2 2 1 1 1 1 1 ...
$ CON.LEAD.OVER.LABOUR : num -1 -5 -3 1 2 9 9 8 2 0 ...
$ Sample : chr "n/a" "n/a" "n/a" "n/a" ...
$ Fieldwork.dates : chr "June, 1984" "July, 1984" "Aug, 1984" "Sep 1984" ...
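The same cleaning step can also be written column by column, which some readers may find easier to follow. This is a minimal sketch on a toy data frame (the name `toy` is hypothetical, the values are the first two rows shown above); it is a variant of the matrix approach used in this session, not a replacement for it.

```r
# Toy data frame with percentages stored as text.
toy <- data.frame(
  CON = c("37%", "34%"),
  LAB = c("38%", "39%"),
  stringsAsFactors = FALSE
)

# Strip the percent sign from each column, then convert to numeric.
# fixed = TRUE treats "%" as a literal character rather than a pattern.
toy[] <- lapply(toy, function(x) as.numeric(gsub("%", "", x, fixed = TRUE)))

# Both columns are now of numeric class.
str(toy)
```

Any entry that is not a number once the symbol is removed, such as "n/a", would become NA with a warning.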
A peculiarity of the data is that general election results are marked as “GENERAL ELECTION RESULT” in the Sample column (take a look at the original spreadsheet to see this). We simply derive a logical vector from that information, in order to create a TRUE/FALSE marker called GE for general elections. The result can be checked by showing vote shares on these precise dates.
# Mark general elections.
icm$GE <- grepl("RESULT", icm$Sample)
# Check result.
icm[icm$GE, 2:6]
End.of.fieldwork..election.date CON LAB LIB.DEM OTHER
42 11-06-1987 43.00 32.00 23.00 2.00
100 09-04-1992 43.00 35.00 18.00 4.00
165 01-05-1997 31.40 44.40 17.20 7.00
222 07-06-2001 32.70 42.00 18.80 6.50
273 05-05-2005 33.20 36.20 22.60 8.00
339 06-05-2010 36.45 29.01 23.03 11.95
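To see how the marker is built, here is a minimal sketch of grepl() on a short character vector. The vector `sample_col` is hypothetical: "n/a" and the election label come from the data above, while "1001" is an invented sample size used only for illustration.

```r
# Hypothetical excerpt of a Sample-like column.
sample_col <- c("n/a", "GENERAL ELECTION RESULT", "1001")

# grepl() returns TRUE wherever the pattern is found in the string.
is_ge <- grepl("RESULT", sample_col)
is_ge

# Positions of the matching entries.
which(is_ge)
```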
Another aspect of the Guardian/ICM dataset is the presence of dates, which are stored as text and need to be converted to be recognized as such. The next code block creates a Date variable in the icm dataset that will be used in later plots. The dmy() function of the lubridate package is used to convert the text strings from dd-mm-yyyy (day-month-year) format into proper dates, stored here as POSIXct objects. The warning in the output flags three entries that could not be parsed as dates and are set to missing.
# Convert dates.
icm$Date <- dmy(icm$End.of.fieldwork..election.date)
Warning: 3 failed to parse.
# Check result.
str(icm)
'data.frame': 380 obs. of 11 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ End.of.fieldwork..election.date: chr "15-06-1984" "15-07-1984" "15-08-1984" "15-09-1984" ...
$ CON : num 37 34 36 39 38 42 41 41 38 36 ...
$ LAB : num 38 39 39 38 36 33 32 33 36 36 ...
$ LIB.DEM : num 23 26 24 21 24 24 26 25 25 27 ...
$ OTHER : num 2 1 1 2 2 1 1 1 1 1 ...
$ CON.LEAD.OVER.LABOUR : num -1 -5 -3 1 2 9 9 8 2 0 ...
$ Sample : chr "n/a" "n/a" "n/a" "n/a" ...
$ Fieldwork.dates : chr "June, 1984" "July, 1984" "Aug, 1984" "Sep 1984" ...
$ GE : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ Date : POSIXct, format: "1984-06-15" "1984-07-15" ...
This process works like the as.Date() base function shown in Session 4.2, for which Teetor, ch. 7.9-7.11, is a good starting point. The lubridate package is a convenience tool that deals with date formats in a more flexible way. Extracting the year from the Date variable created above, for instance, can be done effortlessly, just like any other form of date extraction:
# List polling years.
table(year(icm$Date))
1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998
7 12 12 13 12 12 12 12 16 12 12 12 12 16 12
1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
12 12 20 12 12 12 16 12 13 12 11 19 12 12 6
# List general election years.
table(year(icm$Date[icm$GE]))
1987 1992 1997 2001 2005 2010
1 1 1 1 1 1
# List general election months.
table(month(icm$Date[icm$GE]))
4 5 6
1 3 2
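For comparison, the conversion and extraction steps can be sketched with base R only. The vector `dates_chr` is a hypothetical excerpt: two dates taken from the data above, plus one entry that is not a date, to show what happens when parsing fails.

```r
# Two day-month-year strings and one unparseable entry.
dates_chr <- c("15-06-1984", "11-06-1987", "n/a")

# Convert with an explicit format; entries that do not match become NA.
dates <- as.Date(dates_chr, format = "%d-%m-%Y")
dates

# List the strings that failed to parse.
dates_chr[is.na(dates)]

# Extract the year and the month as text.
format(dates, "%Y")
format(dates, "%m")
```

Listing the failed entries before dropping them is a cheap way to confirm that nothing important is lost.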
Our final step is to remove unused information from the data by selecting the Date, GE and voting intentions columns to form the finalized icm dataset, which is then reshaped to long format in order to write one row of data for each poll and political party. A few missing values that correspond to undated information at the end of the series are removed from the data.
# Subset data.
icm <- icm[, c("Date", "GE", "CON", "LAB", "LIB.DEM", "OTHER")]
# Drop missing data.
icm <- na.omit(icm)
# Reshape dataset.
icm <- melt(icm, id.vars = c("Date", "GE"), variable.name = "Party")
# Check result.
head(icm)
Date GE Party value
1 1984-06-15 FALSE CON 37
2 1984-07-15 FALSE CON 34
3 1984-08-15 FALSE CON 36
4 1984-09-15 FALSE CON 39
5 1984-10-15 FALSE CON 38
6 1984-11-15 FALSE CON 42
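What the reshape does can be illustrated by hand on a tiny wide table. The sketch below uses base R only and hypothetical object names (`wide`, `parties`, `long`); it mimics the result of the reshape for two parties and two dates, and is not how melt() is implemented.

```r
# Wide format: one row per date, one column per party.
wide <- data.frame(
  Date = as.Date(c("1984-06-15", "1984-07-15")),
  CON  = c(37, 34),
  LAB  = c(38, 39)
)

parties <- c("CON", "LAB")

# Long format: one row per date and party.
long <- data.frame(
  Date  = rep(wide$Date, times = length(parties)),
  Party = factor(rep(parties, each = nrow(wide)), levels = parties),
  value = unlist(wide[parties], use.names = FALSE)
)
long
```

The party columns are stacked on top of each other, and the Party factor keeps the original column order as its level order.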
The operations above leave us with correctly 'timed' data. It becomes very important to know how to work with dates if you frequently analyze quarterly data (as with college enrollments) or even more fine-grained time series, like stock values (see, e.g., Moritz Marbach's analysis of the relationship between the Frankfurt DAX index and German federal elections).
The reshape was needed because the “party” variable was split over several columns; the long format holds a single Party variable, whose levels we check next.
# Check party name order.
levels(icm$Party)
[1] "CON" "LAB" "LIB.DEM" "OTHER"
We finished the previous block by checking the levels of the Party variable in order to assign it specific colors from a palette, using a vector of color values of the same length and order. The next code block uses Cynthia Brewer's ColorBrewer palette of discrete colors to find tints of blue, red, orange and purple that fit each British party (purple for “Others” is arbitrary).
# View Set1 from ColorBrewer.
display.brewer.pal(7, "Set1")
# View selected color classes.
brewer.pal(7, "Set1")[c(2, 1, 5, 4)]
[1] "#377EB8" "#E41A1C" "#FF7F00" "#984EA3"
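Relying on the order of the levels works, but it breaks silently if the levels are ever reordered. One hedged alternative is to name each color after its party, as in the sketch below; `party_colors` is a hypothetical object name, and the hex codes are the four values printed above.

```r
# Named vector: each party level is tied explicitly to its color.
party_colors <- c(
  CON     = "#377EB8",  # blue
  LAB     = "#E41A1C",  # red
  LIB.DEM = "#FF7F00",  # orange
  OTHER   = "#984EA3"   # purple
)

# Look up a color by party name.
party_colors[["LAB"]]
```

A named vector of this kind can be passed to the values argument of a manual scale, which then matches colors to levels by name rather than by position.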
The selection of colors can be passed to a ggplot2 graph object through its option for manual scales. That option can itself be stored in a one-word object that we can quickly add to any number of graphs in what follows. We create similar objects to color fills as well as lines, and to drop axis titles while giving a title to the overall graph.
# ggplot2 manual color palette.
colors <- scale_colour_manual(values = brewer.pal(7, "Set1")[c(2, 1, 5, 4)])
# ggplot2 manual fill color palette.
fcolors <- scale_fill_manual(values = brewer.pal(7, "Set1")[c(2, 1, 5, 4)])
# ggplot2 option to set titles.
titles <- labs(title = "Guardian/ICM voting intentions polls\n", y = NULL, x = NULL)
The ICM polling data can now be plotted as a time series of estimated party support, using the Party variable and the colors option to determine the color of each series. We start with a ggplot2 object that uses a line geometry to connect observations over time, passing the graph options defined previously. The line break \n at the end of the title adds a margin.
# Time series lines.
qplot(data = icm, y = value, x = Date, color = Party, geom = "line") + colors +
titles
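qplot() is a shortcut, and the same kind of graph can be spelled out with ggplot(), which may be preferable with recent versions of ggplot2. The sketch below is an assumption-laden equivalent rather than the code used in this session: it runs on a toy data frame (`toy`, built from the first four CON and LAB values shown earlier) instead of the full icm dataset.

```r
library(ggplot2)

# Toy long-format data standing in for the reshaped icm data frame.
toy <- data.frame(
  Date  = rep(as.Date(c("1984-06-15", "1984-07-15",
                        "1984-08-15", "1984-09-15")), times = 2),
  Party = factor(rep(c("CON", "LAB"), each = 4), levels = c("CON", "LAB")),
  value = c(37, 34, 36, 39, 38, 39, 39, 38)
)

# One line per party, with manual colors and no axis titles.
p <- ggplot(toy, aes(x = Date, y = value, colour = Party)) +
  geom_line() +
  scale_colour_manual(values = c(CON = "#377EB8", LAB = "#E41A1C")) +
  labs(title = "Guardian/ICM voting intentions polls\n", x = NULL, y = NULL)
p
```

The aesthetics that qplot() takes as arguments go inside aes(), and the geometry becomes an explicit layer.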
A more detailed visualization shows the actual data points at a reduced, constant size set with I(0.75), along with a smoothed trend of each series. The smooth geometry selects its own smoothing method, here a LOESS estimator. The se option is set to FALSE to hide the confidence interval, and the same graph options as before are passed for consistency.
# Scatterplot.
qplot(data = icm, y = value, x = Date, color = Party, size = I(0.75), geom = "point") +
geom_smooth(se = FALSE) + colors + titles
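The smoothed trend can be approximated outside of the plot with the loess() function from base R. The sketch below runs on simulated data, not on the ICM series, and the object names (`x`, `y`, `fit`, `smoothed`) are hypothetical; the span and degree shown are the loess() defaults, written out for clarity.

```r
set.seed(1)

# Simulated monthly series (not the ICM data): a slow trend plus noise.
x <- 1:60
y <- 35 + 5 * sin(x / 10) + rnorm(60, sd = 2)

# Local polynomial regression: each fitted value is based on the
# nearest 75% of the points, weighted by their distance.
fit <- loess(y ~ x, span = 0.75, degree = 2)

# Smoothed values at the observed time points.
smoothed <- predict(fit)
head(round(smoothed, 1))
```

A smaller span follows the data more closely, while a larger one produces a flatter trend.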
Another visualization consists in stacking the entire series and showing it as an area of a common space representing 100% of voting intentions. Be careful with interpretation here: the electorate changes from one election to the next, and voting intentions are not votes. Slight errors appear at the top of the graph due to rounding approximations in the polling data.
# Stacked area plot.
qplot(data = icm, y = value, x = Date, color = Party, fill = Party, stat = "identity",
geom = "area") + colors + fcolors + titles
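The rounding issue mentioned above can be checked by summing the party shares on each date. The sketch below does so for the six general election rows printed earlier; the data frame `ge` and its Total column are hypothetical names, and the same check could be run on the wide data before reshaping.

```r
# General election shares, copied from the output shown earlier.
ge <- data.frame(
  Year    = c(1987, 1992, 1997, 2001, 2005, 2010),
  CON     = c(43.00, 43.00, 31.40, 32.70, 33.20, 36.45),
  LAB     = c(32.00, 35.00, 44.40, 42.00, 36.20, 29.01),
  LIB.DEM = c(23.00, 18.00, 17.20, 18.80, 22.60, 23.03),
  OTHER   = c(2.00, 4.00, 7.00, 6.50, 8.00, 11.95)
)

# Sum of shares per election.
ge$Total <- with(ge, CON + LAB + LIB.DEM + OTHER)

# Rows where the total departs from 100 by more than 0.01 points.
ge[abs(ge$Total - 100) > 0.01, c("Year", "Total")]
```

Only the 2010 row departs from 100, with shares that add up to 100.44.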
The stacked area graph possibly contains excess information, since it assumes that voting intentions are volatile enough to express meaningful monthly variations. To show the same pattern with less data, we aggregate the data by averaging over years. The year() function from the lubridate package and the ddply() function from the plyr package show one possible way to achieve this result.
# Stacked bar plot.
qplot(data = ddply(icm, .(Year = year(Date), Party), summarise, value = mean(value)),
fill = Party, color = Party, x = Year, y = value, stat = "identity", geom = "bar") +
colors + fcolors + titles
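The yearly averaging step can also be sketched with base R, which makes the aggregated table easy to inspect before plotting. The toy data frame below is hypothetical: the values are taken from the first rows shown earlier, but the dates are spread over two years purely for illustration.

```r
# Toy long-format data: two parties, four dates over two years.
toy <- data.frame(
  Date  = rep(as.Date(c("1984-06-15", "1984-07-15",
                        "1985-06-15", "1985-07-15")), times = 2),
  Party = factor(rep(c("CON", "LAB"), each = 4), levels = c("CON", "LAB")),
  value = c(37, 34, 36, 39, 38, 39, 39, 38)
)

# Extract the year, then average the values by year and party.
toy$Year <- as.integer(format(toy$Date, "%Y"))
yearly <- aggregate(value ~ Year + Party, data = toy, FUN = mean)
yearly
```

The result has one row per year and party, which is the shape that the stacked bar plot expects.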
You might also want to take advantage of the general election results alone to plot actual vote shares. The slightly more complex line plot below shows them by subsetting the data to GE observations, extracting their year with the year() function of the lubridate package, plotting a white dot where they are located, and superimposing the year in small text on that white space.
# Plotting only general elections.
qplot(data = icm[icm$GE, ], y = value, x = Date, color = Party, size = I(0.75),
      geom = "line") +
  geom_point(size = 12, color = "white") +
  geom_text(aes(label = year(Date)), size = 4) +
  colors + titles
Now that we have some idea of how to represent time series visually, let's turn to the properties of time series that can be put into perspective with a bit of statistical analysis. The first one deals with temporal dependence in time series. The second one returns to graphs to explain how smoothed trends are produced. The final practice exercise shows how to model panel data.
Next: Autocorrelation.