Introduction to Data Analysis

3.1. Control flow

Let's take a look at control flow in R. Control flow is the code that determines which statements run, and in what order. Simple control flow statements appear throughout the code that we run in this course. One of them is the ifelse function, which tests a logical condition on each element of a vector and assigns one of two values accordingly:

# Nine random numbers.
x = runif(9)
# Play heads or tails.
ifelse(x > 0.5, "Heads", "Tails")
[1] "Tails" "Heads" "Tails" "Tails" "Heads" "Heads" "Heads" "Heads" "Heads"
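
To make the difference between ifelse and a plain if statement concrete, here is a minimal sketch. The seed value and the object names coin and first are assumptions made for illustration; they are not part of the course code.

```r
# Fix the seed so the draw is reproducible (the value 42 is arbitrary).
set.seed(42)
x <- runif(9)

# ifelse() is vectorized: it tests every element of x and returns
# a vector of the same length.
coin <- ifelse(x > 0.5, "Heads", "Tails")

# if () ... else ... expects a single TRUE or FALSE, so it can only
# test one element at a time.
first <- if (x[1] > 0.5) "Heads" else "Tails"

# Both approaches agree on the first toss.
identical(coin[1], first)
```

Use ifelse when the condition applies to a whole vector, and if when a single condition decides whether a block of code runs at all.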

The next code block, which shows the method used throughout the course to install and load packages, also tests a logical condition. In this example, the condition tries to load the package with the require command and, if that fails, installs the package and then loads it. The condition is applied to each element of a vector of package names, using the sapply function to go through the pkgs vector (which is not iteration strictly speaking; more on that later):

# List packages.
pkgs = c("downloader", "ggplot2", "plyr", "reshape", "xlsx")
# Load packages.
pkgs = sapply(pkgs, FUN = function(x) {
    if (!require(x, character.only = TRUE)) {
        # requires Internet access
        install.packages(x, quiet = TRUE)
        # load after installation
        library(x, character.only = TRUE)
    }
})
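
If you only want to know which packages are missing, without installing anything, the same condition can be sketched as follows. The helper name missing_pkgs and the fake package name are hypothetical, and the sketch swaps require for requireNamespace, which checks that a package can be loaded without attaching it.

```r
# Hypothetical helper: return the names of packages that cannot be loaded.
missing_pkgs <- function(pkgs) {
  # requireNamespace() returns TRUE or FALSE without attaching the package.
  ok <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!ok]
}

# "stats" and "utils" ship with R; the last name is made up on purpose.
missing_pkgs(c("stats", "utils", "notARealPackage123"))
```

The vector returned by the helper could then be passed to install.packages, which is the step that requires Internet access.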

Here's another example of a conditional statement, which shows a code block that executes only when a file is missing from the data folder. Running this block will download the recession data used in a series of plots that recently circulated in the economic blogosphere. The data are converted from Excel to CSV format (more on that next week).

# Set file locations.
link = "http://www.minneapolisfed.org/publications_papers/studies/recession_perspective/data/historical_recessions_recoveries_data_05_03_2013.xls"
file = "data/us.recessions.4807.xls"
data = "data/us.recessions.4807.csv"
# Download the data.
if (!file.exists(data)) {
    if (!file.exists(file)) 
        download(link, file, mode = "wb")
    file <- read.xlsx(file, 1, startRow = 8, endRow = 80, colIndex = 1:12)
    # Fix variable names.
    year = c(1948, 1953, 1957, 1960, 1969, 1973, 1980, 1981, 1990, 2001, 2007)
    names(file) = c("t", year)
    write.csv(file, data, row.names = FALSE)
}
# Load the data.
data <- read.csv(data, stringsAsFactors = FALSE)
# Fix variable names.
names(data)[-1] <- gsub("X", "", names(data)[-1])
# Inspect the data.
str(data)
'data.frame':   72 obs. of  12 variables:
 $ t   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ 1948: num  -0.367 -1.148 -1.536 -2.115 -2.133 ...
 $ 1953: num  -0.097 -0.338 -0.582 -1.247 -1.65 ...
 $ 1957: num  -0.369 -0.683 -1.075 -1.399 -1.978 ...
 $ 1960: num  -0.62 -0.848 -0.929 -0.985 -1.065 ...
 $ 1969: num  -0.0898 0.087 0.299 0.1516 -0.1656 ...
 $ 1973: num  0.162 0.25 0.442 0.495 0.61 ...
 $ 1980: num  0.087 0.2104 0.0507 -0.424 -0.7764 ...
 $ 1981: num  -0.0393 -0.1343 -0.2435 -0.4716 -0.7752 ...
 $ 1990: num  -0.188 -0.266 -0.411 -0.545 -0.598 ...
 $ 2001: num  -0.212 -0.243 -0.338 -0.43 -0.548 ...
 $ 2007: num  0.0101 -0.0514 -0.1087 -0.2644 -0.3992 ...
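
The "only when the file is missing" pattern can be tried without Internet access. The sketch below is an illustration under assumptions: the cache path, the helper build_if_missing and the toy values are all made up, and writing a small data frame stands in for the download and conversion step. It also shows why the variable names need fixing: read.csv prefixes names that start with a digit with an X.

```r
# Hypothetical cache file in a temporary folder; remove any earlier copy.
cache <- file.path(tempdir(), "example-cache.csv")
unlink(cache)

# Hypothetical helper: build the file only if it does not exist yet.
# Returns TRUE if the file was built, FALSE if the step was skipped.
build_if_missing <- function(path) {
  if (file.exists(path)) return(FALSE)
  # Stand-in for the download and conversion step (toy values).
  toy <- data.frame(t = 1:3, `1948` = c(-0.4, -1.1, -1.5), check.names = FALSE)
  write.csv(toy, path, row.names = FALSE)
  TRUE
}

first_call  <- build_if_missing(cache)  # file is missing: it gets built
second_call <- build_if_missing(cache)  # file exists: nothing happens

# read.csv() turns the column name "1948" into "X1948".
toy_read  <- read.csv(cache, stringsAsFactors = FALSE)
raw_names <- names(toy_read)

# Remove the prefix, as in the course code.
names(toy_read)[-1] <- gsub("X", "", names(toy_read)[-1])
names(toy_read)
```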

If we want to recreate the recession plots, we need to restructure the data. We return to that topic next week; for now, simply take a quick look at the syntax of the next code block. Its functions call for more complex statements than repetitive “if… then…” declarations, and rely instead on a syntax of the form “do … by…”:

# Reshape data to long format, by year.
data = melt(data, id = "t", variable = "recession.year")
head(data)
  t recession.year   value
1 1           1948 -0.3673
2 2           1948 -1.1484
3 3           1948 -1.5356
4 4           1948 -2.1153
5 5           1948 -2.1330
6 6           1948 -2.6818
# Extract last value, by melted series.
text = ddply(na.omit(data), .(recession.year), summarise, x = max(t) + 1, y = tail(value, 1))
head(text)
  recession.year  x      y
1           1948 73  8.818
2           1953 73  6.465
3           1957 73  7.119
4           1960 73 15.734
5           1969 73  9.513
6           1973 73 16.256
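
The “do … by…” idea does not depend on a particular package. As a rough sketch, here is the same logic in base R on a toy data set: the values are made up, the long format is built by hand instead of with melt, and tapply replaces ddply.

```r
# Toy data in wide format (made-up values): one column per recession.
wide <- data.frame(
  t = 1:3,
  `1948` = c(-0.4, -1.1, -1.5),
  `1953` = c(-0.1, -0.3, NA),
  check.names = FALSE
)

# Long format: one row per (t, recession.year) pair.
long <- data.frame(
  t = rep(wide$t, times = 2),
  recession.year = rep(c("1948", "1953"), each = nrow(wide)),
  value = c(wide[["1948"]], wide[["1953"]])
)

# Drop missing values, then "do" tail(v, 1) "by" recession year.
long <- na.omit(long)
last <- tapply(long$value, long$recession.year, function(v) tail(v, 1))
last
```

The result holds one value per recession year, which is the last non-missing observation of each series.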

And finally, here's one last way to control R code, by defining objects and piling them up into a plot. This is often used to write up graphs in ggplot2 syntax, which is shown below as a series of gg objects. The plot is produced only at the very end of the code, when all elements are passed together. Note how some elements manipulate data objects:

# Plot recession lines.
gg.base = qplot(data = data, 
                x = t, 
                y = value, 
                group = recession.year, 
                geom = "line")
# Plot 2007 recession in red.
gg.2007 = geom_line(data = subset(data, recession.year == 2007), 
                    color = "red", 
                    size = 1)
# Plot recession year labels.
gg.text = geom_text(data = text, 
                    aes(x = x, y = y, label = recession.year), 
                    hjust = 0,
                    color = ifelse(text$recession.year == 2007, "red", "black"))
# Plot zero-line.
gg.line = geom_hline(yintercept = 0, linetype = "dashed")
# Define x-axis.
gg.axis = scale_x_continuous("Months after peak", limits = c(0, 75))
# Define y-label.
gg.ylab = labs(y = "Cumulative decline from NBER peak")
# Build plot.
gg.base + 
  gg.2007 + 
  gg.text + 
  gg.line + 
  gg.axis + 
  gg.ylab  

[Figure: cumulative decline from NBER peak against months after peak, one line per recession, with the 2007 recession in red.]
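
To see that the pieces of a plot are ordinary R objects, here is a small sketch on toy data. It assumes that the ggplot2 package is installed, uses made-up values and object names, and builds the base plot with ggplot() rather than qplot.

```r
library(ggplot2)

# Toy data (made up for illustration): two short series.
toy <- data.frame(
  t = rep(1:3, times = 2),
  value = c(-0.4, -1.1, -1.5, -0.1, -0.3, -0.6),
  series = rep(c("a", "b"), each = 3)
)

# Each piece is stored as an object; nothing is drawn yet.
p.base <- ggplot(toy, aes(x = t, y = value, group = series)) + geom_line()
p.red  <- geom_line(data = subset(toy, series == "b"), color = "red")
p.zero <- geom_hline(yintercept = 0, linetype = "dashed")

# Adding the pieces returns a new plot object with one layer per geom.
p <- p.base + p.red + p.zero
length(p$layers)

# Printing the object, with print(p), is what draws the plot.
```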

This example shows the different kinds of coding structures that we will have to review throughout the course: basic control flow with conditions, iteration and vectorization, and transformations to data objects stacked into ggplot2 objects for visualization purposes. Each aspect of the code is reviewed as we gradually introduce new empirical examples.

Next: Iteration.