Introduction to Data Analysis

8.2. Ordinary least squares

# Load packages.
packages <- c("downloader", "foreign", "ggplot2", "RColorBrewer", "reshape")
packages <- lapply(packages, FUN = function(x) {
    if (!require(x, character.only = TRUE)) {
        install.packages(x)
        library(x, character.only = TRUE)
    }
})

Let's try out multiple linear regression by modeling presidential approval as a function of economic performance in the last quarters before the election. The model is based on data provided by Larry Bartels (thanks!), and we will start by replicating his plot of the main variables of interest.

# Target locations.
link = "https://raw.github.com/briatte/ida/master/data/bartels.presvote.4812.csv"
file = "data/bartels.presvote.4812.csv"
# Download the data.
if (!file.exists(file)) download(link, file, mode = "wb")
# Load the data.
bartels <- read.csv(file, stringsAsFactors = FALSE)

Bartel's observation is that the vote margin of the incumbent's party is related to income growth, measured as the variation in disposable income per capita in the last quarters of a presidential term. The relationship is visually striking, and a simple linear regression model confirms that higher income growth predicts a higher vote margin to the incumbent.

# Scatterplot.
ggplot(bartels, aes(inc1415, incm, label = year)) + geom_smooth(method = "lm") + 
    geom_text()

plot of chunk bartels-model-1

# Simple OLS.
m1 = lm(incm ~ inc1415, data = bartels)
# Results.
summary(m1)

Call:
lm(formula = incm ~ inc1415, data = bartels)

Residuals:
   Min     1Q Median     3Q    Max 
-18.86  -5.34  -1.04   4.54  14.88 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   -0.873      3.175   -0.27    0.787  
inc1415        3.671      1.610    2.28    0.038 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.43 on 15 degrees of freedom
Multiple R-squared:  0.257, Adjusted R-squared:  0.208 
F-statistic:  5.2 on 1 and 15 DF,  p-value: 0.0377

Bartel's next step is to introduce presidential tenure into the regression equation, in order to control income growth by the number of years spent by the incumbent in power. The regression equation takes the form \(\hat Y = \beta_1 X_1 + \beta_2 X_2 + \epsilon\), where the \(\beta\) coefficients are partial derivatives to \(y\), the dependent variable.

m2 = lm(incm ~ inc1415 + I(tenure), data = bartels)
summary(m2)

Call:
lm(formula = incm ~ inc1415 + I(tenure), data = bartels)

Residuals:
   Min     1Q Median     3Q    Max 
-7.352 -3.917 -0.288  2.549  9.565 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    9.929      2.463    4.03   0.0012 ** 
inc1415        5.482      0.919    5.96  3.5e-05 ***
I(tenure)     -1.764      0.289   -6.11  2.7e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.1 on 14 degrees of freedom
Multiple R-squared:  0.798, Adjusted R-squared:  0.769 
F-statistic: 27.6 on 2 and 14 DF,  p-value: 1.39e-05

The coefficient that this model produces for the tenure variable indicates that tenure is a penalizer of approximately -1.76 in the equation (as noted by Bartels). Adjusting to tenure produces more accurate prediction of the dependent variable, as is observable by looking at the distribution of the residuals in each model:

# Extract model results.
m = rbind.fill(lapply(list(m1, m2), function(x) {
    model = as.character(x$call)[2]
    data.frame(model, year = bartels$year, residuals = residuals(x), yhat = fitted.values(x))
}))
# Histogram of the residuals.
qplot(data = m, x = residuals, color = model, geom = "density") + scale_color_brewer("Models:", 
    type = "qual", palette = "Set1") + theme(legend.position = "top")

plot of chunk bartels-models-plot

Another way to look at the same phenomenon is to bootstrap the estimated coefficients, which will show which parts of the model are most and least robust. The code of the visually weighted regression function that we use to bootstrap the estimated coefficients is adapted from a function by Felix Schönbrodt.

# Get vwReg function.
source("code/8_vwreg.r")
# Get color palette.
palette = brewer.pal(9, "RdYlGn")
# Code plot builder.
ggfit <- function(x) {
    bartels$yhat = fitted.values(x)
    g = vwReg(incm ~ yhat, bartels, method = lm, palette = palette) + geom_text(label = bartels$year)
    g + labs(y = "Incumbent Party Margin")
}
# Bootstrapped fitted values.
g1 = ggfit(m1)
g2 = ggfit(m2)

The final plots use a color gradient to show what happens to the linear fit of the models when some data are retrenched from the sample: with larger standard errors, the confidence intervals of the trend grow to shown large margins of uncertainty. The robust segment of the model is shown in green tint.

# Plot incumbent margin v. income growth.
g1 + labs(x = "Income Growth")

plot of chunk bartels-vwreg-auto

# Plot incumbent margin v. income growth, with tenure adjustment.
g2 + labs(x = "Income Growth, tenure-adjusted")

plot of chunk bartels-vwreg-auto

This type of visualization is useful to find ways to predict nonlinear relationships. The exame below shows how to plot the worldwide fertility rate against average female education, while controlling for a quadratic effect in their relationship. The ANOVA test serves to compare the error terms of the models.

# Download Quality of Government Standard dataset.
zip = "data/qog.cs.zip"
qog = "data/qog.cs.csv"
if (!file.exists(zip)) {
    dta = "data/qog.cs.dta"
    download("http://www.qogdata.pol.gu.se/data/qog_std_cs.dta", dta, mode = "wb")
    write.csv(read.dta(dta, warn.missing.labels = FALSE), qog)
    zip(zip, file = c(dta, qog))
    file.remove(dta, qog)
}
qog = read.csv(unz(zip, qog), stringsAsFactors = FALSE)
# Remove missing values.
qog = na.omit(with(qog, data.frame(ccodealp, wdi_fr, bl_asy25f)))
# Regression models.
m1 = lm(wdi_fr ~ bl_asy25f, qog)
m2 = lm(wdi_fr ~ bl_asy25f + I(bl_asy25f^2), qog)
# ANOVA fit test.
anova(m1, m2)

The plot is, again, produced by the vwreg function.

# Code plot builder.
ggfit = function(x, ...) {
    vwReg(formula(x), data = qog, method = lm, spag = TRUE, shade = FALSE, slices = 50, 
        ...)
}
# Visually weighted regression of linear model, without and with quadratic
# term.
p1 = ggfit(m1, spag.color = palette[1])
p2 = ggfit(m2, spag.color = palette[9], add = TRUE)

The final plot shows the amount of correction produced by the quadratic term.

# Construct plot for the regression results.
p1 + p2 + geom_point() + labs(y = "Fertility rate (number of births per woman)", 
    x = "Average education years among 25+ year-old females")

plot of chunk qog-vwreg-plot-auto

# View model results
summary(m1)

Call:
lm(formula = wdi_fr ~ bl_asy25f, data = qog)

Residuals:
   Min     1Q Median     3Q    Max 
-2.061 -0.587  0.011  0.556  2.924 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.1755     0.1958    26.4   <2e-16 ***
bl_asy25f    -0.3225     0.0242   -13.3   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.923 on 140 degrees of freedom
Multiple R-squared:  0.56,  Adjusted R-squared:  0.557 
F-statistic:  178 on 1 and 140 DF,  p-value: <2e-16
summary(m2)

Call:
lm(formula = wdi_fr ~ bl_asy25f + I(bl_asy25f^2), data = qog)

Residuals:
   Min     1Q Median     3Q    Max 
-1.972 -0.499 -0.084  0.479  3.223 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     6.33358    0.33096   19.14  < 2e-16 ***
bl_asy25f      -0.75777    0.10565   -7.17  4.0e-11 ***
I(bl_asy25f^2)  0.03170    0.00751    4.22  4.4e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.873 on 139 degrees of freedom
Multiple R-squared:  0.61,  Adjusted R-squared:  0.604 
F-statistic:  109 on 2 and 139 DF,  p-value: <2e-16

Next: Practice