Introduction to Data Analysis

4.3. Practice

Instructions: this week’s exercise is called 4_congress.R. Download or copy-paste that file into a new R script, then open it and run it with the working directory set as the IDA folder. If you download the script, make sure that your browser preserved its .R file extension.

This week's applications use a combination of plyr and ggplot2 functions to show how vectorization works to aggregate and reshape data in R. We do not need to cover all options, but we can survey some essential ones, as Neil Saunders and Jeffrey Breen did in tutorials that help realizing how useful these operations are to manipulate datasets with R.

The exercise looks at estimates of Congressional ideology.

plot of chunk iep-plot-auto

Grouping observations to plot

The data for our first application are estimates of Congressional ideology compiled by Boris Shor for the U.S. House of Representatives in 2012. The data are downloaded in XLSX and saved in a plain text comma-separated file. The variables stored in the dataset are documented in the online codebook) and stored in a data.frame object:

# Check result.
head(data[, 1:9])
  stdist st party pid     full.name incumbent    crp.id npat.id score
1    AK1 AK     D  -1 Sharon Cissna         0 N00034750   15863 -0.69
2    AK1 AK     R   1  Donald Young         1 N00007999   26717  0.26
3    AL1 AL     R   1 Josiah Bonner         1 N00025330   27522  0.45
4    AL2 AL     D  -1  Therese Ford         0 N00033968  135114 -0.69
5    AL2 AL     R   1   Martha Roby         1 N00030768   71604  0.41
6    AL3 AL     D  -1   John Harris         0 N00034222   80329 -0.86

We now have loaded some data where each row is a single observation, namely a member of the U.S. House of Representatives in 2012. Each member has been assigned an ideal point estimate of his or her ideological stance in Congress, from liberal (-) to conservative (+). This methodology provides a measure of party polarization in Congress, if we aggregate Congressmen by party.

Aggregating by party involves doing the following: determine all possible values taken by the party variable (and limiting ourselves to the first two in this case), get the data for just one party at a time, and compute the average ideological score in that group. This whole operation can be tiresomely carried with a loop, as below, but that is precisely the kind of approach that we will avoid later on.

# The naive approach: loop over levels, subset the data, compute the scores.
for (i in levels(data$party)) {
    # Create a subset of the data for that party.
    d = subset(data, party == i)
    # Compute the mean ideal estimate point.
    cat(i, nrow(d), mean(d$score, na.rm = TRUE), "\n")
}
D 344 -0.8941 
R 374 0.8174 
X 4 -0.0675 

This loop first determines the vector of party names D, R, X (Democrat, Republican, Independent) and then painfully extracts the number of observations and mean ideological score for each party. This type of looping is easily avoided by using the vector of party names to split the data, apply a mean function to the splitted parts and combine it back to shape:

Split-Apply-Combine, by Cosma Shalizi "http://www.stat.cmu.edu/~cshalizi/statcomp/12/lectures/12/lecture-12.pdf"

The exercise shows how to use the tapply() function, which returns one mean value (third argument) based on ideological score (first argument) for each group of observations formed by party membership (second argument). The tapply() syntax is pretty straightforward, but it can also follow two alternative syntaxes using aggregate or by() with data frames:

# Simple aggregation with tapply().
tapply(data$score, data$party, mean)
      D       R       X 
-0.8941  0.8174 -0.0675 
# The by() version for data frames with factors.
by(data$score, data$party, mean)
data$party: D
[1] -0.8941
-------------------------------------------------------- 
data$party: R
[1] 0.8174
-------------------------------------------------------- 
data$party: X
[1] -0.0675
# The aggregate() version with formula notation.
aggregate(score ~ party, data = data, FUN = "mean")
  party   score
1     D -0.8941
2     R  0.8174
3     X -0.0675

We can decompose what is happening here by calling another apply() function that allows to explain each step. In the sapply() example below, we write up an arbitrary function that extracts the mean ideological scores of one party after the other in the vector of unique party names. This example shows how much simplification work is being done in a command like tapply().

sapply(levels(data$party), function(x) {
    this.party = which(data$party == x)
    mean(data$score[this.party])
})
      D       R       X 
-0.8941  0.8174 -0.0675 

Aggregating by groups also works visually, as we will see several times during the course from next week onwards, as we look into ggplot2 syntax for plots where we want to visually discriminate some groups. The exercise ends on an example of that principle to stack the distributions of Congressional ideology by party. Distributions are important statistical functions studied later in the course.

The graph implicitly groups observations by assigning different fill and colour attributes to each group of variable party. These groups are assigned colors that were previously extracted from ColorBrewer palette Set1, using the brewer.pal() function of the RColorBrewer package to extract the nine set colors and select the blue (#2), red (#1) and grey (#9) tints.

Summarizing more complex aggregate data

We continue to look at ideology in the U.S. Congress by plotting the DW-NOMINATE index, using an example adapted from David Sparks. Unlike the cross-sectional estimate for 2012 that we just used, this index is available for several years and therefore requires a more advanced reshaping strategy that splits the data by three groups: Congressional session, year, and party.

Exploring this dataset is easy if you need simple measures like raw frequencies. Here, for instance, is the number of different observations in the data for each major party affiliation, and the average DW-NOMINATE score in each group, obtained through formula notation with the aggregate() function.

# Raw frequencies (N) by party.
table(dw$majorParty)

  Democrat      Other Republican 
     17131       4008      15495 
# Mean DW-NOMINATE score by party.
aggregate(dwnom1 ~ majorParty, dw, mean)
  majorParty  dwnom1
1   Democrat -0.3068
2      Other  0.1677
3 Republican  0.3350

The exercise show more ways to do calculations by groups at that stage and then turns to the ddply() function from the plyr package to aggregate the observations by major party within each congressional session, which makes for a more complex pattern than those seen so far. The crux is the .variables separator that groups the observations by Congress and major party.

# David Sparks' transformation to session-party measurements, using plyr for the
# ddply() function and Hmisc for the weighted wtd.functions.
dw.aggregated <- ddply(.data = dw,
                       .variables = .(cong, majorParty),
                       .fun = summarise,
                       Median = wtd.quantile(dwnom1, 1/bootse1, 1/2),
                       q25 = wtd.quantile(dwnom1, 1/bootse1, 1/4),
                       q75 = wtd.quantile(dwnom1, 1/bootse1, 3/4),
                       q05 = wtd.quantile(dwnom1, 1/bootse1, 1/20),
                       q95 = wtd.quantile(dwnom1, 1/bootse1, 19/20),
                       N = length(dwnom1))

What happens here is a three-dimensional split of the data, first by Congressional session, then by political party. This operation is one of seven possible transformations in such settings. Hadley Wickham's plyr package is a set of commands like ddply() to implement these transformations, which he calls “The Split-Apply-Combine Strategy for Data Analysis” and illustrates as goes:

Split-Apply-Combine, by Hadley Wickham "http://www.jstatsoft.org/v40/i01/paper"

The cube slices show the different split-ups that can be accomplished when you combine three variables (dimensions). In our example, the result is a dataset that contains a single observation by party (Democrat, Republican, Independent) and by Congress (1-111th), with its value being the median DW-NOMINATE score of that group. The results below are the results for the most recent Congress:

# Median DW-NOMINATE score of parties in the 111th Congress.
dw.aggregated[dw.aggregated$cong == 111, ]
    cong majorParty Median    q25    q75    q05    q95   N
271  111   Democrat -0.333 -0.435 -0.232 -0.594 -0.029 262
272  111 Republican  0.606  0.510  0.698  0.242  0.819 183

The following plot derives from the combination of this strategy to a ggplot setting, where filling and colouring elements to reflect membership to a certain grouping in the data is straightforward. For more plots of Congressional ideology, check what David Sparks has done with reshaping to cast, melt and mutate a dataset containing identical data.

plot of chunk dwnominate-plot-auto

Next week: Clusters.