Introduction to Data Analysis

4.2. Reshapes and aggregates

Datasets do not always come in the exact form in which we want to analyze them: there is often some manipulation to do. Most people do this by hand, which is error-prone and time-consuming. This section covers a few ways to reshape a dataset. Reshaping is a set of operations to handle aggregates in your data, as when you want to compute the average income of each age and gender group.

Both examples below are based on the ggplot2 and reshape packages.

# Load packages.
packages <- c("downloader", "ggplot2", "plyr", "reshape", "scales")
packages <- lapply(packages, FUN = function(x) {
    if (!require(x, character.only = TRUE)) {
        install.packages(x)
        library(x, character.only = TRUE)
    }
})

Visualizing the U.S. housing market by city

Here's an example based on the recent housing bubble. It shows changes in the Case Schiller Index a price index for the housing market, over the recent years.

# Open the data.
csi <- read.csv("data/schiller.8712.csv")
# Inspect the top data structure.
str(csi[, 1:5])
# Inspect the first data rows/columns.
head(csi[, 1:5])

You can see that the data uses U.S. states as columns. Instead of that “wide” format, we prefer to use “long” data where all states are held in a single column. To get this arrangement in more rows and less columns, we “collapse” the data: all column variables are put into a single one, except for the one identifier variable, year, which is repeated over rows.

# Collapse the data by years.
csi.melted <- melt(csi, id = "YEAR")
# Name the columns.
names(csi.melted) <- c("MonthYear", "City", "IndexValue")

The MonthYear variable is given a 01- prefix to make it the first day of each month, and that result is converted into a date, which is required for optimal plotting. This process uses time code format for day-month-year, %d-%b-%y, where %d is the day, %b the abbreviated month and %y the year without century. This process is explored again when we get to time series in Session 9.

# Convert dates.
csi.melted$Date <- as.Date(paste0("01-", csi.melted$MonthYear), "%d-%b-%y")

Finally, here's the first plot, using a different color for each city. There's a lot of data but the general trend is clearly understandable. You will get a warning due to missing values, but the plot should still process and look as below.

# Build line plot.
g1 <- ggplot(data = csi.melted, aes(x = Date, y = IndexValue)) +
  geom_line(aes(color = City), size = 1.25) +
  labs(x = NULL, y = "Case Schiller Index")
# View result.
g1

plot of chunk csi-plot-all-auto

Let's go further and subset the data to a few urban or large states and plot the data again. Subsetting is a very common operation, so make sure that you learn about the subset() function to subset data frames. The argument that we pass here uses the %in% selector, which returns TRUE for each value of the City variable in the csi object that is present in the cities list.

# Select only a handful of states.
cities = c('NY.New.York', 'FL.Miami', 'CA.Los Angeles', 'MI.Detroit', 
           'TX.Dallas', 'IL.Chicago', 'DC.Washington')
# Create a subset of the data.
csi.subset = subset(csi.melted, City %in% cities)
# Build plot.
g2 <- ggplot(data = csi.subset, aes(x = Date, y = IndexValue)) +
  geom_line(aes(color = City), size = 1.25) +
  labs(x = NULL, y = "Case Schiller Index")
# View result.
g2

plot of chunk csi-plot-selection-auto

Visualizing U.S. homicide trends by weapon type

In this segment, we show how to draw smooth trends of assault deaths in the United States. The data are from the Bureau of Justice Statistics and were mentioned on the Reddit /r/datasets/ channel in early 2013.

The first code block below will download the data and save it for you if you do not already have the data at hand: it looks for the htus8008 ZIP archive in your data folder (htus8008 stands for “Homicide Trends in the United States, 1980-2008”), and downloads the original data source if necessary. It then looks for a selected file inside the htus8008 folder.

# Identify ZIP folder.
zip = "data/htus8008.zip"
# Download ZIP archive.
if (!file.exists(zip)) {
    # Target data source.
    url = "http://bjs.ojp.usdoj.gov/content/pub/sheets/htus8008.zip"
    # Download ZIP archive.
    download(url, zip, mode = "wb")
}
# Inspect ZIP contents.
head(unzip(zip, list = TRUE))
             Name Length                Date
1 htus8008f01.csv   2581 2011-11-10 09:27:00
2 htus8008f02.csv   2955 2011-11-10 09:28:00
3 htus8008f03.csv   1388 2011-11-10 09:29:00
4 htus8008f04.csv   1393 2011-11-10 09:29:00
5 htus8008f05.csv    844 2011-11-10 09:32:00
6 htus8008f06.csv   1368 2011-11-10 09:33:00
# Read CSV file.
bjs <- read.csv(unz(zip, "htus8008f42.csv"), skip = 11, nrows = 29)

The second code block imports one of the CSV files by reading only the valid data lines from it. See the README file of the BJS data folder for a list of all series, and open a few files to see how they are structured. The code block also does a bit of data cleaning: it removes the last (empty) column and replaces the dots in the names of some columns with spaces. It ends on routine data checks.

# Inspect the data.
str(bjs)
'data.frame':   29 obs. of  7 variables:
 $ Year         : int  1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 ...
 $ Handguns     : int  10552 10324 9137 8472 8183 8165 9054 8781 9375 10225 ...
 $ Other.guns   : int  3834 3740 3501 2794 2835 2973 3126 3094 3162 3197 ...
 $ Knives       : int  4439 4364 4383 4214 3956 3996 4235 4076 3978 3923 ...
 $ Blunt.objects: int  1153 1166 1032 1098 1090 1051 1176 1169 1296 1279 ...
 $ Other.weapons: int  2295 2219 2311 2148 1999 2100 2294 2200 2171 2104 ...
 $ X            : logi  NA NA NA NA NA NA ...
# Remove last column.
bjs <- bjs[, -7]
# Clean names.
names(bjs) <- gsub("\\.", " ", names(bjs))
# Check first rows.
head(bjs)
  Year Handguns Other guns Knives Blunt objects Other weapons
1 1980    10552       3834   4439          1153          2295
2 1981    10324       3740   4364          1166          2219
3 1982     9137       3501   4383          1032          2311
4 1983     8472       2794   4214          1098          2148
5 1984     8183       2835   3956          1090          1999
6 1985     8165       2973   3996          1051          2100
# Check final rows.
tail(bjs)
   Year Handguns Other guns Knives Blunt objects Other weapons
24 2003     8830       2223   2085           745          1709
25 2004     8304       2357   2133           759          1617
26 2005     8461       2894   2145           680          1477
27 2006     8845       2702   2068           689          1428
28 2007     8396       3100   2062           734          1496
29 2008     7727       3142   2169           702          1389

Our last step is to prepare the data by reshaping it to long format with the melt() function. The final result has one unit of analysis (years), one category of observations (weapon type), and one column of values (homicide counts). We rename the columns to proper variable names and reorder the weapon types, which are factors, by the average homicide count.

# Reshape to long format.
bjs <- melt(bjs, id = "Year")
# Check result.
head(bjs)
  Year variable value
1 1980 Handguns 10552
2 1981 Handguns 10324
3 1982 Handguns  9137
4 1983 Handguns  8472
5 1984 Handguns  8183
6 1985 Handguns  8165
# Rename variables.
names(bjs) <- c("Year", "Weapon", "Count")
# Inspect weapon types.
levels(bjs$Weapon)
[1] "Handguns"      "Other guns"    "Knives"        "Blunt objects"
[5] "Other weapons"
# Order weapon type by homicide count.
bjs$Weapon <- with(bjs, reorder(Weapon, -Count, mean))

The plot is now easy to set up with ggplot2: we set the canvas to represent the homicide counts on the vertical y-axis and the year on the horizontal x-axis, and then plot one colored line per weapon category. There's all sort of tweaks that might apply at that stage of the plot. A simple one used here is to format the vertical y-scale, to show commas every 1,000 units.

# Plot canvas.
qplot(data = bjs, y = Count, x = Year, color = Weapon,
      geom = c("line", "point")) +
  labs(y = "Homicide count", x = NULL) +
  scale_y_continuous(labels = comma)

plot of chunk bjs-plot-auto

To extract whatever statistic we need from the bjs dataset, we can apply a mean or summary function to the data, splitted by weapon type. These methods were first mentioned when we described vectorization, and are shown again when we cover descriptive statistics. Four methods to calculate by groups are shown below.

# Average homicide count by weapon type, using with() and tapply().
with(bjs, tapply(Count, Weapon, mean))
     Handguns        Knives    Other guns Other weapons Blunt objects 
       9713.2        3105.2        2879.5        1905.2         974.6 
# Similar syntax with by() to get quintiles of homicide counts by weapon type.
by(bjs$Count, bjs$Weapon, quantile)
bjs$Weapon: Handguns
   0%   25%   50%   75%  100% 
 7727  8304  8845 10552 13981 
-------------------------------------------------------- 
bjs$Weapon: Knives
  0%  25%  50%  75% 100% 
2019 2133 2960 3996 4439 
-------------------------------------------------------- 
bjs$Weapon: Other guns
  0%  25%  50%  75% 100% 
2168 2539 2894 3142 3834 
-------------------------------------------------------- 
bjs$Weapon: Other weapons
  0%  25%  50%  75% 100% 
1389 1617 1933 2171 2311 
-------------------------------------------------------- 
bjs$Weapon: Blunt objects
  0%  25%  50%  75% 100% 
 680  773  981 1153 1296 

The tapply() and by() functions shown above resemble each other very much, while the next two syntaxes are more specific. The aggregate() function uses formula notation of the form y ~ x, which can also be used to write models and to facet plots in ggplot2 syntax. The ddply() function follows the syntax of the plyr package, which handles advanced data transformation routines.

# aggregate()'s formula notation of the form variable ~ group.
aggregate(Count ~ Weapon, bjs, summary)
         Weapon Count.Min. Count.1st Qu. Count.Median Count.Mean
1      Handguns       7730          8300         8840       9710
2        Knives       2020          2130         2960       3110
3    Other guns       2170          2540         2890       2880
4 Other weapons       1390          1620         1930       1910
5 Blunt objects        680           773          981        975
  Count.3rd Qu. Count.Max.
1         10600      14000
2          4000       4440
3          3140       3830
4          2170       2310
5          1150       1300
# ddply()'s more demanding syntax offers more functionality.
ddply(bjs, .(Weapon), summarise, N = length(Count), Mean = mean(Count), SD = sd(Count), 
    Min = min(Count), Max = max(Count))
         Weapon  N   Mean     SD  Min   Max
1      Handguns 29 9713.2 1876.1 7727 13981
2        Knives 29 3105.2  945.0 2019  4439
3    Other guns 29 2879.5  458.4 2168  3834
4 Other weapons 29 1905.2  295.1 1389  2311
5 Blunt objects 29  974.6  201.5  680  1296

Next: Practice.