Introduction to Data Analysis

# 4.2. Reshapes and aggregates

Datasets do not always come in the exact form in which we want to analyze them: there is often some manipulation to do. Most people do this by hand, which is error-prone and time-consuming. This section covers a few ways to reshape a dataset. Reshaping is a set of operations to handle aggregates in your data, as when you want to compute the average income of each age and gender group.

Both examples below are based on the ggplot2 and reshape packages.

# Load packages.
packages <- c("downloader", "ggplot2", "plyr", "reshape", "scales")
packages <- lapply(packages, FUN = function(x) {
if (!require(x, character.only = TRUE)) {
install.packages(x)
library(x, character.only = TRUE)
}
})


## Visualizing the U.S. housing market by city

Here's an example based on the recent housing bubble. It shows changes in the Case Schiller Index a price index for the housing market, over the recent years.

# Open the data.
# Inspect the top data structure.
str(csi[, 1:5])
# Inspect the first data rows/columns.


You can see that the data uses U.S. states as columns. Instead of that “wide” format, we prefer to use “long” data where all states are held in a single column. To get this arrangement in more rows and less columns, we “collapse” the data: all column variables are put into a single one, except for the one identifier variable, year, which is repeated over rows.

# Collapse the data by years.
csi.melted <- melt(csi, id = "YEAR")
# Name the columns.
names(csi.melted) <- c("MonthYear", "City", "IndexValue")


The MonthYear variable is given a 01- prefix to make it the first day of each month, and that result is converted into a date, which is required for optimal plotting. This process uses time code format for day-month-year, %d-%b-%y, where %d is the day, %b the abbreviated month and %y the year without century. This process is explored again when we get to time series in Session 9.

# Convert dates.
csi.melted$Date <- as.Date(paste0("01-", csi.melted$MonthYear), "%d-%b-%y")


Finally, here's the first plot, using a different color for each city. There's a lot of data but the general trend is clearly understandable. You will get a warning due to missing values, but the plot should still process and look as below.

# Build line plot.
g1 <- ggplot(data = csi.melted, aes(x = Date, y = IndexValue)) +
geom_line(aes(color = City), size = 1.25) +
labs(x = NULL, y = "Case Schiller Index")
# View result.
g1


Let's go further and subset the data to a few urban or large states and plot the data again. Subsetting is a very common operation, so make sure that you learn about the subset() function to subset data frames. The argument that we pass here uses the %in% selector, which returns TRUE for each value of the City variable in the csi object that is present in the cities list.

# Select only a handful of states.
cities = c('NY.New.York', 'FL.Miami', 'CA.Los Angeles', 'MI.Detroit',
'TX.Dallas', 'IL.Chicago', 'DC.Washington')
# Create a subset of the data.
csi.subset = subset(csi.melted, City %in% cities)
# Build plot.
g2 <- ggplot(data = csi.subset, aes(x = Date, y = IndexValue)) +
geom_line(aes(color = City), size = 1.25) +
labs(x = NULL, y = "Case Schiller Index")
# View result.
g2


## Visualizing U.S. homicide trends by weapon type

In this segment, we show how to draw smooth trends of assault deaths in the United States. The data are from the Bureau of Justice Statistics and were mentioned on the Reddit /r/datasets/ channel in early 2013.

The first code block below will download the data and save it for you if you do not already have the data at hand: it looks for the htus8008 ZIP archive in your data folder (htus8008 stands for “Homicide Trends in the United States, 1980-2008”), and downloads the original data source if necessary. It then looks for a selected file inside the htus8008 folder.

# Identify ZIP folder.
zip = "data/htus8008.zip"
if (!file.exists(zip)) {
# Target data source.
url = "http://bjs.ojp.usdoj.gov/content/pub/sheets/htus8008.zip"
}
# Inspect ZIP contents.
head(unzip(zip, list = TRUE))

             Name Length                Date
1 htus8008f01.csv   2581 2011-11-10 09:27:00
2 htus8008f02.csv   2955 2011-11-10 09:28:00
3 htus8008f03.csv   1388 2011-11-10 09:29:00
4 htus8008f04.csv   1393 2011-11-10 09:29:00
5 htus8008f05.csv    844 2011-11-10 09:32:00
6 htus8008f06.csv   1368 2011-11-10 09:33:00

# Read CSV file.
bjs <- read.csv(unz(zip, "htus8008f42.csv"), skip = 11, nrows = 29)


The second code block imports one of the CSV files by reading only the valid data lines from it. See the README file of the BJS data folder for a list of all series, and open a few files to see how they are structured. The code block also does a bit of data cleaning: it removes the last (empty) column and replaces the dots in the names of some columns with spaces. It ends on routine data checks.

# Inspect the data.
str(bjs)

'data.frame':   29 obs. of  7 variables:
$Year : int 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 ...$ Handguns     : int  10552 10324 9137 8472 8183 8165 9054 8781 9375 10225 ...
$Other.guns : int 3834 3740 3501 2794 2835 2973 3126 3094 3162 3197 ...$ Knives       : int  4439 4364 4383 4214 3956 3996 4235 4076 3978 3923 ...
$Blunt.objects: int 1153 1166 1032 1098 1090 1051 1176 1169 1296 1279 ...$ Other.weapons: int  2295 2219 2311 2148 1999 2100 2294 2200 2171 2104 ...
$X : logi NA NA NA NA NA NA ...  # Remove last column. bjs <- bjs[, -7] # Clean names. names(bjs) <- gsub("\\.", " ", names(bjs)) # Check first rows. head(bjs)   Year Handguns Other guns Knives Blunt objects Other weapons 1 1980 10552 3834 4439 1153 2295 2 1981 10324 3740 4364 1166 2219 3 1982 9137 3501 4383 1032 2311 4 1983 8472 2794 4214 1098 2148 5 1984 8183 2835 3956 1090 1999 6 1985 8165 2973 3996 1051 2100  # Check final rows. tail(bjs)   Year Handguns Other guns Knives Blunt objects Other weapons 24 2003 8830 2223 2085 745 1709 25 2004 8304 2357 2133 759 1617 26 2005 8461 2894 2145 680 1477 27 2006 8845 2702 2068 689 1428 28 2007 8396 3100 2062 734 1496 29 2008 7727 3142 2169 702 1389  Our last step is to prepare the data by reshaping it to long format with the melt() function. The final result has one unit of analysis (years), one category of observations (weapon type), and one column of values (homicide counts). We rename the columns to proper variable names and reorder the weapon types, which are factors, by the average homicide count. # Reshape to long format. bjs <- melt(bjs, id = "Year") # Check result. head(bjs)   Year variable value 1 1980 Handguns 10552 2 1981 Handguns 10324 3 1982 Handguns 9137 4 1983 Handguns 8472 5 1984 Handguns 8183 6 1985 Handguns 8165  # Rename variables. names(bjs) <- c("Year", "Weapon", "Count") # Inspect weapon types. levels(bjs$Weapon)

[1] "Handguns"      "Other guns"    "Knives"        "Blunt objects"
[5] "Other weapons"

# Order weapon type by homicide count.
bjs$Weapon <- with(bjs, reorder(Weapon, -Count, mean))  The plot is now easy to set up with ggplot2: we set the canvas to represent the homicide counts on the vertical y-axis and the year on the horizontal x-axis, and then plot one colored line per weapon category. There's all sort of tweaks that might apply at that stage of the plot. A simple one used here is to format the vertical y-scale, to show commas every 1,000 units. # Plot canvas. qplot(data = bjs, y = Count, x = Year, color = Weapon, geom = c("line", "point")) + labs(y = "Homicide count", x = NULL) + scale_y_continuous(labels = comma)  To extract whatever statistic we need from the bjs dataset, we can apply a mean or summary function to the data, splitted by weapon type. These methods were first mentioned when we described vectorization, and are shown again when we cover descriptive statistics. Four methods to calculate by groups are shown below. # Average homicide count by weapon type, using with() and tapply(). with(bjs, tapply(Count, Weapon, mean))   Handguns Knives Other guns Other weapons Blunt objects 9713.2 3105.2 2879.5 1905.2 974.6  # Similar syntax with by() to get quintiles of homicide counts by weapon type. by(bjs$Count, bjs$Weapon, quantile)  bjs$Weapon: Handguns
0%   25%   50%   75%  100%
7727  8304  8845 10552 13981
--------------------------------------------------------
bjs$Weapon: Knives 0% 25% 50% 75% 100% 2019 2133 2960 3996 4439 -------------------------------------------------------- bjs$Weapon: Other guns
0%  25%  50%  75% 100%
2168 2539 2894 3142 3834
--------------------------------------------------------
bjs$Weapon: Other weapons 0% 25% 50% 75% 100% 1389 1617 1933 2171 2311 -------------------------------------------------------- bjs$Weapon: Blunt objects
0%  25%  50%  75% 100%
680  773  981 1153 1296


The tapply() and by() functions shown above resemble each other very much, while the next two syntaxes are more specific. The aggregate() function uses formula notation of the form y ~ x, which can also be used to write models and to facet plots in ggplot2 syntax. The ddply() function follows the syntax of the plyr package, which handles advanced data transformation routines.

# aggregate()'s formula notation of the form variable ~ group.
aggregate(Count ~ Weapon, bjs, summary)

         Weapon Count.Min. Count.1st Qu. Count.Median Count.Mean
1      Handguns       7730          8300         8840       9710
2        Knives       2020          2130         2960       3110
3    Other guns       2170          2540         2890       2880
4 Other weapons       1390          1620         1930       1910
5 Blunt objects        680           773          981        975
Count.3rd Qu. Count.Max.
1         10600      14000
2          4000       4440
3          3140       3830
4          2170       2310
5          1150       1300

# ddply()'s more demanding syntax offers more functionality.
ddply(bjs, .(Weapon), summarise, N = length(Count), Mean = mean(Count), SD = sd(Count),
Min = min(Count), Max = max(Count))

         Weapon  N   Mean     SD  Min   Max
1      Handguns 29 9713.2 1876.1 7727 13981
2        Knives 29 3105.2  945.0 2019  4439
3    Other guns 29 2879.5  458.4 2168  3834
4 Other weapons 29 1905.2  295.1 1389  2311
5 Blunt objects 29  974.6  201.5  680  1296


Next: Practice.