Datasets do not always come in the exact form in which we want to analyze them: there is often some manipulation to do. Most people do this by hand, which is error-prone and time-consuming. This section covers a few ways to reshape a dataset. Reshaping is a set of operations to handle aggregates in your data, as when you want to compute the average income of each age and gender group.
Both examples below are based on the ggplot2
and reshape
packages.
# Load packages.
packages <- c("downloader", "ggplot2", "plyr", "reshape", "scales")
packages <- lapply(packages, FUN = function(x) {
if (!require(x, character.only = TRUE)) {
install.packages(x)
library(x, character.only = TRUE)
}
})
Here's an example based on the recent housing bubble. It shows changes in the Case Schiller Index a price index for the housing market, over the recent years.
# Open the data.
csi <- read.csv("data/schiller.8712.csv")
# Inspect the top data structure.
str(csi[, 1:5])
# Inspect the first data rows/columns.
head(csi[, 1:5])
You can see that the data uses U.S. states as columns. Instead of that “wide” format, we prefer to use “long” data where all states are held in a single column. To get this arrangement in more rows and less columns, we “collapse” the data: all column variables are put into a single one, except for the one identifier variable, year, which is repeated over rows.
# Collapse the data by years.
csi.melted <- melt(csi, id = "YEAR")
# Name the columns.
names(csi.melted) <- c("MonthYear", "City", "IndexValue")
The MonthYear
variable is given a 01-
prefix to make it the first day of each month, and that result is converted into a date, which is required for optimal plotting. This process uses time code format for day-month-year, %d-%b-%y
, where %d
is the day, %b
the abbreviated month and %y
the year without century. This process is explored again when we get to time series in Session 9.
# Convert dates.
csi.melted$Date <- as.Date(paste0("01-", csi.melted$MonthYear), "%d-%b-%y")
Finally, here's the first plot, using a different color for each city. There's a lot of data but the general trend is clearly understandable. You will get a warning due to missing values, but the plot should still process and look as below.
# Build line plot.
g1 <- ggplot(data = csi.melted, aes(x = Date, y = IndexValue)) +
geom_line(aes(color = City), size = 1.25) +
labs(x = NULL, y = "Case Schiller Index")
# View result.
g1
Let's go further and subset the data to a few urban or large states and plot the data again. Subsetting is a very common operation, so make sure that you learn about the subset()
function to subset data frames. The argument that we pass here uses the %in%
selector, which returns TRUE for each value of the City
variable in the csi
object that is present in the cities
list.
# Select only a handful of states.
cities = c('NY.New.York', 'FL.Miami', 'CA.Los Angeles', 'MI.Detroit',
'TX.Dallas', 'IL.Chicago', 'DC.Washington')
# Create a subset of the data.
csi.subset = subset(csi.melted, City %in% cities)
# Build plot.
g2 <- ggplot(data = csi.subset, aes(x = Date, y = IndexValue)) +
geom_line(aes(color = City), size = 1.25) +
labs(x = NULL, y = "Case Schiller Index")
# View result.
g2
In this segment, we show how to draw smooth trends of assault deaths in the United States. The data are from the Bureau of Justice Statistics and were mentioned on the Reddit /r/datasets/ channel in early 2013.
The first code block below will download the data and save it for you if you do not already have the data at hand: it looks for the htus8008
ZIP archive in your data
folder (htus8008
stands for “Homicide Trends in the United States, 1980-2008”), and downloads the original data source if necessary. It then looks for a selected file inside the htus8008
folder.
# Identify ZIP folder.
zip = "data/htus8008.zip"
# Download ZIP archive.
if (!file.exists(zip)) {
# Target data source.
url = "http://bjs.ojp.usdoj.gov/content/pub/sheets/htus8008.zip"
# Download ZIP archive.
download(url, zip, mode = "wb")
}
# Inspect ZIP contents.
head(unzip(zip, list = TRUE))
Name Length Date
1 htus8008f01.csv 2581 2011-11-10 09:27:00
2 htus8008f02.csv 2955 2011-11-10 09:28:00
3 htus8008f03.csv 1388 2011-11-10 09:29:00
4 htus8008f04.csv 1393 2011-11-10 09:29:00
5 htus8008f05.csv 844 2011-11-10 09:32:00
6 htus8008f06.csv 1368 2011-11-10 09:33:00
# Read CSV file.
bjs <- read.csv(unz(zip, "htus8008f42.csv"), skip = 11, nrows = 29)
The second code block imports one of the CSV files by reading only the valid data lines from it. See the README
file of the BJS data folder for a list of all series, and open a few files to see how they are structured. The code block also does a bit of data cleaning: it removes the last (empty) column and replaces the dots in the names of some columns with spaces. It ends on routine data checks.
# Inspect the data.
str(bjs)
'data.frame': 29 obs. of 7 variables:
$ Year : int 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 ...
$ Handguns : int 10552 10324 9137 8472 8183 8165 9054 8781 9375 10225 ...
$ Other.guns : int 3834 3740 3501 2794 2835 2973 3126 3094 3162 3197 ...
$ Knives : int 4439 4364 4383 4214 3956 3996 4235 4076 3978 3923 ...
$ Blunt.objects: int 1153 1166 1032 1098 1090 1051 1176 1169 1296 1279 ...
$ Other.weapons: int 2295 2219 2311 2148 1999 2100 2294 2200 2171 2104 ...
$ X : logi NA NA NA NA NA NA ...
# Remove last column.
bjs <- bjs[, -7]
# Clean names.
names(bjs) <- gsub("\\.", " ", names(bjs))
# Check first rows.
head(bjs)
Year Handguns Other guns Knives Blunt objects Other weapons
1 1980 10552 3834 4439 1153 2295
2 1981 10324 3740 4364 1166 2219
3 1982 9137 3501 4383 1032 2311
4 1983 8472 2794 4214 1098 2148
5 1984 8183 2835 3956 1090 1999
6 1985 8165 2973 3996 1051 2100
# Check final rows.
tail(bjs)
Year Handguns Other guns Knives Blunt objects Other weapons
24 2003 8830 2223 2085 745 1709
25 2004 8304 2357 2133 759 1617
26 2005 8461 2894 2145 680 1477
27 2006 8845 2702 2068 689 1428
28 2007 8396 3100 2062 734 1496
29 2008 7727 3142 2169 702 1389
Our last step is to prepare the data by reshaping it to long format with the melt()
function. The final result has one unit of analysis (years), one category of observations (weapon type), and one column of values (homicide counts). We rename the columns to proper variable names and reorder the weapon types, which are factors, by the average homicide count.
# Reshape to long format.
bjs <- melt(bjs, id = "Year")
# Check result.
head(bjs)
Year variable value
1 1980 Handguns 10552
2 1981 Handguns 10324
3 1982 Handguns 9137
4 1983 Handguns 8472
5 1984 Handguns 8183
6 1985 Handguns 8165
# Rename variables.
names(bjs) <- c("Year", "Weapon", "Count")
# Inspect weapon types.
levels(bjs$Weapon)
[1] "Handguns" "Other guns" "Knives" "Blunt objects"
[5] "Other weapons"
# Order weapon type by homicide count.
bjs$Weapon <- with(bjs, reorder(Weapon, -Count, mean))
The plot is now easy to set up with ggplot2
: we set the canvas to represent the homicide counts on the vertical y-axis and the year on the horizontal x-axis, and then plot one colored line per weapon category. There's all sort of tweaks that might apply at that stage of the plot. A simple one used here is to format the vertical y-scale, to show commas every 1,000 units.
# Plot canvas.
qplot(data = bjs, y = Count, x = Year, color = Weapon,
geom = c("line", "point")) +
labs(y = "Homicide count", x = NULL) +
scale_y_continuous(labels = comma)
To extract whatever statistic we need from the bjs
dataset, we can apply a mean
or summary
function to the data, splitted by weapon type. These methods were first mentioned when we described vectorization, and are shown again when we cover descriptive statistics. Four methods to calculate by groups are shown below.
# Average homicide count by weapon type, using with() and tapply().
with(bjs, tapply(Count, Weapon, mean))
Handguns Knives Other guns Other weapons Blunt objects
9713.2 3105.2 2879.5 1905.2 974.6
# Similar syntax with by() to get quintiles of homicide counts by weapon type.
by(bjs$Count, bjs$Weapon, quantile)
bjs$Weapon: Handguns
0% 25% 50% 75% 100%
7727 8304 8845 10552 13981
--------------------------------------------------------
bjs$Weapon: Knives
0% 25% 50% 75% 100%
2019 2133 2960 3996 4439
--------------------------------------------------------
bjs$Weapon: Other guns
0% 25% 50% 75% 100%
2168 2539 2894 3142 3834
--------------------------------------------------------
bjs$Weapon: Other weapons
0% 25% 50% 75% 100%
1389 1617 1933 2171 2311
--------------------------------------------------------
bjs$Weapon: Blunt objects
0% 25% 50% 75% 100%
680 773 981 1153 1296
The tapply()
and by()
functions shown above resemble each other very much, while the next two syntaxes are more specific. The aggregate()
function uses formula notation of the form y ~ x
, which can also be used to write models and to facet plots in ggplot2
syntax. The ddply()
function follows the syntax of the plyr
package, which handles advanced data transformation routines.
# aggregate()'s formula notation of the form variable ~ group.
aggregate(Count ~ Weapon, bjs, summary)
Weapon Count.Min. Count.1st Qu. Count.Median Count.Mean
1 Handguns 7730 8300 8840 9710
2 Knives 2020 2130 2960 3110
3 Other guns 2170 2540 2890 2880
4 Other weapons 1390 1620 1930 1910
5 Blunt objects 680 773 981 975
Count.3rd Qu. Count.Max.
1 10600 14000
2 4000 4440
3 3140 3830
4 2170 2310
5 1150 1300
# ddply()'s more demanding syntax offers more functionality.
ddply(bjs, .(Weapon), summarise, N = length(Count), Mean = mean(Count), SD = sd(Count),
Min = min(Count), Max = max(Count))
Weapon N Mean SD Min Max
1 Handguns 29 9713.2 1876.1 7727 13981
2 Knives 29 3105.2 945.0 2019 4439
3 Other guns 29 2879.5 458.4 2168 3834
4 Other weapons 29 1905.2 295.1 1389 2311
5 Blunt objects 29 974.6 201.5 680 1296
Next: Practice.