Introduction to Data Analysis

2.2. Variables and factors

In R, everything is an object. Even the list of objects in memory, returned by ls(), is itself an object, as shown in the example below, which lists all objects in the current workspace and then wipes them from memory.

# List workspace objects.
ls()
# Erase entire workspace.
rm(list = ls())
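
To see that the listing is an ordinary object without erasing anything real, here is a minimal sketch that works in a throwaway environment instead of the workspace. The names scratch, x, y and listing are made up for illustration.

```r
# A scratch environment stands in for the workspace, so nothing real is erased.
scratch <- new.env()
assign("x", 1:3, envir = scratch)
assign("y", "hello", envir = scratch)

# ls() returns a character vector: an object like any other.
listing <- ls(envir = scratch)
class(listing)
listing

# rm(list = ...) accepts that vector and removes the objects it names.
rm(list = listing, envir = scratch)
length(ls(envir = scratch))
```

This is why rm(list = ls()) works: the result of ls() is simply passed on as the list of names to remove.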

Let's use some real, anonymized data from Autumn 2012. These are the grades from my three mathematics classes in first year. I have removed any student identification, so you have to trust me that these are the real grades (and yes, some grades range above 20/20!). The downloader package provides a handy command to download the data from the course repository, so we install and load it first.

# Install downloader package.
if (!"downloader" %in% installed.packages()[, 1]) install.packages("downloader")
# Load package.
require(downloader)
# Target file.
file = "data/grades.2012.csv"
# Create the data folder if needed.
if (!dir.exists(dirname(file))) dir.create(dirname(file))
# Download the data if needed.
if (!file.exists(file)) {
    # Locate the data.
    url = "https://raw.github.com/briatte/ida/master/data/grades.2012.csv"
    # Download the data.
    download(url, file, mode = "wb")
}
# Load the data.
grades <- read.table(file, header = TRUE)
# Check result.
head(grades)
  calc proba stats
1   18     7    15
2   15    10    14
3    4     3     8
4    7     9    14
5   20    12    19
6    3     4     4

Let's now use a package to create fake names for the students. We again need to install and load the package first: in later sessions, we will use a standard code block to install-and-load packages.
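
Such an install-and-load block might look like the sketch below. The helper name load_package is hypothetical and is not the course's own code; it only wraps the two steps used on this page into one function.

```r
# Hypothetical helper, not part of the course code: install a package
# only if it is missing, then attach it.
load_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
}

# Example with a package that ships with R, so nothing is downloaded.
load_package("stats")
```

For now, the two steps are still written out by hand: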

# Install randomNames package. Remember that R is case-sensitive.
if (!"randomNames" %in% installed.packages()[, 1]) install.packages("randomNames")
# Load package.
require(randomNames)
Loading required package: randomNames
# How many rows of data do we have?
(count = nrow(grades))
[1] 86
# Let's generate that many random names.
names <- randomNames(count)
# Finally, bind them to the data frame as a new column.
grades <- cbind(grades, names)
# Check result.
head(grades)
  calc proba stats            names
1   18     7    15   Nguyen, Ashley
2   15    10    14   Maestas, Doris
3    4     3     8  Romero, Vanessa
4    7     9    14 Velasquez, Maria
5   20    12    19   Borjas, Quiana
6    3     4     4  Swanson, Austin
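
The object returned by read.table is a data frame, and binding a vector to it with cbind gives back a data frame with one more column. A minimal sketch on made-up data (toy and labels are hypothetical names):

```r
# Made-up data standing in for the grades.
toy <- data.frame(calc = c(18, 15), proba = c(7, 10))
labels <- c("Student A", "Student B")

# cbind() on a data frame returns a data frame with one more column,
# named after the vector that was bound.
toy <- cbind(toy, labels)
is.data.frame(toy)
colnames(toy)
```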

Data frames

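Let's look at a final type of object: the data frame. read.table already returns one, so the conversion below leaves grades unchanged; it is shown only to make the type of the object explicit.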

# Convert to data frame.
grades <- as.data.frame(grades)
# Check result.
head(grades)
  calc proba stats            names
1   18     7    15   Nguyen, Ashley
2   15    10    14   Maestas, Doris
3    4     3     8  Romero, Vanessa
4    7     9    14 Velasquez, Maria
5   20    12    19   Borjas, Quiana
6    3     4     4  Swanson, Austin
# Check structure of a data frame.
str(grades)
'data.frame':   86 obs. of  4 variables:
 $ calc : int  18 15 4 7 20 3 6 9 18 13 ...
 $ proba: int  7 10 3 9 12 4 8 6 15 14 ...
 $ stats: int  15 14 8 14 19 4 7 9 19 18 ...
 $ names: Factor w/ 86 levels "Antonio, Gina",..: 53 43 61 75 8 67 40 35 54 59 ...
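
The names column above is stored as a factor: a vector of integer codes pointing to a sorted set of unique labels, called levels. This is why str prints numbers such as 53 and 43 after the first level name. A small sketch with three made-up entries shows the mechanics; toy_names is a hypothetical object. Note that R 4.0.0 and later no longer convert strings to factors by default, so on a recent installation the column may show up as chr instead.

```r
# Three entries, two of them identical.
toy_names <- factor(c("Romero, Vanessa", "Nguyen, Ashley", "Romero, Vanessa"))

# The levels are the unique values, sorted alphabetically.
levels(toy_names)

# The factor itself stores one integer code per entry, indexing the levels.
as.integer(toy_names)

# Converting back to text recovers the original strings.
as.character(toy_names)
```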

Data frames are very malleable objects: we can rearrange the variables easily with commands like melt from the reshape package.

# Install reshape package.
if (!"reshape" %in% installed.packages()[, 1]) install.packages("reshape")
# Load package.
require(reshape)
# Reshape data from 'wide' (lots of columns) to 'long' (lots of rows).
grades <- melt(grades, id.vars = "names")
# Check result to show how each grade is now held on a separate row.
head(grades[order(grades$names), ])
                names variable value
30      Antonio, Gina     calc    18
116     Antonio, Gina    proba    14
202     Antonio, Gina    stats    14
60  Aquiningoc, Carlo     calc     8
146 Aquiningoc, Carlo    proba    10
232 Aquiningoc, Carlo    stats    11
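
To make the reshaping concrete, the sketch below does by hand, in base R, roughly what melt does to a two-student toy table: the grade columns are stacked on top of each other, and a variable column records which exam each value came from. This also explains the row numbers in the output above, where the three grades of one student sit 86 rows apart. The toy data and object names are made up.

```r
# 'Wide' toy data: one row per student, one column per exam.
wide <- data.frame(names = c("Student A", "Student B"),
                   calc = c(18, 15), proba = c(7, 10))
exams <- c("calc", "proba")

# 'Long' version: one row per student and exam.
long <- data.frame(
  names    = rep(wide$names, times = length(exams)),  # names repeated per exam
  variable = rep(exams, each = nrow(wide)),           # which exam the value is from
  value    = unlist(wide[exams], use.names = FALSE)   # exam columns stacked
)

# Sort by student to see each student's grades on consecutive rows.
long[order(long$names), ]
```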

Let's finish with a few plots.

# Install ggplot2 package.
if(!"ggplot2" %in% installed.packages()[, 1])
  install.packages("ggplot2")
# Load package.
require(ggplot2)
# Plot all three exams.
qplot(data = grades, x = value, 
      group = variable, 
      geom = "density")

[Plot: density curves of the grades, one per exam.]

# Add color and transparency.
qplot(data = grades, x = value, 
      color = variable, 
      fill = variable, 
      alpha = I(.3), geom = "density")

[Plot: the same density curves, colored and filled by exam, with transparency.]
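
A density curve is a smoothed histogram whose area sums to one. As a rough base R sketch of the idea (not of the exact defaults used by ggplot2), here is the same kind of estimate on a made-up vector of grades; toy_grades, d and area are hypothetical names.

```r
# Made-up grades out of 20.
toy_grades <- c(3, 4, 7, 8, 9, 10, 12, 14, 14, 15, 18, 19, 20)

# Kernel density estimate: a smooth curve evaluated on a grid of x values.
d <- density(toy_grades)
plot(d, main = "Density of toy grades", xlab = "value")

# The area under the curve is approximately one.
area <- sum(d$y) * diff(d$x)[1]
area
```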

Now use the code on this page to:

  1. Download this data extract from the U.S. National Health Interview Survey 2005. Use the downloader package as shown above. Call the data nhis.

  2. Create an object called bmi that corresponds to the Body Mass Index from the height and weight columns of the nhis object. Use the U.S. formula since the data use inches and pounds. Bind the bmi object to the nhis object.

  3. Plot the results using qplot(data = ..., x = ..., geom = "density").

  4. Bonus question: explore how ggplot2 works and produce plots with the x and y variables. Guess what they stand for.
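
For step 2, the U.S. formula divides the weight in pounds by the square of the height in inches and multiplies the result by 703. Here is a minimal sketch on made-up values; bmi_us and the toy data frame are hypothetical, and the actual column names in nhis may differ.

```r
# Hypothetical helper: Body Mass Index from pounds and inches.
bmi_us <- function(weight_lb, height_in) {
  703 * weight_lb / height_in^2
}

# Made-up heights (inches) and weights (pounds).
toy <- data.frame(height = c(65, 70), weight = c(150, 180))

# Compute the index for every row and bind it as a new column.
toy <- cbind(toy, bmi = bmi_us(toy$weight, toy$height))
toy
```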
