Introduction to Data Analysis

2.1. Vectors and matrices

In R, almost everything is a vector. If it's not a vector, it might be a special kind of vector. When something is not a vector, then it's not data and is probably some form of function.

What is a vector? A vector is an ordered list of \(k\) things. The first thing is item #1 of the vector, the second thing is item #2 and so on. Here's a sequence of \(k = 4\) integer values:

# A sequence.
1:4
[1] 1 2 3 4

We are going to store the sequence into an object called x and verify that it is indeed a vector. And of course it is. It's a vector of dimension 4: it has four items.

# Store it.
x <- c(1:4)
# Is it a vector?
is.vector(x)
[1] TRUE

So vectors are just lists of values. The values must be of the same type: you cannot mix numbers and text, for instance. If you do, R will convert (recycle) all items of the vector to a single type, or “mode”, like numeric or character if there are text elements in the vector.

# A vector of numbers.
x <- c(4, 6, 8)
# A vector of strings, i.e. characters, i.e. text.
y <- c("alea", "jacta", "est")
# A 'mixed' vector will be automatically recycled.
z <- c(4, "a")

You can see the different types of your objects in the Workspace. The examples below are also vectors:

# Also a vector.
y <- "This is a vector."
# Check.
is.vector(y)
[1] TRUE
# Also a vector.
z <- 1
# Check.
is.vector(z)
[1] TRUE
# What about...
m <- cbind(1:3, 4:6)
# What type (or class) is that?
class(m)
[1] "matrix"

Okay, that's a matrix. A vector is a \(1 \times k\) unidimensional matrix. So in fact, in R, everything is a matrix, or some special kind of a matrix. Welcome to R. Note that we are going to introduce a lot of terminology below, but this is not a scientific computing class, and we are not computer scientists. We just need to make sure that we grasp the essentials.

Working with vectors

Let's combine a fictional set of student grades into a vector called grades. Let's compute the mean, the median, and finally a whole bunch of summary statistics about the resulting object.

# Create vector of random exam grades.
exam <- c(7, 13, 19, 8, 12)
# Check result.
exam
[1]  7 13 19  8 12
# Compute mean of exam vector.
mean(exam)
[1] 11.8
# Compute median of exam vector.
median(exam)
[1] 12
# Descriptive statistics for exam vector.
summary(exam)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    7.0     8.0    12.0    11.8    13.0    19.0 

Note that R uses a precise taxonomy of objects, and that even though we just used the word “list” to describe a vector, a list is not the same as a vector. If you ask R if the exam object is a vector, the result will be TRUE, but not if you ask whether exam is a list.

# Recall the exam object.
exam
[1]  7 13 19  8 12
# It's a vector.
is.vector(exam)
[1] TRUE
# Not a very long one.
length(exam)
[1] 5

Select from a vector by adding item numbers in square brackets to its name. The item numbers also form a vector.

# Select its first element; in context: show grade of first student.
exam[1]
[1] 7
# Select more than one element by listing which values you want.
exam[1:3]
[1]  7 13 19
# Select a vector of values: show grades of students no. 1, 2, 3 and 5.
exam[c(1:3, 5)]
[1]  7 13 19 12
# In context, it does makes more sense to simply exclude student no. 4's
# grade.
exam[-4]
[1]  7 13 19 12
# Vectors can be logical.
exam >= 10
[1] FALSE  TRUE  TRUE FALSE  TRUE
# Select a logical vector of values.
exam[exam >= 10]
[1] 13 19 12

What is shown here is important: in R, everything is an object. Here, exam is an object, just like the list of values that you want to show from it. This logic is everywhere in R. It is harder to learn than less flexible syntaxes, but it provides a lot of power to the user. Here's a simple example that creates a logical vector to indicate whether the exam grade is under average:

# Create a logical vector.
is_under_average <- exam < 10
# Get grades under average.
exam[is_under_average]
[1] 7 8
# Get grades above average.
exam[!is_under_average]
[1] 13 19 12

The magic here is that exam and is_under_average are two separate elements, although that brings marvel only to the eyes of those who are fascinated by vectorization: in practice, you will often end up binding the two objects together into a multidimensional vector like a matrix or a data frame, which we cover in the rest of this session.

Working with matrixes

Let's see what a different object looks like. If we want to organize some data into a “row by column” structure, we need to use a matrix–a table of numerical values, if you prefer. The as.matrix command will turn grades into a matrix.

# The exam object is not a matrix.
is.matrix(exam)
[1] FALSE
# If you make it into a matrix, the vector becomes a column of values.
as.matrix(exam)
     [,1]
[1,]    7
[2,]   13
[3,]   19
[4,]    8
[5,]   12

Let's keep that mind and create five more imaginary student essay grades that we are going to combine with the exam grades, using the cbind command to form a single matrix out of vector objects. The 'c' in cbind means that the combination is done 'vertically', by forming columns out of the vectors.

# Show length of grades vector.
length(exam)
[1] 5
# Create a random grades vector of same length.
essay <- as.integer(20 * runif(5))
# Check result.
essay
[1] 14 11 13 16  5
# Form a matrix.
grades <- cbind(exam, essay)
# Check result.
grades
     exam essay
[1,]    7    14
[2,]   13    11
[3,]   19    13
[4,]    8    16
[5,]   12     5

Let's finally compute the average for every student (rows), and read the average for exams and essays (columns).

# Compute student average.
final <- rowMeans(grades)
# Combine to grades matrix.
grades <- cbind(grades, final)
# Check result.
grades
     exam essay final
[1,]    7    14  10.5
[2,]   13    11  12.0
[3,]   19    13  16.0
[4,]    8    16  12.0
[5,]   12     5   8.5
# Compute mean exam and essay grades.
colMeans(grades)
 exam essay final 
 11.8  11.8  11.8 

You can further explore the matrix by inspecting its dimensions and providing names to its rows and columns. These operations are covered in Gaston Sanchez's cheat sheet of matrix operations, which you can add to your course bookmarks.

# How many rows?
nrow(grades)
[1] 5
# How many columns?
ncol(grades)
[1] 3
# Do they have names?
dimnames(grades)
[[1]]
NULL

[[2]]
[1] "exam"  "essay" "final"
# Create a student id sequence 1, 2, 3, ...
id <- c(1:nrow(grades))
# Check result.
id
[1] 1 2 3 4 5
# Assign it to row names.
rownames(grades) <- id
# Check result.
grades
  exam essay final
1    7    14  10.5
2   13    11  12.0
3   19    13  16.0
4    8    16  12.0
5   12     5   8.5

Let's now select elements from the matrix: as with vectors, you have to use square brackets, but since there two dimensions, you have to specify at least one of them. The syntax is “r x c” by convention: the first value specifies rows, the second columns. Understanding this system is crucial because it will guide many subsetting operations when we work with datasets.

# First row, second cell.
grades[1, 2]
[1] 14
# First row.
grades[1, ]
 exam essay final 
  7.0  14.0  10.5 
# First two rows.
grades[1:2, ]
  exam essay final
1    7    14  10.5
2   13    11  12.0
# Third column.
grades[, 3]
   1    2    3    4    5 
10.5 12.0 16.0 12.0  8.5 
# Descriptive statistics for final grades.
summary(grades[, 3])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    8.5    10.5    12.0    11.8    12.0    16.0 
# Descriptive statistics for all matrix columns.
summary(grades)
      exam          essay          final     
 Min.   : 7.0   Min.   : 5.0   Min.   : 8.5  
 1st Qu.: 8.0   1st Qu.:11.0   1st Qu.:10.5  
 Median :12.0   Median :13.0   Median :12.0  
 Mean   :11.8   Mean   :11.8   Mean   :11.8  
 3rd Qu.:13.0   3rd Qu.:14.0   3rd Qu.:12.0  
 Max.   :19.0   Max.   :16.0   Max.   :16.0  

Encoding vectors as factors

Let's anticipate on an issue that will come up often when we explore larger datasets. To illustrate the issue, let's code if the student passes or fails the test. The ifelse command works pretty much like the IF command in spreadsheet editors.

# Create a text object based on a logical condition.
pass <- ifelse(grades[, 3] >= 10, "Pass", "Fail")
# Check result.
pass
     1      2      3      4      5 
"Pass" "Pass" "Pass" "Pass" "Fail" 

The pass vector contains text items (strings). Try to combine it to the numeric grades matrix to see what happens, and find the issue with that operation (“hint”).

cbind(grades, pass)

To solve the issue, you are going to convert the text column to numeric values, while keeping the textual information as 'captions' or 'labels' for the numbers. R calls that operation factor encoding, and the 'labels' are called 'levels'.

# Understand what happens when you factor a string (text) variable.
factor(pass)
   1    2    3    4    5 
Pass Pass Pass Pass Fail 
Levels: Fail Pass
# The numeric matrix is preserved if 'pass' is added as a factor.
cbind(grades, factor(pass))
  exam essay final  
1    7    14  10.5 2
2   13    11  12.0 2
3   19    13  16.0 2
4    8    16  12.0 2
5   12     5   8.5 1
# Final operation. The '=' assignment gives a name to the column.
grades <- cbind(grades, pass = factor(pass))
# Marvel at your work.
grades
  exam essay final pass
1    7    14  10.5    2
2   13    11  12.0    2
3   19    13  16.0    2
4    8    16  12.0    2
5   12     5   8.5    1

Notice that factors are alphabetically ordered, and that this influences how they are ordered in graphs. In the simple example above, there is no particular issue with the fact that “1” stands for “Fail” and “2” for “Pass”, but if we want to order a list of countries in any other way than alphabetically, this is when we will be (re)ordering factors.

Next: Variables and factors.