Introduction to Data Analysis

3.2. Iteration

This section briefly reviews iteration, which is very important from a programming perspective as an element of control flow, and also important from the viewpoint of data analysis because it fuels some useful procedures for data crunching, such as parallel computing for intensive computing jobs on very large datasets, or bootstrapping a statistical procedure.

Loops

R understands loops like many other programming languages. The idea is the same for both operators: allow to repeat an operation a certain number of times. The first example is as trivial as it gets: the loop creates one line of output for each value of \(n = (1, 2, 3)\) passed through i. The final element of the string concatenation, \n, is a line break.

# Set the counter.
n = 3
# Loop n times.
for (i in 1:n) {
    cat("Sheep #", i, "\n")
}
Sheep # 1 
Sheep # 2 
Sheep # 3 

This example can be easily replaced by a one-line approach that goes straight to applying the function to the three elements of the vector i, 1, 2, 3. This is possible because the paste() function is vectorized. We are introducing this idea straight away because it will replace our apparent need for looping in many circumstances.

# Vectorized approach.
paste("Sheep #", 1:3)
[1] "Sheep # 1" "Sheep # 2" "Sheep # 3"

A while() loop operates almost on the same level as a for() loop: it iterates until a condition becomes false, or, if you prefer, while a condition is true. Be careful: if the condition never becomes false, your loop will iterate indefinitely. This approach is used only under uncertainty conditions that we can safely skip here.

# Set the counter.
i <- 5
# Decrease i by an increment of 1 until it reaches 0.
while (i > 0) {
    print(i <- i - 1)
}
[1] 4
[1] 3
[1] 2
[1] 1
[1] 0

A more common situation of uncertainty is when you are looking for the solution of a “find-a-number” game, which you might be able to solve by trying out all possible numbers through a loop. This is the 'brute-force' approach, which might be less entertaining than trying to solve the game in the first place.

Loops proceed by sequential exhaustion (“do that to \(x\)”, “do it again to \(x+1\)”, “do it again…”) and can become computationally intensive when they iterate through large arrays of values. At that stage, your code might require optimization, and you might need to use several computer processors to parallelize the execution of the loop.

Vectorization

Vectorization is a way to replace (most) loops with a more efficient way to handle the data. R contains a lot of vectorized functions that will run significantly quicker than memory-intensive loops of code. The basic idea, as outlined above, is to apply a function to a vector of observations, rather than telling R to separately apply the function to each item:

# The iterative way: square i, square i + 1, square i + 2, square i + 3.
for (i in 1:3) print(i^2)
[1] 1
[1] 4
[1] 9
# The vectorized way: define i = { 1, 2, 3 } and square each i element.
(1:3)^2
[1] 1 4 9

The group of observations in this example is the vector 1, 2, 3. The square function works on vectors, which is why you get a vector of results in return. Many functions accept vectors as arguments, which makes it possible to compute things efficiently, as in this example of an explosive roommate combinations function, coded by Francis Smart:

# Number of possible conflicts in a pool of two roommates (single
# combination).
prod(1:2)/prod(1:2)
[1] 1
# With three roommates, there are three potential combinations (AB, BC and
# AC).
prod(1:3)/(prod(1:2) * prod(1:1))
[1] 3
# With four roommates, there are six potential combinations (AB, BC, CD,
# ...).
prod(1:4)/(prod(1:2) * prod(1:2))
[1] 6

This function looks at the number of potential conflicts that might arise when “each roommate pair has the possibility of severely bothering another roommate”. The code calculates that value from the combination equation, \(\frac{n!}{k!(n-k)!}\), where \(n\) is the number of roommates and \(k = 2\), the number of people in a conflict. The function can be calculated for an increasing number of roommates.

To plot part of the function, we can create a vector of values and apply the explosive roommates combination to it. This is done below by writing up the function into a sapply() function, the vectorized equivalent of a for loop. The sapply() function goes through the vector of values x and returns a vector of results y after applying the explosive_roommates function:

# Explosive roommates function.
explosive_roommates = function(n) prod(1:n) / (prod(1:2) * prod(1:max(n - 2, 1)))
# Vector of x values.
x = 2:20
# Vector of y values.
y = sapply(x, explosive_roommates)
# Plot y against x.
qplot(x, y, geom = c("line", "point")) +
  labs(y = "Number of potential conflicts", x = "Number of roommates")

plot of chunk fx-expl-plot-auto

Next: Practice.