Introduction to Data Analysis

1.1. Installing R

R is an increasingly popular piece of software that uses its own language to manipulate statistical objects. Its user base extends well beyond academia, and R User Groups exist around the world. R users form a large online community, and many collaborative tools help analysts share their R work.

Some people really like R. The very enthusiastic R user below is Anthony Damico, a prolific and proficient analyst of U.S. survey data. His short video is a good introduction to the spirit of R; it also mentions other programming and statistical languages:

The R language has its roots in the S language developed by AT&T, which also developed the C language. It is not the only domain-specific language available for statistical analysis: there are many others, and you might also have heard of software like SAS, SPSS or Stata, or of statistics being done with mathematical software and general-purpose programming languages like Java or Python.

R itself is free and open software, as are most of its extensions. It is as free as it is powerful (which currently makes it a “hot” piece of software), but it comes with some drawbacks. Its major disadvantage is that its learning curve is pretty steep. Do not worry: we will review as many examples as we need to make it work, and we will give you code to help you create your own.

Installation

Now, download and install the latest version of R. You will need admin privileges on your computer (i.e. the login and password of an administrator account) to do so. The software costs nothing, is available for all common platforms, and does not take a lot of disk space to install. The exact download link depends on your operating system and geographic location:

Do not worry if you do not manage to install R (or RStudio, the next software that you will have to install) on your own. Just download the program to your hard drive, and we will quickly guide you through installation in class. Note in passing that this course never really penalizes you for failing at something, only for not trying seriously enough in the first place.

Commands

Open R and locate the blinking cursor preceded by a > at the bottom of the R Console window. This is where you will type commands and read their results. You might already be familiar with that process: the logic of using a Command Line Interface (CLI) is vaguely similar to asking elaborate queries in a versatile search engine like DuckDuckGo.

Let's run a few commands in the R console: type each of the following lines and press Enter at the end of the line to execute, or “run”, it. You can skip the lines that start with a # (hash) symbol and appear in a different color than the rest of the code: these lines are comments, which R will ignore if you try to run them.

# A string of characters.
"Hello R World!"
# A function that returns a string.
date()
# A numeric result.
1 + 2
# A logical statement.
1/(2 + 3) == 0.2
# A vector of integers.
1:3
# A function that returns a matrix.
as.matrix(1:3)

If you get an error message in red at any point, it most likely means that you have mistyped something: press ↑ (UpArrow) to go back to your last command, check your typing against the original shown here, correct it, and press Enter to try it again. This is not a trick: it is a routine feature of programming environments, and you will have to do this more than once!
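To see what such an error looks like without breaking anything, here is a small sketch. The name daet() is a deliberate, made-up typo for date(), and tryCatch() and parse() are used here only to capture the errors as text; you will not need them in everyday use.

```r
# daet() is a deliberate typo for date(); tryCatch() keeps the error
# message as text instead of stopping.
msg <- tryCatch(daet(), error = function(e) conditionMessage(e))
msg
# An unclosed bracket is a genuine syntax error; parse() only reads the
# text without running it.
bad <- tryCatch(parse(text = "1 + (2"), error = function(e) "syntax error")
bad
# [1] "syntax error"
# The corrected command runs normally.
date()
```

The exact wording of the first message depends on your language settings, but it will mention the name that R could not find.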

The next sections explain a few more things about objects and assignment. We will come back to brackets and commas later, when we study functions, which is the name that we will give to R commands hereinafter (even though this is not a functional programming course, we will try to stick to formal terminology).

Syntax

You might have run into errors in the example above if you typed anything other than the code provided. That is because R, just like every other programming language, requires that you follow a precise syntax. Any familiarity with mathematical notation, and especially matrix notation, will help you at that stage, but everyone has to go through some learning curve to get R syntax right.

You might have run into these errors in particular:

R is case-sensitive. Try the example below to see how changing the case affects a given input:

# A vector of lowercase letters.
letters[1:5]
[1] "a" "b" "c" "d" "e"
# A vector of UPPERCASE letters.
LETTERS[1:5]
[1] "A" "B" "C" "D" "E"
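The short sketch below checks the same point more directly. It relies only on base R functions (identical(), toupper() and exists()); the misspelled name Letters is a hypothetical typo that we assume you have not created yourself.

```r
# letters and LETTERS are two different built-in objects.
identical(letters, LETTERS)
# [1] FALSE
# toupper() turns one into the other.
identical(toupper(letters), LETTERS)
# [1] TRUE
# A name typed with the wrong case is simply unknown to R.
exists("letters")
# [1] TRUE
exists("Letters")
# [1] FALSE
```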

One last element of R syntax that you will have to get familiar with is the use of punctuation. Brackets and commas, in particular, are used heavily in R. The examples below show some of their common uses. In the seq() function, from, to and by are named arguments, each assigned a value with the equal sign = and separated from the others by commas. Some arguments are optional.

# A sequence of integers.
1:3
[1] 1 2 3
# The same result.
seq(1, 3)
[1] 1 2 3
# A sequence of floating point numbers.
seq(from = 1, to = 3, by = 0.5)
[1] 1.0 1.5 2.0 2.5 3.0
# A function with an optional logical argument.
order(1:3, decreasing = TRUE)
[1] 3 2 1
# The same result.
rev(1:3)
[1] 3 2 1
# The order function in its default behaviour.
i <- sample(5)
j <- order(i)
list(i, j)
[[1]]
[1] 2 4 5 3 1

[[2]]
[1] 5 1 4 2 3
# Using square brackets for vector notation.
i[order(i)]
[1] 1 2 3 4 5
# The sort function for character strings.
p <- "we come in peace"
p <- strsplit(p, " ")
p <- unlist(p)
sort(p)
[1] "come"  "in"    "peace" "we"   
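The seq() calls above can be written in several equivalent ways, which is worth trying once. The sketch below assumes nothing beyond base R; the object names a, b and d are arbitrary.

```r
# Named arguments, as in the example above.
a <- seq(from = 1, to = 3, by = 0.5)
# Unnamed arguments are matched by position: from, to, by.
b <- seq(1, 3, 0.5)
# Named arguments can be given in any order.
d <- seq(by = 0.5, to = 3, from = 1)
# All three calls return the same vector.
identical(a, b)
# [1] TRUE
identical(a, d)
# [1] TRUE
# Leaving out the optional argument by gives steps of 1.
seq(from = 1, to = 3)
# [1] 1 2 3
```

Naming the arguments takes more typing, but it makes the code easier to read back later.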

Whitespace between the elements of a command is mostly ignored by R, so leaving a space after a comma, for example, is not important for successful execution. Whitespace inside an operator is another matter: x < -1 is a comparison, not the assignment x <- 1. Forgetting a comma or a closing bracket will result in a syntax error, and you will inevitably spend some time “debugging” your code by removing typos and other inadequacies.
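A quick way to convince yourself is to run the same call with and without spaces. The sketch below also shows one case where a stray space does change the meaning, namely a space typed inside the assignment arrow; the object name n is arbitrary.

```r
n <- 5
# Spaces around operators and after commas make no difference.
identical(seq(1,3,by=0.5), seq(1, 3, by = 0.5))
# [1] TRUE
# A space inside the arrow changes the meaning: this line compares
# n with -1 instead of assigning 1 to n.
n < -1
# [1] FALSE
# n still holds its original value.
n
# [1] 5
```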

Assignment

Type the lines below in their order of appearance. The code block formed by these commands can also be copy-pasted into R, but we will show you a more robust way to run code later on. The commands will not produce any visible result, which is normal: just carry on, and get used to the fact that in programming, successful operations do not always end with visible output.

# Create an object called x.
x <- "Hello"
# Create an object called y.
y <- "World"

These examples show you how to assign a value to an object in R. The basic operator <- assigns the values "Hello" and "World" to the objects x and y respectively. If you are uncomfortable with using the <- symbol, you can type = instead: the two symbols are (almost) equivalent in R. The difference is that = is also used to name arguments inside function calls, whereas <- always means assignment, which makes it the less ambiguous choice.
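The “almost” matters inside function calls. The sketch below uses paste() and its sep argument, neither of which appears earlier in this section, to illustrate the difference; the object names a and b are arbitrary, and ls() simply lists the objects that you have created.

```r
# At the top level, both operators create an object.
a <- "Hello"
b = "World"
c(a, b)
# [1] "Hello" "World"
# Inside a function call, = only names an argument: no object called
# sep is created.
paste(a, b, sep = "-")
# [1] "Hello-World"
"sep" %in% ls()
# [1] FALSE
# With <-, the call creates an object called sep and passes "-" to
# paste() as a third string.
paste(a, b, sep <- "-")
# [1] "Hello World -"
"sep" %in% ls()
# [1] TRUE
```

This is one reason to keep <- for assignment and = for function arguments.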

We will now bind these objects together into the object z with the c() function:

# Combine x and y into a vector called z.
z <- c(x, y)

The result can be shown with the print() function, or just by typing the name of the object.

# Print the object z on screen.
print(z)
[1] "Hello" "World"
# Just type its name to do the same.
z
[1] "Hello" "World"

Note that the order of execution was crucial to the last examples, because the z object did not exist before you created it. Generally speaking, code is executed line by line in a set order (hence the line numbers shown in most code editors), just like you would read text or music notation.
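You can check this for yourself with an object that does not exist yet. In the sketch below, greeting is a hypothetical name that we assume you have not used in your session, and tryCatch() is only there to turn the error into readable text.

```r
# The object greeting has not been created yet (assuming a fresh session).
"greeting" %in% ls()
# [1] FALSE
# Asking for it too early is an error, captured here as text.
tryCatch(greeting, error = function(e) "not created yet")
# [1] "not created yet"
# Once the assignment has run, the same name works.
greeting <- c("Hello", "World")
greeting
# [1] "Hello" "World"
```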

All in all, R syntax is very much like German or Latin syntax: a bit counter-intuitive at first, but highly logical in nature. It takes a lot of practice to feel at ease with it, but it will rarely fail you. Just work through every example in the next pages to learn it step by step, and execute all code blocks in order so that you get the appropriate results.

Exit

You can quit R like any other application, or by typing q() from the command line interface. You might be asked whether you want to save the R work session that you started by creating the x, y and z objects in your environment. You do not need to do so here, and nothing dramatic will happen if you do (R will just save your work as a small hidden file, named .RData, in your working directory).
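Before quitting, you can look at what R would offer to save. The sketch below recreates the three objects from this section and lists them with ls(); the q() line is commented out so that the block does not close your session by accident.

```r
# Recreate the objects from the Assignment section.
x <- "Hello"
y <- "World"
z <- c(x, y)
# ls() lists the objects in your environment, i.e. what R would save.
ls()
# To quit without being asked about saving, uncomment the next line.
# q(save = "no")
```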

R can do a lot of things, including waiting for you to make coffee. A big drawback of R, however, is its barebones interface. We will fix that by installing the RStudio software to “pilot” R through better menus and windows. Turn to the next section for instructions and for a quick guide to two of R's main strengths: user-contributed packages and elegant plotting facilities.

Next: Installing RStudio.