Introduction to Data Analysis

1.2. Installing RStudio

Working with R is easier when you run it through RStudio, a software interface that adds many useful features to it. Let's review some basics of both R and RStudio together, run our first R script—and produce our first plot—a random heatmap.

Installation

Use this link to download RStudio. Installation should then be pretty straightforward. You will be installing the Desktop version – the other one is for running RStudio on a server, while you will be running it as a desktop program.

RStudio comes with excellent documentation available through the Help menu. The same menu will also show you keyboard shortcuts, which will quickly come in handy. The sections below cover only the essentials, which should be enough to get you started.

A few more things about the RStudio software:

Interface

When you open RStudio for the first time, you see something like the screenshot below, composed of three quadrants. Each quadrant is called a “pane” of the RStudio window interface:

RStudio interface.

The red circle designates the Console window, where you get to type functions and see their results. The blue circle designates the Workspace, where your objects appear. The purple circle designates a bunch of other windows for files, plots, packages and help.

Running scripts

Most of the time, you will have one more pane open on the top-left side of the screen to view, edit and run R scripts. Press ⌘-⇧-N (Mac) or Ctrl-⇧-N (Win) to open a new R script. You can also use the File menu to select “New > R Script”, or use this icon:

RStudio new script.

After opening the new R script, copy-paste the entire code snippet below into it:

# Create a vector of 400 random values.
m <- runif(400)
# Print the first values of the vector.
head(m)
# Turn the vector into a 20 x 20 matrix.
m <- matrix(m, nrow = 20)
# Check the class of the matrix object.
class(m)
# Name the matrix rows with letters.
rownames(m) <- letters[1:20]
# Plot the matrix as a heatmap.
heatmap(m)

Finally, select the whole code and run it. You will see results get printed out (in black) in the Console, along with the code (in blue). The m object will also appear in the Workspace, and a random heatmap will appear in the bottom-right pane:

RStudio interface.

Setting the working directory

Let's now save the script. Click the “Files” tab in the bottom-right pane. Choose the folder where you store your student stuff, and create a new folder called IDA for this course. Then set it as the working directory through the More menu, as shown below:

RStudio working directory.

The working directory is the default location on your computer where R looks for stuff. RStudio copied that location to the Console when you completed the step described above, as if you had run the setwd() function, which sets the working directory to a given folder:

setwd("~/Documents/Teaching/IDA")

You can now save your script: focus on it by clicking anywhere in its window, press ⌘-S (Mac) or Ctrl-S (Win), and RStudio will offer to save it in your IDA folder. Save it there under a sensible name like heatmap-demo.R. R scripts should end with the .R extension.

Finally, set your course folder as the default working directory in RStudio's preferences. Open the preferences with ⌘-, (Mac) or from “Tools > Options”. The very first preference is the default working directory, which should be set to your IDA folder:

RStudio working directory.

The folder path, or location, printed above is one on my own (Mac OS X) system. Your computer will show a different one. RStudio also shows that path on top of the Console window. Tip: always opt for simple folder paths.

Last but not least: never move your IDA folder once you have set it up, or RStudio will not be able to find it. This will cause useless trouble, like running into cascades of errors because your scripts cannot find the course datasets.

Tab auto-completion

Now that the script is saved, take a minute to explore its code by navigating its lines with the keyboard arrow keys. Switch back and forth from the Source Editor (the script) to the Console (its results) by pressing Ctrl-1 and Ctrl-2 to change the cursor focus.

Your goal for the first month of class is actually to get as much practice with keyboard navigation and shortcuts as you can, so that you get used to running code from a script file while tinkering it from the Console.

A nifty feature of RStudio (and R) in that regard is tab auto-completion. Type in the first letters of a function and press Tab: possible function names will appear. The example below shows how to reach the install.packages() function through tab auto-completion:

RStudio auto-completion.

To get the auto-completion window to open, just type in a few letters like ins and then (gently) hit the Tab key. The auto-completion feature also shows you the basic function description and syntax, which you can also get in the documentation pages.

Packages

R is extremely modular: it comes with a set of “base” functions that can be supplemented with user-contributed ones, most often installed through the CRAN online central repository from which you previously downloaded R itself. This is a core feature of R, as its open source nature has encouraged the development of thousands of user-contributed packages.

Let's first take a look at which packages come with R by default and where packages are installed:

# See the default packages.
getOption("defaultPackages")
[1] "datasets"  "utils"     "grDevices" "graphics"  "stats"     "methods"  
# See where packages are stored.
.libPaths()
[1] "/Library/Frameworks/R.framework/Versions/3.0/Resources/library"

Remember that R is case sensitive: the functions above require that the O in getOption() and P in .libPaths() are typed in UPPERCASE. Also note the somewhat unusual trailing dot at the beginning of the .libPaths() function: if you omit to type . to call the .libPaths() function, R will return an error.

Let's install a few packages straight away. To do that quickly, your best option is simply to copy-paste the code below into a new R script, and to run it as you previously did with the random heatmap example.

# Create a list of essential packages.
list <- c("foreach", "knitr", "devtools", "ggplot2", "downloader", "reshape")
# Select those that are not installed.
list <- list[!list %in% installed.packages()[, 1]]
# Install the missing packages.
if(length(list))
  lapply(list, install.packages, repos = "http://cran.us.r-project.org")

Note that the last two lines form a logical statement and should be run together (i.e. consecutively) to work as intended. The result will be a lot of red ink: R is going to diagnose which of these packages you do not yet have installed, and then install them. You need to be online for this to work. Please ask us for help if you encounter any difficulties at that stage.

Once installed, the additional packages can be loaded (initialized) with the library() or require() functions. Each package brings its own functionalities: ggplot2 package is used for plotting neat graphics, downloader is used to download data from online sources. You will have to initialize the packages every time that you use their functions in R, as shown here:

# Load ggplot2 package.
library(ggplot2)
# Load downloader package.
require(downloader)
Loading required package: downloader

The “loading …” messages that you might get by running the functions above are not errors but simply a sign that other packages are loaded at the same time in the background. R prints errors, messages and warnings (non-fatal errors) in the same color, which is indeed confusing. This is one of R's idiosyncracies: you cannot trust the colors, as red ink does not always stand for “problem”.

A possibility is that you get no result at all, and this is behaviour is generalizable from installing packages to executing code overall: when you run a function, no result can be good news and mean that everything wnet alright. Reversely, getting a result does not mean that it systematically is the correct one (your function could contain a mistake, for instance).

Drawing and saving plots

One last functionality that we want to show as part of the RStudio interface has to do with plots. Let's take inspiration from a quick introduction to ggplot2, a graphics package that you should have previously installed when trying out package installation in R.

# In case you have not yet installed ggplot2, this will.
if (!require(ggplot2)) {
    # Install the ggplot2 package.
    install.packages("ggplot2")
    # Load the ggplot2 package.
    library(ggplot2)
}

We'll be looking at some example data included in the R software. The data describes Edgar Anderson's set of fifty flowers, studied by famous statistician Ronald Fisher in an article published in 1936. The data itself were collected by Edgar Anderson and published a year before. The head() function shows you the top rows of the data:

# Look at first rows of a default dataset.
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

[qplot]: http://docs.ggplot2.org/current/qplot.html qplot (ggplot2)

Let's now use the 'quick plot' functionality of the ggplot2 package. The [qplot() function][qplot] requires three arguments here: the data to use, one variable to plot on the horizontal x-axis, and another to plot on the vertical y-axis. The order of the arguments is, as usual, irrelevant, as long as you identify x and y properly in the function call.

# Quick plot.
qplot(data = iris, x = Sepal.Length, y = Petal.Length)

plot of chunk ggplot2-qplot

We will come back to how ggplot2 works in due time (here's another quick introduction showing more plots if you want a preview). For the moment, let's simply improve the plot by adding a bunch of options that will add some color, sizing and transparency effects to the plot that opened in the bottom-right pane. The next code block is just one long declaration to do that:

# Select all lines below to run properly.
qplot(data = iris, 
      x = Sepal.Length, 
      y = Petal.Length, 
      color = Species, 
      size  = Petal.Width, 
      alpha = I(0.7))

plot of chunk ggplot2-3

This plot can be saved to PNG (a bitmap format) or PDF (a vector format) from the Export menu. You will want to use PNG if you are saving the plot for screen viewing, e.g. posting on a Web page. Otherwise, you will prefer to export the plot as PDF, as shown below.

RStudio export plot.

RStudio will open a window to ask you for a few details, such as the filename for the exported graphic. The default saving location will be your working directory. After the plot is exported, it will open in your default PDF viewer, so that you can check everything went well.

Help pages

Type in help.start() to access the index of documentation pages in R. If you do not understand a specific function like qplot(), then type its name with a question mark in front of it, as in ?qplot. If that does not work, search with keywords by typing two question marks, as in ??heatmap, to perform a fuzzy search that will match documentation from all installed packages.

Help pages open in the bottom-right pane of RStudio. Try these examples out:

# Function help.
?setwd
# Keyword search.
??heatmap
# Package docs.
help(package = "downloader")

R documentation pages contain precise instructions on the syntax of each function: cover the “Description”, “Usage” and “Arguments” sections first, and turn to the “Details” section if the text sends you to it. Always check out the “Examples” section, which might (or might not, hence textbooks and tutorials) lead you to understand how the function works in context.

Reading technical documentation is a skill in itself, and not everything can be answered from there, so looking for help elsewhere is frequent practice. Note, however, that this is also a skill in itself and that there is an etiquette to getting help online: if you plan to use R-help or Stack Overflow to ask questions, turn first to Roger Peng, Jeromy Anglim or Eric Raymond.

Note, finally, that typing the name of a function without its brackets will also show some form of help, which is generally useful only if you know a bit more about low-level programming in R. Here's a simple example showing how the median() function works internally, from which you can learn that it allows an option called na.rm and belongs to the base stats package preinstalled in R:

median
function (x, na.rm = FALSE) 
UseMethod("median")
<bytecode: 0x7ff00938cee8>
<environment: namespace:stats>

Last, do not hate the documentation: everyone finds it difficult to read, and there are good reasons why the technical documentation is written how it is. It is not really meant to teach you R but rather to document it like a flight manual documents a plane: as fully and exactly as possible.

Next: Practice.