Introduction to Data Analysis

Introduction

This is an introductory-level course on data analysis. We think that you will like this course if you enjoy facts, graphics and being able to make things by yourself. You will also need to enjoy computing and statistics to some extent. Do not be afraid, neither bite.

How we got there

We had four different ideas before we launched the course in its current form.

Our first idea was to offer an optional course that would pick you up where your first-semester mathematics and economics courses left you. Since both courses brushed upon linear models as part of their programme on empirical methods, we thought: “Let's write an introductory stats course that builds around linear models, with some background and extensions.”

(Linear models are a class of statistical techniques usually referred to as “linear regression”. We come back to linear models during the course.)

Our second idea was that this course should function as a workshop where you would learn to produce some of the analysis that you saw in slides and handbooks during your first term. Since the course is new, we thought: “Let's write an introductory stats course using a free software that students can keep using their whole lives.”

At that stage, we decided to write the course with R, a software that we will present at more length in a later section. R is taught at several universities around the world.

Our third idea derived from the complexity of R, which is less a “stats software” than a statistical computing software. You can reach pretty deep into programming with R, and this shows up in the way that you need to operate it to make things happen. Since this could also appeal to some, we thought: “Let's write an introductory stat computing course.”

(Statistical computing means using your computer to analyze data. Here's what it looks like for people your age who study statistics at Carnegie Mellon University.)

The fourth idea came from the quick realisation that we were riding several horses at once: statistics, computing, visualization and social science. So we thought: “Let's write a (whatever) course with R, and give it a name later on.” That name ended up being “data analysis”.

You'll see that the course still tastes a bit like all of the ideas above: there's a bit of math, lots of stats, tons of computing, and, we hope, even more graphs. But fundamentally, we want to make it a course about data analysis, and have you learn a set of skills in this area.

So, …

What is data analysis?

Here's a video by Jeff Leek, an assistant professor in the Biostatistics Department of the Johns Hopkins Bloomberg School of Public Health. The video explains how data analysis relates to other study topics.

Jeff Leek blogs at Simply Statistics, where his fellow biostatistician Rafael Irizarry has recently explained why statistics is not mathematics, On that topic, see also the links posted by Jerzy Wieczorek, a statistician working at the U.S. Census Bureau.

Here's another 20-minute video by Hans Rosling that will teach you some amazing vital statistics, and also explain how “liberating data” matters for development and organizations like the United Nations, the World Bank, NGOs and national or local governments:

Hans Rosling and his son developed Gapminder, a fantastic tool to animate data. Rosling is the main protagonist in the BBC documentary The Joy of Stats. He says: “I kid you not, statistics is now the sexiest subject on the planet”. The New York Times agrees.

And last, here's an hour-long video of Andrew Gelman, a professor of statistics and political science at Columbia University, talking at Google about his book Rich State Blue State: Why Americans Vote the Way They Do from 2009:

These videos give you a taste of how data, statistics and visualization are affecting the teaching and use of these tools in academia, but also in many other professions. These developments are relevant to all social sciences, whether it be public health, political science, economics…

Finally, this course is heavily influenced by the development of data visualization. To get a taste of what “dataviz” is, see this selection of methods, read some articles from Robert Kosara's blog on computational graphics, and see Simon Jackman's neat graphs of Barack Obama's electoral victory in 2012 for a brilliant illustration.

Who is this for

This course aims at showing how data analysis and visualization works in practice, and how it feeds into current trends in data journalism, open government and reproducible science.

Within that general perspective, we pursue the following teaching goals:

  1. The primary aim of the course is to provide you with some skills to understand how statistical computing can contribute to the analysis and visualization of real-world data.
  2. The secondary aim of the course is to show you how a healthy dose of programming can help taking into account large amounts of evidence about the economy, politics and society.
  3. Third and last, this course intends to provide you with a practical skill that will make you a strong applicant at universities that require quantitative methods training.

The skills outlined above will enhance your substantive expertise in the social sciences with computer and quantitative skills. This diagram by Drew Conway shows how all these skills collide around data analysis, or “data science”:

Drew Conway's Data Science Venn Diagram (Conway-Data_Science_VD)

You will find the course challenging because you will learn a lot of things at the same time, including some computer programming, a topic that is probably new to you. We assume zero previous knowledge and will proceed slowly, in a step-by-step fashion.

Course outline

The breakup of our twelve sessions together goes roughly as follows:

  1. The course starts with the R software basics (Sessions 1–4). We'll cover R syntax, objects and data operators, plus a bit of math.
  2. The course continues with a bunch of statistical methods (Sessions 5–8). We'll cover clusters, distributions, hypothesis tests and linear models.
  3. The course ends on slightly more advanced visualization techniques (Sessions 9–12). We'll cover time series, networks and maps.

The class will conclude on the professional democratization of data science, possibly with one or more guest speakers who use data in their daily work routines.

The list of topics is tentative at best, and all course sessions are still in the works.

Course requirements

The main requirements for the course are as follows:

Last thing: start all your email subjects with “IDA:” when emailing the instructors, and attach the code and data that correspond to your issue, so that we can recreate it on our end.

Disclosure

This course is a teaching experiment: we really, really need your feedback to make it work. Specifically, we'd like you to tell us what you want to do with the skills that we will teach you, so that we can help you get there.

We'll suggest exercises like drawing plots to illustrate your student newspaper, or building small-scale datasets out of student networks. But we are very open to all suggestions: tell us what you want to achieve, and we'll do our best to make it happen if it's reasonable.

We hope to present the results of this course at the R 2013 conference, in order to report on how well (or how badly!) the experiment ran. Your identity and grades will naturally not be disclosed, but we will take notes on your feedback during class and use it to write our report.

Next: Readings.