R as a data science language

The R language is a ‘DSL’ – a domain-specific language. The domain that it deals with, however, is not well-defined. In this note, I call R a “data science language” and link to a few resources that make the point better than I could.

R as a domain-specific language

A few years ago, John D. Cook gave a presentation called “The R Language: The Good The Bad & The Ugly,” in which he made an excellent series of points about R as

a domain-specific language for data analysis.

The entire presentation is worth watching.

When I watched this presentation for the first time, I felt that it laid out all the important reasons for choosing R over other languages when learning to write data analysis code.

In hindsight, the only flaw that I find in John D. Cook’s presentation is that it defines too narrowly the domain that R covers as a domain-specific language. More specifically, Cook describes R as a statistical language, noting, for instance, that ‘books about R’ are generally books about statistics.

This characterisation of the R language was probably adequate a few years ago, and it was certainly adequate for the S language, from which R emerged. Today, however, there are good reasons to think that R has evolved into a language for more than statistics.

R as a data science language

I recently discovered the blog maintained by Joshua Ebner at Sharp Sight Labs, and as with the previous presentation, I got the feeling that the author was making all the important points about the R language.

Below is a selection of Sharp Sight Labs blog posts from the past two years, listed in reverse chronological order, each summarised by a key quote:

Why R is the best data science language to learn today

… I want to explain all of the reasons why I’m very optimistic about R’s long term prospects, and why I think it’s perhaps the best data science language to learn today.

Why you should master R (even if it might eventually become obsolete)

… data science is changing very fast, and any tool that you learn will eventually become obsolete.

How much data science do you actually remember?

… if you want to get a good data science job, you need to really know your stuff. You need to remember how to write the code, from memory, on command.

Stop trying to jump to the sexy stuff first

[Top performers] don’t demand to start with the cool stuff. Top performers diligently learn and master the foundations.

The real prerequisite for machine learning isn’t math, it’s data analysis

The reality is that in industry, data scientists just don’t do much higher level math.

But most data scientists do spend a huge amount of their time getting data, cleaning data, and exploring data. This applies both to data science generally, and machine learning specifically; and it particularly applies to beginners.

If you want to get started with machine learning, the real prerequisite skill that you need to learn is data analysis.

Why you should start by learning data visualization and manipulation

… learn data visualization first and then learn data manipulation.

How data visualizations are tools (and what you’re building with them)

… you should not approach learning data science from a software development point of view… You should start by learning how to find and deliver insights from data.

These blog posts do a brilliant job of explaining how to learn data science, and what to start with. The recommended language is R because its packages cover data collection, manipulation and visualization, machine learning, and statistics, all of which are building blocks of what one might call “data science” in the current context.

Why does any of that matter?

Why is it important to clarify the domain covered by the R language?

In the short term, doing so serves two purposes:

First, to avoid misunderstandings.

As John D. Cook explains, R is not a general-purpose programming language, but a domain-specific language that has escaped its initial niche to cover a larger range of interests. In my opinion, this renders many (but not all) comparisons between R and other languages essentially irrelevant.

Second, to put R on the skills map.

We need clear statements of what R is, and what it is useful for, to make it academically as well as professionally relevant well beyond statistical science, in the broader field of practice that is gradually being identified as “data science.”

In the longer term, knowing what R does should also help us understand what R is unlikely to become and, correspondingly, help the developer communities that maintain other programming languages or runtime environments to realise what they might achieve by interfacing with R, or by developing software inspired by it.

Update (January 5, 2017): for a somewhat different perspective, see this interview with Joe Cheng, published yesterday on the R Views blog. Key quote:

… people say that one of the differences between say Python or Julia and R is that R is a DSL for stats, whereas these other things are general purpose languages. R is not a DSL. It’s a language for writing DSLs, which is something that’s altogether more powerful.

Update (January 5, 2017): this other note lists additional technologies that are worth learning for data science.

First published on January 5th, 2017
