R as a data science language

The R language is a ‘DSL’ – a domain-specific language. The domain that it deals with, however, is not well-defined. In this note, I call R a “data science language” and link to a few resources that make the point better than I could.

R as a domain-specific language

A few years ago, John D. Cook gave a presentation called “The R Language: The Good The Bad & The Ugly,” in which he made an excellent series of points about R as

a domain-specific language for data analysis.

The entire presentation is worth watching. Below are a few highlights:

When I watched this presentation for the first time, I felt that it formulated all the important reasons for choosing R over another language to learn, and then to use for writing data analysis code.

In hindsight, the only flaw that I find in John D. Cook's presentation is that it restricts the domain that R intends to cover as a domain-specific language. More specifically, Cook describes R as a statistical language, noting, for instance, that ‘books about R’ are generally books about statistics.

This characterisation of the R language was probably adequate a few years ago, and it was certainly adequate to characterise the S language, from which R emerged. As of today, however, there are good reasons to consider that R has evolved into a language for more than statistics.

R as a data science language

I recently discovered the blog maintained by Joshua Ebner at Sharp Sight Labs, and as with the previous presentation, I got the feeling that the author was making all the important points about the R language.

Below is a selection of Sharp Sight Labs blog posts from the past two years, listed in reverse chronological order, each summarised by an important quote:

These blog posts do a brilliant job of explaining how to learn data science, and what to start with. The recommended language is R because its packages cover data collection, manipulation and visualization, machine learning, and statistics, all of which are building blocks of what one might want to call “data science” in the current context.

Why does any of that matter?

Why is it important to clarify the domain covered by the R language?

In the short term, doing so serves two purposes:

  1. First, to avoid misunderstandings.

    As John D. Cook explains, R is not a general programming language, but a domain-specific language that has escaped its initial niche to cover a larger range of interests. In my opinion, this renders many (but not all) comparisons between R and other languages essentially irrelevant.

  2. Second, to put R on the skills map.

    We need clear statements of what R is, and what it is useful for, to make it academically as well as professionally relevant well beyond the scope of statistical science, in the wider field of practice that is gradually being identified as “data science.”

In the longer term, knowing what R does should also help us understand what R is unlikely to become in the future. It should likewise help the developer communities that maintain other programming languages or runtime environments realise what they might achieve by interfacing with R, or by developing software inspired by it.


Update (January 5, 2017): for a somewhat different perspective, see this interview with Joe Cheng, published yesterday on the R Views blog. Key quote:

… people say that one of the differences between say Python or Julia and R is that R is a DSL for stats, whereas these other things are general purpose languages. R is not a DSL. It’s a language for writing DSLs, which is something that’s altogether more powerful.

Update (January 5, 2017): this other note cites additional technologies that are interesting to learn for data science.

  • First published on January 5th, 2017