Around two years ago, I was asked to teach R again at Sciences Po, in Paris, in a spirit close to the Stata-based course that I have been teaching there for over ten years.
I first taught R to social scientists in 2013, but had not repeated the experience since then, except through various short and often focused workshops. I almost got to teach such a course in 2017, just as RStudio Desktop was turning 1.0, but that course failed to materialize.
Many things have changed since 2013, and there is now much higher demand to teach R (and RStudio) to social science audiences. R and RStudio have improved a lot, and the tidyverse, which recently turned 2.0 while still changing a lot, has become a core component of most courses, including mine.
My own attempt to teach R, RStudio and the tidyverse in 2023 has been online for a few months, in the form of a GitHub repository with a few wiki pages, including a long list of readings, videos and Web links, and another list of other R courses.
I have also uploaded a tentative syllabus for the course:
The course has only run once so far, and there are many issues with it that I will try to fix in the coming months. The repository is also missing some essential course items (the slides and the solutions to the exercises), which I am nonetheless happy to share privately by email.
A cool aspect of the course is that another instructor, Kim Antunez, will be teaching her own fork of it in the next few weeks. Kim has invested a lot into turning the course into a full-fledged Quarto website, which I will share in a follow-up post once she is done building it.
My own way of teaching the course is more old-school, as I rely on weekly emails and a shared Google Drive folder. I will, however, put some effort in improving the slides and giving the course a Web page, in order to make it more fully and easily accessible online.
I feel that I already have enough material to assemble a more advanced R course for social scientists, but first need to streamline this introductory course a bit more, in order to make the reading list, especially, a bit more focused and manageable.
I also feel that there will soon be more changes to the tidyverse that I will have to take into account. I still, for instance, use the %>% pipe to chain operations, whereas the current trend is to use the native |> pipe, introduced in R 4.1.0, whenever possible.
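As a quick illustration (a minimal sketch of my own, using dplyr and the built-in mtcars data), the two pipes read almost identically in simple chains:

```r
library(dplyr)

# magrittr pipe, re-exported by the tidyverse packages
mtcars %>%
  filter(cyl == 4) %>%
  summarise(mean_mpg = mean(mpg))

# native pipe, available since R 4.1.0
mtcars |>
  filter(cyl == 4) |>
  summarise(mean_mpg = mean(mpg))
```

For chains like this one, the two pipes are interchangeable; they differ mainly in their placeholder syntax (`.` for magrittr, `_` for the native pipe since R 4.2.0) and in a few edge cases around calls without parentheses.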
This note is tangentially related to my previous notes on teaching with RStudio, on R as a data science language and on other technologies for data science.
I took some kind of a break from R over the past 18 months. I plan to change that this coming year, and have compiled the following list of things that I want to explore, or come back to.
While I have fully transitioned towards “tidy data” and its wonderful packages, I am still not the type of R user who works in R Markdown documents (notebooks) like David Robinson so brilliantly illustrates in his videos.
I hope to get there next year, because it makes sense from a reproducibility perspective.
Most of my quantitative methods courses involve either basic frequentist regression models, or models used on panel data by political scientists, who have imported a lot of their modelling standards from econometrics and political economy. The models and replication material that come with published articles are mostly coded in Stata.
One thing that I need to do this coming year is to check how easy (or difficult) it is, in 2021, to replicate those Stata-coded panel data models in R. This will involve looking mostly at cross-section time-series (CSTS) data, usually measured at the country level, and fitting some of the regression models available in Stata.
I have been wanting to upgrade my Web scraping skills for some time. My initial plan was to learn enough Python to switch to that language (and its excellent libraries) for Web scraping, but I keep coming back to R, for lack of proper learning time and of professional incentives to code in Python.
The specific thing that I will be coming back to is headless browsing.
Network models, and exponential random graph models (ERGMs) in particular, have improved a lot in the past few years. I have followed the literature at a distance, and need to dive into it again, especially for the part that focuses either on generalizing ERGMs beyond binary responses, or on taking time (temporal dependence) into account.
I received my copy of Gelman, Hill and Vehtari's Regression and Other Stories this summer, and do not want that book to end up with Harrell's Regression Modeling Strategies and McElreath's Statistical Rethinking on my list of books that I want to read, but might never end up doing so. (The list is much longer than that, and also has everything by Hastie and Tibshirani on it.)
Let's call this to-do item my annual attempt at transitioning further towards Bayesianism, which is made easy in R thanks to the rstanarm package, to Bürkner's brms package, to Harrell's rmsb package, and to McElreath's rethinking package.
The tidymodels framework and Julia Silge's videos offer a nice invitation to dive deeper into those things that many of us explored when "machine learning" was the keyword that any aspiring methodologist (or data scientist, or else) had to know something about.
Going back to learning about machine learning is something that I look forward to, and which I plan to do while looking again at Cosma Shalizi's course on data mining from 2019.
The to-do list above will have to find its place alongside coding in Stata for one of my oldest courses, plus reading about many other things that have little to do with R or statistics. Let's see if that will happen.
Update (December 14, 2020): corrected the authors of the brms package, with thanks to Dieter Menne, who spotted and reported the error.
This note explains how the sample() function has changed since R 3.6.0, and how to reproduce its previous behaviour.
A recent blog post by Christian Robert reminded me that R had to fix its sample() function in R 3.6.0 and above.
The issue that used to affect the pseudo-random number generator (PRNG) at the core of the function is documented in a note by Kellie Ottoboni and Philip B. Stark, “Random problems with R,” which was extensively discussed on the R-devel mailing-list.
The note explains that (part of) the PRNG used by R 3.5.1 does not correct for the uneven spacing of binary floating-point numbers. The resulting quantization error produces biased selection probabilities, severe enough that the generator no longer qualifies as sufficiently pseudo-random.
The issue was patched in R 3.6.0, which also introduced a method for reproducing the former behaviour of the PRNG. The method is well documented on R blogs like Revolution Analytics or J. Kenneth Tay's Statistical Odds & Ends, and consists of adjusting the RNGkind option before calling the sample() function:
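Concretely, the adjustment looks like this (R 3.6.0 and later will emit a warning, since the "Rounding" method is the one known to be biased):

```r
# Switch back to the pre-3.6.0 sampling method to reproduce old results
RNGkind(sample.kind = "Rounding")
set.seed(42)
sample(5)

# Revert to the current, unbiased default
RNGkind(sample.kind = "Rejection")
```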
This is of course only useful if one depends on a particular random number generator and seed number, as set through set.seed(), to reproduce the behaviour and results of R code written and executed before R 3.6.0.
At that stage, you might be tempted to add generating random numbers to the short list of hard things in computer science—which would be, in my view, entirely correct.
This note is obviously of the ‘note to self’ kind. The previous one in that category (successfully) aimed at reminding me to use the RDS format.
Update (August 14, 2019): to change the local state of the PRNG without affecting its global state, see this note by Evgeni Chasnovski.
Update (August 14, 2019): thanks to R Weekly for mentioning this note.
The data include 9 conferences, plus a forthcoming one. For each conference, the number of sponsors varies between 8 and 46, and the number of organising sponsors varies between 1 and 4. The dataset has 201 sponsors in total: make sure to check the codebook and coding notes for details.
Listing the most frequent sponsors is fairly instructive, as it shows the mix of public (academic or governmental) and private (commercial) institutions that make R conferences possible:
Sponsor | Sector | Conferences
SFdS | nonprofit | 9
Danone | private | 5
INRA | public | 5
AMIES | public | 4
Capionis | private | 4
CNRS | public | 4
CRC Press | private | 4
GDR Stat Santé | public | 4
IA | public | 4
Lysis | private | 4
The private-sector sponsors, in particular, are interesting: the full sample includes mostly statistical consulting and training small businesses, as well as a few academic publishers and some industrial research units, including some large companies working on the life sciences (biology, health and nutrition), as well as energy and transport.
If one looks at the broader spectrum of scientific disciplines and domains covered by the sponsors, (applied) mathematics, statistics and computer science are, of course, well represented, as are the aforementioned life sciences, with a special mention to INRA and agricultural sciences.
Looking at this data, I would recommend that anyone who wants to spend their working life doing statistical computing or programming (with R) either study mathematics or statistics directly, through a degree in those fields, or go into biology and study lots of biostatistics, bioinformatics and computational biology.
The types of occupations associated with R come, unsurprisingly, from three main domains: higher education and scientific research, software development, and statistical consulting and training. If you like to compute stuff and/or to teach how to do it, then R is clearly made for you!
Note that, while underrepresented among the sponsors and organising committees of the conferences listed in the data, R is also widespread in the social sciences, where it tends to gradually replace other statistical software like SPSS or Stata.
Also note that the data cover only (French) R conferences and therefore exclude many other relevant ‘computational science’ conferences. As a consequence, physics and other domains are underrepresented by construction, which would likely not occur if one repeated the same exercise on conferences like SciPy or SIGGRAPH.
My Twitter feed is currently offering me lots of links to slides that are being presented at useR! 2019 in Toulouse. The hashtag is #useR2019, and you can get most of the material through GitHub, where someone has, as is usual during such conferences, started to compile all the material: thank you, Suthira Owlarn.
So far, my favourite find in the material being presented is Dmytro Perepolkin's talk on, and package for, polite scraping. I also plan to take a look at Timothée Giraud's cartography (thematic maps), Anqi Fu and Balasubramanian Narasimhan's CVXR (convex optimization), and Dianne Cook's tutorial on visualising high-dimensional data.
As far as I know, this is the second time that the useR! conference happens in France: the first time was ten years ago, in Rennes. The past events Web page of the useR! 2019 website gives a few more clues about other R conferences in France:
… most members of the organizing committee were previously involved in the organization of the Journées Françaises de Statistique in 2013 and in the French R meeting in 2016.
The Journées Françaises de Statistique is an event held by the Société Française de Statistique (SFdS), which has been sponsoring French R conferences for a long time, as have the Société Française de Biométrie (SFB) and several research organizations involved in disciplines including mathematics, computer science, agriculture and ecology.
The “French R meeting” mentioned in the quote above is called Rencontres R and has been happening since 2012. Since there does not seem to be a public listing of all its editions, here is my own index of their websites:
There are tons of local French R meetings; one that I remember vividly from a few years ago was called FLτR, and was attended and organised by lots of people from Insee, the French official statistics agency.
As of today, the only local groups that I keep an eye on are the R Addicts Paris Meetup group, the Semin-R conference and the RUSS (R à l’Usage des Sciences Sociales) seminar, all of which are located in Paris.
You will find many more links to French R conferences and groups on the frrrenchies Web page maintained by Paul-Antoine Chevalier and others.
The page lists many useful help resources for French speakers, such as the r-grrr Slack channel, but its most important section, to me, is the part where it lists R packages with specific relevance to users working on French (administrative, geographic, etc.) data.
You might also be interested in my note on French R conference sponsors.
Update (July 16, 2019): thanks as always to R Weekly for mentioning this note.
For background, Stata is the first statistical software that I learnt in depth. I have since transitioned to R but still use Stata for teaching purposes, to check some models that I know how to code in both languages, or to replicate papers that use it, which is common in disciplines heavily influenced by econometrics like political science.
The irony of ‘rebooting’ my “R / Notes” blog with a note on Stata is not lost on me. Still, my intention here is to compare some recent aspects of the R and Stata languages, as both contain some information about the state of applied statistics in academia.
StataCorp releases a new version every two years, and its “what’s new” pages are often highly instructive with regard to what applied statisticians are using in the discipline cited above, as well as a few others like epidemiology.
Stata 15 was released in 2017. It introduced, among other things, wide support for Bayesian estimation, minimal dynamic reporting with Markdown and Microsoft Word, as well as other much-needed features like transparency in graphics and panel data cointegration tests.
Some of those features might have been available earlier as Stata packages from the Statistical Software Components (SSC) archive, the Stata equivalent of CRAN. However, their implementation into the ‘core’ of the Stata language guarantees faster and wider adoption as well as stronger long-term support, which is something that StataCorp does very well, just as it excels at writing extensive and intelligible documentation.
Stata is both a DSL and an IDE, and some of the least interesting but most important changes in Stata 16 affect the latter. Stata is used almost exclusively by users who rely on its IDE rather than on its command-line version or on an external editor, hence the importance of getting it right.
It might surprise R users, who often also use shell terminals and source code editors, that Stata only just introduced language autocompletion to its do-file (script) editor windows. While the improvement does not put the Stata IDE on par with RStudio, it is still very, very much welcome.
Another GUI change that will affect Mac users is native tab support, which brings things like viewing datasets or documentation in Stata closer to how they work in RStudio.
Note that I am comparing the Stata IDE to RStudio and not to the ‘default’ R GUI, partly out of compassion for the latter and also because I have not used it – or seen anyone use it – in years, except perhaps for one remarkable R user, who might well have now switched to other IDEs like RStudio or Jupyter Python notebooks.
Another feature of Stata that might surprise R users is that Stata handles only one dataset at a time: virtually all Stata commands (functions) run by the user are applied to (or make use of) that dataset.
This makes “manipulating your data” (broadly defined) in Stata very easy, since once the data are loaded in memory, there is no need to assign further changes to the data object. However, it makes anything from merging two datasets to storing scalars and strings much more difficult.
For a long time, R had the reverse issue: storing different objects has always been simple in R, but repeatedly editing the same object used to be tiresome to code. This has changed, of course, with the introduction of the %>% forward-pipe operator.
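To sketch the difference (my own toy example, using dplyr verbs on the built-in mtcars data), the same sequence of edits can be written with repeated assignment or as a single pipeline:

```r
library(dplyr)

# Without the pipe: repeatedly overwrite the same object
d <- mtcars
d <- filter(d, am == 1)
d <- mutate(d, kml = mpg * 0.425)
d <- arrange(d, desc(kml))

# With the forward pipe: the same edits read as a single chain
d2 <- mtcars %>%
  filter(am == 1) %>%
  mutate(kml = mpg * 0.425) %>%
  arrange(desc(kml))

identical(d, d2)
```

Both routes apply the exact same verbs in the same order, so they produce the same data frame; the pipe simply removes the need to name the intermediate states.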
Stata 16 features a frame command that aims to let users manipulate multiple datasets, hopefully in ways less clunky than the tricks and workarounds that all Stata users have had to rely upon so far. I am personally only half-convinced by the implementation, but some users might benefit from it.
Stata now supports executing Python code in a very clean and unobtrusive way. The integration seems almost as seamless as what the reticulate package provides in R.
While Python is probably the third most-used language in the academic settings that I work in, it is the most used language in other branches of academia, as well as in the parts of the industry that have given its effective meaning to ‘data science’.
Python’s excellent machine learning, natural text processing and Web scraping libraries are certainly driving its integration in many other languages, and hopefully also the broader integration of machine learning into quantitative scientific research.
Every new version of Stata introduces new models. Stata 16 adds lasso regression and heteroskedastic ordered probits and, most interesting to me, panel-data mixed logit and more ways to work with panel/multilevel data.
While Stata offers a vast choice of statistical models, either through ‘core’ Stata or through user-submitted packages, it certainly does not match the diversity of models available in R. To take an example, estimating exponential random graph models in Stata requires wrapping around their R implementation.
However, generally speaking, the restricted diversity of statistical models offered in Stata has its upsides, one of them being that the standard way in which Stata ‘prints out’ models on screen is far more consistent than what happens across R packages.
Similarly, Stata is highly consistent in how it handles optional arguments to its model commands, e.g. to cluster standard errors. This, to me, is the kind of attention to the user that keeps people using the software, along with the quasi-magical margins command, which has been ported to R.
All of this affects me because, like many R users and like virtually every Stata user, I spend a lot of time working with statistical models, an area in which R has recently started to recover some consistency, thanks to the work of Max Kuhn, Alex Hayes and many others on ‘tidy models’.
In my view, Stata has never been very good at graphics: the Stata graph syntax is, like that of base R (and still in my view), rather verbose and inelegant. Graphics in Stata 16 are improving, but their syntax will almost certainly never outperform that of ggplot2, or even of lattice, on my (personal, opinionated) benchmarks.
Similarly, to me, exporting tabular results such as summary statistics or models has always been one of Stata's great weaknesses, compensated only in part by Ben Jann's invaluable estout package.
So while I mentioned earlier that I find Stata’s way of printing model results to the user truly excellent, I find almost every other ‘side effect’ in Stata inefficient.
Stata 16, however, seems to be getting closer to offering an acceptable compromise: its improved put* commands should make it much easier to get something like a regression table into something like a Microsoft Word document.
Beyond exporting models and other tables, Stata still lags well behind other languages, and R especially, when it comes to dynamic documents. R Markdown is miles ahead of what can be (easily) done in Stata, and recent frameworks like Distill will widen that gap in the very short term.
StataCorp's president William Gould has posted a more formal and exhaustive overview of what’s new in Stata 16 on the Stata Blog.
If you are interested in taking a look at my Stata teaching material, the code for the introductory applied statistics course that I teach to Masters students at Sciences Po in Paris is available on GitHub.
Update (July 2, 2019): thanks to R Weekly for mentioning this note.
Anyone who has learnt a programming language has a history of questions that they have asked to:
The standard/traditional way of asking R-related questions has been through its mailing-lists, but other ways to ask (or answer) questions have become very popular.
Aside from being polite and open-minded, asking or answering programming questions online usually also requires providing a minimal reproducible example, also known as a "Minimal Working Example" or "reprex" – the name of a very helpful R package initiated by Jennifer C. Bryan, whose good humour and open-mindedness, visible on her Twitter account and in her many conference talks, are also exemplary.
The "Getting Help with R" page describes both mailing-lists and Stack Overflow as the recommended places for R-related questions and answers:
Stack Overflow
Stack Overflow is a well organized and formatted site for help and discussions about programming. It has excellent searchability. Topics are tagged, and “r” is a very popular tag on the site with almost 150,000 questions (as of summer 2016). To go directly to R-related topics, visit http://stackoverflow.com/questions/tagged/r. For an example both of the value of the site’s organization and information that is very useful to R users, see “How to make a great R reproducible example?”, which is also mentioned above.
R Email Lists
The R Project maintains a number of subscription-based email lists for posing and answering questions about R, including the general R-help email list, the R-devel list for R code development, and R-package-devel list for developers of CRAN packages; lists for announcements about R and R packages; and a variety of more specialized lists. Before posing a question on one of these lists, please read the R mailing list instructions and the posting guide.
At least two R user communities also run forums, which are more practical to search and follow than mailing-lists, and which are not restricted to asking questions:
As mentioned in a previous note, Twitter is a great place to learn about recent R-related developments. It is also, in my view, a great place to interact with many R package developers, in a direct and unobtrusive way.
For longer conversations, opening issues on GitHub (or other code) repositories is probably a better option. GitHub issues host a wide variety of conversations, not just bug reports: reading them is often extremely informative.
A few more places might also be worth considering.
Although there is a dedicated Stack Exchange site for software recommendations, it might not be the best place to ask for (or read about) package recommendations. Perhaps Reddit might be a good place for those, and there are several R-focused subreddits that might suit one's needs:
I have also spotted the r/rshiny and r/ggplot2 subreddits, but those seem to be less active than those listed above, at least as of now.
Last, R also has some Slack channels, such as r-grrr (in French).
Update (July 31, 2017): thanks again to R Weekly for mentioning this note.
Update (February 3, 2019): added the section on community forums, added a mention of Slack channels, and reorganised other sections.
The official announcement of the survey reads:
Please take the survey yourself and help us spread the word on social media, by word of mouth, and any other way you can think of. The survey will be live until September 15th.
In my own answers, I have done my best to stress what I believe are two core strengths of the R community as it exists today:
R is a lowly-centralised programming language: it has a list of core developers, known as "R Core", but it has no "benevolent dictator for life" like other programming languages. Instead, it has someone whom one might want to call a "benevolent contributor for life" in the person of Hadley Wickham, who undoubtedly deserves some kind of lifetime achievement from the R community for developing ggplot2 and the tidyverse (originally nicknamed the "Hadleyverse" by others).
Similarly, R has not one but many package archive networks, including of course CRAN, but also Bioconductor and GitHub, the latter of which brings the virtuous entropy of a place like the Amazonian forest to the R language. As one of my previous notes should perhaps have made more obvious, I strongly believe that this 'anarchic' state of the R ecosystem is essential to its diversity and, in the end, good health. In fact, I believe this holds largely true in any complex system, including political systems.
Decentralisation has its costs: GitHub-hosted packages and alternative engines are places where one might easily inject malware, and diversity means that users get lots of (possibly redundant) choices where they might favour a more restricted set of options. But even those costs have positive externalities in the form of open source software vetting and the development of intelligent safeguards, such as sandboxing and company-monitored programming environments such as Microsoft R.
Another characteristic of the R community that only got a brief mention in a previous note is its humaneness, which encompasses many qualities, including special attention to tackling gender, racial, physical and sexual discrimination – to cite only a few forms – commonly encountered in other social environments.
Just as the R community lacks a (hopefully benevolent) dictator, R-Ladies Global is the kind of initiative that is certainly lacking in many programming communities. For a programming language to find its fullest range of speakers and reap the benefits of cognitive diversity, foundations and consortia are not enough: support groups are necessary to enable participation.
Justice is a sufficient condition to support those initiatives, yet here also, there is a strong positive-externalities argument to be made for difference and diversity from the viewpoint of complex systems, with reference to the arguments of people like Scott E. Page or, from a more sociological angle, Rogers Brubaker.
The ideas outlined in the paragraphs above are expressed in simplistic form and will very easily lend themselves to criticism. I suspect, however, that a longer discussion of cognition, diversity and efficiency would reach the exact same conclusions, contra engineering-style arguments rooted in cheap calls to eugenics, meritocracy, natural selection, and the so-called "optimisation" of social systems through similar processes.
It has been routinely observed, at conferences or elsewhere, that the R community includes many non-programmers. My hope is that this observation is true, and that it will stay so. Consequently, I hope that the responses to the R survey will ensure that the qualities required to maintain this state of affairs get proper representation in the future objectives and priorities of the R Consortium.
Side note: While writing this note, I was unable to find the name or online presence of the working group that focuses on inclusiveness in the R community, beyond gender diversity. I would appreciate if anyone could help me identify that group, in order to link to it from this note. Solved: the ~~group~~ task force is called R Forwards. Thanks to Olivia Brode-Roger for finding it.
Update (July 26, 2017): shortly after I published this note, Julia Silge released a set of slides that she presented, with co-authors, at useR! 2017. The presentation, titled "Navigating the R Package Universe," was initially titled "Navigating the R Package Jungle" – which fits well with my arguments above.
Historically, the R Project for Statistical Computing has been supported by the R Foundation since its inception in 2002. It has laid down some of the most important building blocks of the R ecosystem, including, of course, CRAN, as well as the R Journal and the R mailing-lists.
Fifteen years later, many other organizations have been set up to help develop R and its user base, at various levels and through various means:
This list does not cover the smaller organizations, such as the recently created r-spatial group, which help develop R packages for a myriad of different applications, often with very different audiences.
I would say that R has a pretty happy community right now. Getting help to use R is easier than ever, the quality of many new software releases is very high, and the user base is becoming more and more diverse, which is a huge (and indispensable) asset.
The next step might be to boost the job opportunities available to R users, and to better organise the ways that it is taught in universities, on online learning platforms like Coursera or DataCamp, or through private training firms.
Although there is no single way to keep track of everything going on in the R community, almost everything shows up on Twitter at some point, generally labelled with the #rstats hashtag.
Go and explore, and happy new year!
Update (January 18, 2017): this experimental app makes it easy to explore all #rstats tweets since 2015.
Update (January 23, 2017): thanks for the mention, R Weekly!
A few years ago, John D. Cook gave a presentation called “The R Language: The Good The Bad & The Ugly,” in which he made an excellent series of points about R as a domain-specific language for data analysis.
The entire presentation is worth watching. Below are a few highlights:
When I watched this presentation for the first time, I felt that it formulated all the important reasons for choosing R over another language to learn and then write up data-analytical code.
In hindsight, the only flaw that I find in John D. Cook's presentation is that it restricts the domain that R intends to cover as a domain-specific language. More specifically, Cook describes R as a statistical language, noting, for instance, that ‘books about R’ are generally books about statistics.
This characterisation of the R language is likely to have been adequate a few years ago, and it was certainly adequate to characterise the S language, from which R emerged. As of today, however, there are good reasons to consider that R has evolved to become a language about more than statistics.
I recently discovered the blog maintained by Joshua Ebner at Sharp Sight Labs, and as with the previous presentation, I got the feeling that the author was making all the important points about the R language.
Below is a selection of Sharp Sight Labs blog posts from the past two years, listed in reverse chronological order and summarised through an important quote:
Why R is the best data science language to learn today
… I want to explain all of the reasons why I’m very optimistic about R’s long term prospects, and why I think it’s perhaps the best data science language to learn today.
Why you should master R (even if it might eventually become obsolete)
… data science is changing very fast, and any tool that you learn will eventually become obsolete.
How much data science do you actually remember?
... if you want to get a good data science job, you need to really know your stuff. You need to remember how to write the code, from memory, on command.
Stop trying to jump to the sexy stuff first
[Top performers] don’t demand to start with the cool stuff. Top performers diligently learn and master the foundations.
The real prerequisite for machine learning isn’t math, it’s data analysis
The reality is that in industry, data scientists just don’t do much higher level math.
But most data scientists do spend a huge amount of their time getting data, cleaning data, and exploring data. This applies both to data science generally, and machine learning specifically; and it particularly applies to beginners.
If you want to get started with machine learning, the real prerequisite skill that you need to learn is data analysis.
Why you should start by learning data visualization and manipulation
… learn data visualization first and then learn data manipulation.
How data visualizations are tools (and what you’re building with them)
… you should not approach learning data science from a software development point of view… You should start by learning how to find and deliver insights from data.
These blog posts do a brilliant job of explaining how to learn data science, and what to start with. The recommended language is R because its packages cover data collection, manipulation and visualization, machine learning, and statistics—all of which are building blocks of what one might want to call “data science” in the current context.
Why is it important to clarify the domain covered by the R language?
In the short term, doing so serves two purposes:
First, to avoid misunderstandings.
As John D. Cook explains, R is not a general programming language, but a domain-specific language that has escaped its initial niche to cover a larger range of interests. In my opinion, this renders many (but not all) comparisons between R and other languages essentially irrelevant.
Second, to put R on the skills map.
We need expressions of what R is, and what it is useful for, to make it academically as well as professionally relevant, well beyond the scope of statistical science, into the vaster field of practice that is gradually being identified as “data science.”
In the longer term, knowing what R does should also help us understand what R is unlikely to become in the future and, correlatively, help the developer communities that maintain other programming languages or runtime environments realise what they might achieve by interfacing with R, or by developing software inspired by it.
Update (January 5, 2017): for a somewhat different perspective, see this interview of Joe Cheng, published yesterday on the R Views blog. Key quote:
… people say that one of the differences between say Python or Julia and R is that R is a DSL for stats, whereas these other things are general purpose languages. R is not a DSL. It’s a language for writing DSLs, which is something that’s altogether more powerful.
Update (January 5, 2017): this other note cites additional technologies that are interesting to learn for data science.