Stata 16, as seen from R

This note takes a look at some of the new features in Stata 16, which was released this month, and compares those to their R equivalents.

Some personal background

For background, Stata is the first statistical software that I learnt in depth. I have since transitioned to R but still use Stata for teaching purposes, to check some models that I know how to code in both languages, or to replicate papers that use it, which is common in disciplines heavily influenced by econometrics like political science.

The irony of ‘rebooting’ my “R / Notes” blog with a note on Stata is not lost on me. Still, my intention here is to compare some recent aspects of the R and Stata languages, as both contain some information about the state of applied statistics in academia.

Some background on Stata

StataCorp releases a new version every two years, and its “what’s new” pages are often highly instructive with regard to what applied statisticians are using in the disciplines cited above, as well as in a few others like epidemiology.

Stata 15 was released in 2017. It introduced, among other things, wide support for Bayesian estimation, minimal dynamic reporting with Markdown and Microsoft Word, as well as other much-needed features like transparency in graphics and panel data cointegration tests.

Some of those features might have been available earlier as Stata packages from the Statistical Software Components (SSC) archive, the Stata equivalent of CRAN. However, their integration into the ‘core’ of the Stata language guarantees faster and wider adoption as well as stronger long-term support, which is something that StataCorp does very well, just as it excels at writing extensive and intelligible documentation.

New in the Stata 16 IDE

Stata is both a DSL and an IDE, and some of the least interesting but most important changes in Stata 16 affect the latter. Almost all Stata users rely on its IDE rather than on its command-line version or on an external editor, hence the importance of getting it right.

It might surprise R users, who also often use shell terminals and source code editors, but Stata only just introduced language autocompletion to its do-file (script) editor windows. While the improvement does not put the Stata IDE on par with RStudio, it is still very, very much welcome.

Another GUI change that will affect Mac users is native tab support, which brings things like viewing datasets or documentation in Stata closer to how they work in RStudio.

Note that I am comparing the Stata IDE to RStudio and not to the ‘default’ R GUI, partly out of compassion for the latter and also because I have not used it – or seen anyone use it – in years, except perhaps for one remarkable R user, who might well have now switched to other IDEs like RStudio or Jupyter Python notebooks.

Multiple data frames

Another feature of Stata that might surprise R users is that Stata handles only one dataset at a time: virtually all Stata commands (functions) run by the user are applied to (or make use of) that dataset.

This makes “manipulating your data” (broadly defined) in Stata very easy, since once the data are loaded in memory, there is no need to assign further changes to the data object. However, it makes anything from merging two datasets to storing scalars and strings much more difficult.

For a long time, R had the reverse issue: storing different objects has always been simple in R, but repeatedly editing the same object used to be tiresome to code. This has changed, of course, with the introduction of the %>% forward-pipe operator.
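To make the idea concrete without relying on any R package, here is a minimal sketch of what a forward pipe does, written in Python. The pipe helper, the toy rows and the three step functions are all hypothetical names made up for this illustration; this is not how %>% is implemented in R, only an analogy for passing one object through successive edits without intermediate assignments.

```python
from functools import reduce


def pipe(value, *steps):
    """Feed `value` through each function in `steps`, left to right."""
    return reduce(lambda acc, step: step(acc), steps, value)


# Hypothetical toy data: one dict per observation.
rows = [
    {"unit": "a", "year": 2018, "x": 10.0, "n": 2},
    {"unit": "b", "year": 2018, "x": 12.0, "n": 4},
    {"unit": "a", "year": 2017, "x": 9.0, "n": 3},
]


def keep_2018(data):
    # Filter step: keep only the 2018 observations.
    return [r for r in data if r["year"] == 2018]


def add_ratio(data):
    # Build new dicts so that the original rows are left untouched.
    return [{**r, "ratio": r["x"] / r["n"]} for r in data]


def sort_by_ratio(data):
    # Sort step: largest ratio first.
    return sorted(data, key=lambda r: r["ratio"], reverse=True)


# Reads top to bottom, much like rows %>% keep_2018() %>% add_ratio() would in R.
result = pipe(rows, keep_2018, add_ratio, sort_by_ratio)
print([r["unit"] for r in result])  # prints ['a', 'b']
```

Each step takes the data and returns a modified copy, so the sequence of edits reads in the order in which it happens, which is the readability gain that the pipe brought to R.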

Stata 16 introduces a frame command that aims to let users manipulate multiple datasets, hopefully in ways less clunky than the tricks and workarounds that all Stata users have had to rely upon so far. I am personally only half-convinced by the implementation, but some users might benefit from it.

Python support

Stata now supports executing Python code in a very clean and unobtrusive way. The integration seems almost as seamless as what the reticulate package provides in R.
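As a rough sketch of the kind of code involved, the snippet below is plain, self-contained Python that counts word frequencies. In Stata 16, code like this can be placed between python: and end in a do-file; the function name and the sample text are hypothetical, and I am assuming that, in real use, the input would come from the dataset in memory through the sfi module that ships with Stata rather than from a hard-coded string.

```python
import re
from collections import Counter


def top_words(text, n=3):
    """Return the `n` most frequent lowercase words in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)


# Hypothetical input, hard-coded so that the sketch runs outside Stata too.
sample = "the cat and the hat and the bat"
print(top_words(sample, 2))  # prints [('the', 3), ('and', 2)]
```

Nothing in the snippet is specific to Stata, which is the point: the same code runs unchanged in a regular Python session.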

While Python is probably only the third most-used language in the academic settings that I work in, it is the most-used language in other branches of academia, as well as in the parts of industry that have given ‘data science’ its effective meaning.

Python’s excellent machine learning, natural language processing and Web scraping libraries are certainly driving its integration into many other languages, and hopefully also the broader integration of machine learning into quantitative scientific research.

Statistical models

Every new version of Stata introduces new models. Stata 16 adds lasso regression and heteroskedastic ordered probits, and most interesting to me, panel-data mixed logit and more ways to work with panel/multilevel data.

While Stata offers a vast choice of statistical models, either through ‘core’ Stata or through user-submitted packages, it certainly does not match the diversity of models available in R. To take an example, estimating exponential random graph models in Stata requires wrapping around their R implementation.

However, generally speaking, there are upsides to the restricted diversity of statistical models offered in Stata, one of them being that the standard way in which Stata ‘prints out’ models on screen is far more consistent than it is across R packages.

Similarly, Stata is highly consistent in how it handles optional arguments to its model commands, e.g. to cluster standard errors. This, to me, is the kind of attention to the user that keeps people using the software, along with the quasi-magical margins command, which has been ported to R.

All of this affects me because, like many R users and like virtually every Stata user, I spend a lot of time working with statistical models, an area in which R has recently started to recover some consistency, thanks to the work of Max Kuhn, Alex Hayes and many others on ‘tidy models’.

Graphics, tables and reports

In my view, Stata has never been very good at graphics: the Stata graph syntax is, like that of base R (and still in my view), rather verbose and inelegant. Graphics in Stata 16 are improving, but their syntax will almost certainly never outperform that of ggplot2 or even of lattice on my (personal, opinionated) benchmarks.

Similarly, to me, exporting tabular results such as summary statistics or models has always been one of Stata’s great weaknesses, compensated only in part by Ben Jann’s invaluable estout package.

So while I mentioned earlier that I find Stata’s way of printing model results to the user truly excellent, I find almost every other ‘side effect’ in Stata inefficient.

Stata 16, however, seems to be getting closer to offering an acceptable compromise: its improved put* commands should make it much easier to get something like a regression table into something like a Microsoft Word document.

Beyond exporting models and other tables, Stata still lags well behind other languages, and R especially, when it comes to dynamic documents. R Markdown is miles ahead of what can be (easily) done in Stata, and recent frameworks like Distill will widen that gap in the very short term.

StataCorp’s president William Gould has posted a more formal and exhaustive overview of what’s new in Stata 16 on the Stata Blog.

If you are interested in taking a look at my Stata teaching material, the code for the introductory applied statistics course that I teach to Masters students at Sciences Po in Paris is available on GitHub.

Update (July 2, 2019): thanks to R Weekly for mentioning this note.

First published on June 28th, 2019
