Around two years ago, I was asked to teach R again at Sciences Po, in Paris, in a spirit close to the Stata-based course that I have been teaching there for over ten years.
I first taught R to social scientists in 2013, but had not repeated the experience since then, except through various short and often focused workshops. I almost got to teach such a course in 2017, just as RStudio Desktop was turning 1.0, but that course failed to materialize.
Many things have changed since 2013, and there is now much higher demand to teach R (and RStudio) to social science audiences. R and RStudio have improved a lot, and the tidyverse, which recently turned 2.0 while still changing a lot, has become a core component of most courses, including mine.
My own attempt to teach R, RStudio and the tidyverse in 2023 has been online for a few months, in the form of a GitHub repository with a few wiki pages, including a long list of readings, videos and Web links, and another list of other R courses.
I have also uploaded a tentative syllabus for the course:
The course has only run once so far, and there are many issues with it that I will try to fix in the coming months. The repository is also missing some essential course items (the slides and the solutions to the exercises), which I am nonetheless happy to share privately by email.
A cool aspect of the course is that another instructor, Kim Antunez, will be teaching her own fork of it in the next few weeks. Kim has invested a lot into turning the course into a full-fledged Quarto website, which I will share in a follow-up post once she is done building it.
My own way of teaching the course is more old-school, as I rely on weekly emails and a shared Google Drive folder. I will, however, put some effort in improving the slides and giving the course a Web page, in order to make it more fully and easily accessible online.
I feel that I already have enough material to assemble a more advanced R course for social scientists, but first need to streamline this introductory course a bit more, in order to make the reading list, especially, a bit more focused and manageable.
I also feel that there will soon be more changes to the tidyverse that I will have to take into account. I still, for instance, use the %>% pipe to chain operations, whereas the current trend is to use the native |> pipe, introduced in R 4.1.0, whenever possible.
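As a quick illustration (a minimal sketch of my own, using dplyr and the built-in mtcars data), the two pipes read almost identically in simple chains:

```r
library(dplyr)

# magrittr pipe, re-exported by the tidyverse packages
mtcars %>%
  filter(cyl == 4) %>%
  summarise(mean_mpg = mean(mpg))

# native pipe, available since R 4.1.0
mtcars |>
  filter(cyl == 4) |>
  summarise(mean_mpg = mean(mpg))
```

For chains like this one, the two pipes are interchangeable; they differ mainly in their placeholder syntax (`.` for magrittr, `_` for the native pipe since R 4.2.0) and in a few edge cases around calls without parentheses.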
This note is tangentially related to my previous notes on teaching with RStudio, on R as a data science language and on other technologies for data science.
I took some kind of a break from R over the past 18 months. I plan to change that this coming year, and have compiled the following list of things that I want to explore, or come back to.
While I have fully transitioned towards “tidy data” and its wonderful packages, I am still not the type of R user who works in R Markdown documents (notebooks) like David Robinson so brilliantly illustrates in his videos.
I hope to get there next year, because it makes sense from a reproducibility perspective.
Most of my quantitative methods courses involve either basic frequentist regression models, or models used on panel data by political scientists, who have imported a lot of their modelling standards from econometrics and political economy. The models and replication material that come with published articles are mostly coded in Stata.
One thing that I need to do this coming year is to check how easy (or difficult) it is, in 2021, to replicate those Stata-coded panel data models in R. This will involve looking mostly at cross-section time-series (CSTS) data, usually measured at the country level, and fitting some of the regression models available in Stata.
I have been wanting to upgrade my Web scraping skills for some time. My initial plan was to learn enough Python to switch to that language (and its excellent libraries) for Web scraping, but I keep coming back to R, for lack of proper learning time and of professional incentives to code in Python.
The specific thing that I will be coming back to is headless browsing.
Network models, and exponential random graph models (ERGMs) in particular, have improved a lot in the past few years. I have followed the literature at a distance, and need to dive into it again, especially for the part that focuses either on generalizing ERGMs beyond binary responses, or on taking time (temporal dependence) into account.
I received my copy of Gelman, Hill and Vehtari's Regression and Other Stories this summer, and do not want that book to end up with Harrell's Regression Modeling Strategies and McElreath's Statistical Rethinking on my list of books that I want to read, but might never end up doing so. (The list is much longer than that, and also has everything by Hastie and Tibshirani on it.)
Let's call this to-do item my annual attempt at transitioning further towards Bayesianism, which is made easy in R thanks to the rstanarm package, to Bürkner's brms package, to Harrell's rmsb package, and to McElreath's rethinking package.
The tidymodels framework and Julia Silge's videos offer a nice invitation to dive deeper into those things that many of us explored when "machine learning" was the keyword that any aspiring methodologist (or data scientist, or else) had to know something about.
Going back to learning about machine learning is something that I look forward to, and which I plan to do while looking again at Cosma Shalizi's course on data mining from 2019.
The to-do list above will have to find its place alongside coding in Stata for one of my oldest courses, plus reading about many other things that have little to do with R or statistics. Let's see if that will happen.
Update (December 14, 2020): corrected the authors of the brms package, with thanks to Dieter Menne, who spotted and reported the error.
This note explains how the sample() function has changed since R 3.6.0, and how to reproduce its previous behaviour.
A recent blog post by Christian Robert reminded me that R had to fix its sample() function in R 3.6.0 and above.
The issue that used to affect the pseudo-random number generator (PRNG) at the core of the function is documented in a note by Kellie Ottoboni and Philip B. Stark, “Random problems with R,” which was extensively discussed on the R-devel mailing-list.
The note explains that (part of) the PRNG used by R 3.5.1 does not correct for the uneven spacing of binary floating-point numbers. The resulting quantization error produces biased selection probabilities, severe enough that the generator no longer qualifies as sufficiently pseudo-random.
The issue was patched in R 3.6.0, which also introduced a method for reproducing the former behaviour of the PRNG. The method is well documented on R blogs like Revolution Analytics or J. Kenneth Tay's Statistical Odds & Ends, and consists of adjusting the RNGkind option before calling the sample() function:
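Concretely, the adjustment looks like this (R 3.6.0 and later will emit a warning, since the "Rounding" method is the one known to be biased):

```r
# Switch back to the pre-3.6.0 sampling method to reproduce old results
RNGkind(sample.kind = "Rounding")
set.seed(42)
sample(5)

# Revert to the current, unbiased default
RNGkind(sample.kind = "Rejection")
```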
This is of course only useful if one depends on a particular random number generator and seed number, as set through set.seed(), to reproduce the behaviour and results of R code written and executed before R 3.6.0.
At that stage, you might be tempted to add generating random numbers to the short list of hard things in computer science—which would be, in my view, entirely correct.
This note is obviously of the ‘note to self’ kind. The previous one in that category (successfully) aimed at reminding me to use the RDS format.
Update (August 14, 2019): to change the local state of the PRNG without affecting its global state, see this note by Evgeni Chasnovski.
Update (August 14, 2019): thanks to R Weekly for mentioning this note.
The data include 9 conferences, plus a forthcoming one. For each conference, the number of sponsors varies between 8 and 46, and the number of organising sponsors varies between 1 and 4. The dataset has 201 sponsors in total: make sure to check the codebook and coding notes for details.
Listing the most frequent sponsors is fairly instructive, as it shows the mix of public (academic or governmental) and private (commercial) institutions that make R conferences possible:
Sponsor | Sector | Conferences
SFdS | nonprofit | 9
Danone | private | 5
INRA | public | 5
AMIES | public | 4
Capionis | private | 4
CNRS | public | 4
CRC Press | private | 4
GDR Stat Santé | public | 4
IA | public | 4
Lysis | private | 4
The private-sector sponsors, in particular, are interesting: the full sample includes mostly statistical consulting and training small businesses, as well as a few academic publishers and some industrial research units, including some large companies working on the life sciences (biology, health and nutrition), as well as energy and transport.
If one looks at the broader spectrum of scientific disciplines and domains covered by the sponsors, (applied) mathematics, statistics and computer science are, of course, well represented, as are the aforementioned life sciences, with a special mention to INRA and agricultural sciences.
Looking at this data, I would recommend that anyone who wants to spend their working life doing statistical computing or programming (with R) either study mathematics or statistics directly, through a degree in those fields, or go into biology and study lots of biostatistics, bioinformatics and computational biology.
The types of occupations associated with R come, unsurprisingly, from three main domains: higher education and scientific research, software development, and statistical consulting and training. If you like to compute stuff and/or to teach how to do it, then R is clearly made for you!
Note that, while underrepresented among the sponsors and organising committees of the conferences listed in the data, R is also widespread in the social sciences, where it tends to gradually replace other statistical software like SPSS or Stata.
Also note that the data cover only (French) R conferences and therefore exclude many other relevant ‘computational science’ conferences. As a consequence, physics and other domains are underrepresented by construction, which would likely not occur if one repeated the same exercise on conferences like SciPy or SIGGRAPH.
My Twitter feed is currently offering me lots of links to slides that are being presented at useR! 2019 in Toulouse. The hashtag is #useR2019, and you can get most of the material through GitHub, where someone has, as is usual during such conferences, started to compile all the material: thank you, Suthira Owlarn.
So far, my favourite find in the material being presented is Dmytro Perepolkin's talk on, and package for, polite scraping. I also plan to take a look at Timothée Giraud's cartography (thematic maps), Anqi Fu and Balasubramanian Narasimhan's CVXR (convex optimization), and Dianne Cook's tutorial on visualising high-dimensional data.
As far as I know, this is the second time that the useR! conference happens in France: the first time was ten years ago, in Rennes. The past events Web page of the useR! 2019 website gives a few more clues about other R conferences in France:
… most members of the organizing committee were previously involved in the organization of the Journées Françaises de Statistique in 2013 and in the French R meeting in 2016.
The Journées Françaises de Statistique is an event held by the Société Française de Statistique (SFdS), which has been sponsoring French R conferences for a long time, as have the Société Française de Biométrie (SFB) and several research organizations involved in disciplines including mathematics, computer science, agriculture and ecology.
The “French R meeting” mentioned in the quote above is called Rencontres R and has been happening since 2012. Since there does not seem to be a public listing of all its editions, here is my own index of their websites:
There are tons of local French R meetings; one that I remember vividly from a few years ago was called FLτR, and was attended and organised by lots of people from Insee, the French official statistics agency.
As of today, the only local groups that I keep an eye on are the R Addicts Paris Meetup group, the Semin-R conference and the RUSS (R à l’Usage des Sciences Sociales) seminar, all of which are located in Paris.
You will find many more links to French R conferences and groups on the frrrenchies Web page maintained by Paul-Antoine Chevalier and others.
The page lists many useful help resources for French speakers, such as the r-grrr Slack channel, but its most important section, to me, is the part where it lists R packages with specific relevance to users working on French (administrative, geographic, etc.) data.
You might also be interested in my note on French R conference sponsors.
Update (July 16, 2019): thanks as always to R Weekly for mentioning this note.
For background, Stata is the first statistical software that I learnt in depth. I have since transitioned to R but still use Stata for teaching purposes, to check some models that I know how to code in both languages, or to replicate papers that use it, which is common in disciplines heavily influenced by econometrics like political science.
The irony of ‘rebooting’ my “R / Notes” blog with a note on Stata is not lost on me. Still, my intention here is to compare some recent aspects of the R and Stata languages, as both contain some information about the state of applied statistics in academia.
StataCorp releases a new version every two years, and its “what’s new” pages are often highly instructive with regard to what applied statisticians are using in the discipline cited above, as well as a few others like epidemiology.
Stata 15 was released in 2017. It introduced, among other things, wide support for Bayesian estimation, minimal dynamic reporting with Markdown and Microsoft Word, as well as other much-needed features like transparency in graphics and panel data cointegration tests.
Some of those features might have been available earlier as Stata packages from the Statistical Software Components (SSC) archive, the Stata equivalent of CRAN. However, their implementation into the ‘core’ of the Stata language guarantees faster and wider adoption as well as stronger long-term support, which is something that StataCorp does very well, just as it excels at writing extensive and intelligible documentation.
Stata is both a DSL and an IDE, and some of the least interesting but most important changes in Stata 16 affect the latter. Stata is used almost exclusively by users who rely on its IDE rather than on its command-line version or on an external editor, hence the importance of getting it right.
It might surprise R users, who often also use shell terminals and source code editors, that Stata only just introduced language autocompletion to its do-file (script) editor windows. While the improvement does not put the Stata IDE on par with RStudio, it is still very, very much welcome.
Another GUI change that will affect Mac users is native tab support, which brings things like viewing datasets or documentation in Stata closer to how they work in RStudio.
Note that I am comparing the Stata IDE to RStudio and not to the ‘default’ R GUI, partly out of compassion for the latter and also because I have not used it – or seen anyone use it – in years, except perhaps for one remarkable R user, who might well have now switched to other IDEs like RStudio or Jupyter Python notebooks.
Another feature of Stata that might surprise R users is that Stata handles only one dataset at a time: virtually all Stata commands (functions) run by the user are applied to (or make use of) that dataset.
This makes “manipulating your data” (broadly defined) in Stata very easy, since once the data are loaded in memory, there is no need to assign further changes to the data object. However, it makes anything from merging two datasets to storing scalars and strings much more difficult.
For a long time, R had the reverse issue: storing different objects has always been simple in R, but repeatedly editing the same object used to be tiresome to code. This has changed, of course, with the introduction of the %>% forward-pipe operator.
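To sketch the difference (my own toy example, using dplyr verbs on the built-in mtcars data), the same sequence of edits can be written with repeated assignment or as a single pipeline:

```r
library(dplyr)

# Without the pipe: repeatedly overwrite the same object
d <- mtcars
d <- filter(d, am == 1)
d <- mutate(d, kml = mpg * 0.425)
d <- arrange(d, desc(kml))

# With the forward pipe: the same edits read as a single chain
d2 <- mtcars %>%
  filter(am == 1) %>%
  mutate(kml = mpg * 0.425) %>%
  arrange(desc(kml))

identical(d, d2)
```

Both routes apply the exact same verbs in the same order, so they produce the same data frame; the pipe simply removes the need to name the intermediate states.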
Stata 16 features a frame command that aims to let users manipulate multiple datasets, hopefully in ways less clunky than the tricks and workarounds that all Stata users have had to rely upon so far. I am personally only half-convinced by the implementation, but some users might benefit from it.
Stata now supports executing Python code in a very clean and unobtrusive way. The integration seems almost as seamless as what the reticulate package provides in R.
While Python is probably the third most-used language in the academic settings that I work in, it is the most used language in other branches of academia, as well as in the parts of the industry that have given its effective meaning to ‘data science’.
Python’s excellent machine learning, natural text processing and Web scraping libraries are certainly driving its integration in many other languages, and hopefully also the broader integration of machine learning into quantitative scientific research.
Every new version of Stata introduces new models. Stata 16 adds lasso regression and heteroskedastic ordered probits and, most interesting to me, panel-data mixed logit and more ways to work with panel/multilevel data.
While Stata offers a vast choice of statistical models, either through ‘core’ Stata or through user-submitted packages, it certainly does not match the diversity of models available in R. To take an example, estimating exponential random graph models in Stata requires wrapping around their R implementation.
However, generally speaking, the restricted diversity of statistical models offered in Stata has its upsides, one of them being that the standard way in which Stata ‘prints out’ models on screen is far more consistent than what happens across R packages.
Similarly, Stata is highly consistent in how it handles optional arguments to its model commands, e.g. to cluster standard errors. This, to me, is the kind of attention to the user that keeps people using the software, along with the quasi-magical margins command, which has been ported to R.
All of this affects me because, like many R users and like virtually every Stata user, I spend a lot of time working with statistical models, an area in which R has recently started to recover some consistency, thanks to the work of Max Kuhn, Alex Hayes and many others on ‘tidy models’.
In my view, Stata has never been very good at graphics: the Stata graph syntax is, like that of base R (and still in my view), rather verbose and inelegant. Graphics in Stata 16 are improving, but their syntax will almost certainly never outperform that of ggplot2, or even of lattice, on my (personal, opinionated) benchmarks.
Similarly, to me, exporting tabular results such as summary statistics or models has always been one of Stata's great weaknesses, compensated only in part by Ben Jann's invaluable estout package.
So while I mentioned earlier that I find Stata’s way of printing model results to the user truly excellent, I find almost every other ‘side effect’ in Stata inefficient.
Stata 16, however, seems to be getting closer to offering an acceptable compromise: its improved put* commands should make it much easier to get something like a regression table into something like a Microsoft Word document.
Beyond exporting models and other tables, Stata still lags well behind other languages, and R especially, when it comes to dynamic documents. R Markdown is miles ahead of what can be (easily) done in Stata, and recent frameworks like Distill will widen that gap in the very short term.
StataCorp's president William Gould has posted a more formal and exhaustive overview of what’s new in Stata 16 on the Stata Blog.
If you are interested in taking a look at my Stata teaching material, the code for the introductory applied statistics course that I teach to Masters students at Sciences Po in Paris is available on GitHub.
Update (July 2, 2019): thanks to R Weekly for mentioning this note.
Anyone who has learnt a programming language has a history of questions that they have asked to:
The standard/traditional way of asking R-related questions has been through its mailing-lists, but other ways to ask (or answer) questions have become very popular.
Aside from being polite and open-minded, asking or answering programming questions online usually also requires providing a minimal reproducible example, also known as a "Minimal Working Example" or "reprex" – the name of a very helpful R package initiated by Jennifer C. Bryan, whose good humour and open-mindedness, visible on her Twitter account and in her many conference talks, are also exemplary.
The "Getting Help with R" page describes both mailing-lists and Stack Overflow as the recommended places for R-related questions and answers:
Stack Overflow
Stack Overflow is a well organized and formatted site for help and discussions about programming. It has excellent searchability. Topics are tagged, and “r” is a very popular tag on the site with almost 150,000 questions (as of summer 2016). To go directly to R-related topics, visit http://stackoverflow.com/questions/tagged/r. For an example both of the value of the site’s organization and information that is very useful to R users, see “How to make a great R reproducible example?”, which is also mentioned above.
R Email Lists
The R Project maintains a number of subscription-based email lists for posing and answering questions about R, including the general R-help email list, the R-devel list for R code development, and R-package-devel list for developers of CRAN packages; lists for announcements about R and R packages; and a variety of more specialized lists. Before posing a question on one of these lists, please read the R mailing list instructions and the posting guide.
At least two R user communities also run forums, which are more practical to search and follow than mailing-lists, and which are not restricted to asking questions:
As mentioned in a previous note, Twitter is a great place to learn about recent R-related developments. It is also, in my view, a great place to interact with many R package developers, in a direct and unobtrusive way.
For longer conversations, opening issues on GitHub (or other code) repositories is probably a better option. GitHub issues host a wide variety of conversations, not just bug reports: reading them is often extremely informative.
A few more places might also be worth considering.
Although there is a dedicated Stack Exchange site for software recommendations, it might not be the best place to ask for (or read about) package recommendations. Perhaps Reddit might be a good place for those, and there are several R-focused subreddits that might suit one's needs:
I have also spotted the r/rshiny and r/ggplot2 subreddits, but those seem to be less active than those listed above, at least as of now.
Last, R also has some Slack channels, such as r-grrr (in French).
Update (July 31, 2017): thanks again to R Weekly for mentioning this note.
Update (February 3, 2019): added the section on community forums, added a mention of Slack channels, and reorganised other sections.
The official announcement of the survey reads:
Please take the survey yourself and help us spread the word on social media, by word of mouth, and any other way you can think of. The survey will be live until September 15th.
In my own answers, I have done my best to stress what I believe are two core strengths of the R community as it exists today:
R is a lowly-centralised programming language: it has a list of core developers, known as "R Core", but it has no "benevolent dictator for life" like other programming languages. Instead, it has someone whom one might want to call a "benevolent contributor for life" in the person of Hadley Wickham, who undoubtedly deserves some kind of lifetime achievement from the R community for developing ggplot2 and the tidyverse (originally nicknamed the "Hadleyverse" by others).
Similarly, R has not one but many package archive networks, including of course CRAN, but also Bioconductor and GitHub, the latter of which brings the virtuous entropy of a place like the Amazonian forest to the R language. As one of my previous notes should perhaps have made more obvious, I strongly believe that this 'anarchic' state of the R ecosystem is essential to its diversity and, in the end, good health. In fact, I believe this holds largely true in any complex system, including political systems.
Decentralisation has its costs: GitHub-hosted packages and alternative engines are places where one might easily inject malware, and diversity means that users get lots of (possibly redundant) choices where they might favour a more restricted set of options. But even those costs have positive externalities in the form of open source software vetting and the development of intelligent safeguards, such as sandboxing and company-monitored programming environments such as Microsoft R.
Another characteristic of the R community that only got a brief mention in a previous note is its humaneness, which encompasses many qualities, including special attention to tackling gender, racial, physical and sexual discrimination – to cite only a few forms – commonly encountered in other social environments.
Just as the R community lacks a (hopefully benevolent) dictator, R-Ladies Global is the kind of initiative that is certainly lacking in many programming communities. For a programming language to find its fullest range of speakers and reap the benefits of cognitive diversity, foundations and consortia are not enough: support groups are necessary to enable participation.
Justice is a sufficient condition to support those initiatives, yet here also, there is a strong positive-externalities argument to be made for difference and diversity from the viewpoint of complex systems, with reference to the arguments of people like Scott E. Page or, from a more sociological angle, Rogers Brubaker.
The ideas outlined in the paragraphs above are expressed in simplistic form and will very easily lend themselves to criticism. I suspect, however, that a longer discussion of cognition, diversity and efficiency would reach the exact same conclusions, contra engineering-style arguments rooted in cheap calls to eugenics, meritocracy, natural selection, and the so-called "optimisation" of social systems through similar processes.
It has been routinely observed, at conferences or elsewhere, that the R community includes many non-programmers. My hope is that this observation is true, and that it will stay so. Consequently, I hope that the responses to the R survey will ensure that the qualities required to maintain this state of affairs get proper representation in the future objectives and priorities of the R Consortium.
Side note: While writing this note, I was unable to find the name or online presence of the working group that focuses on inclusiveness in the R community, beyond gender diversity. I would appreciate if anyone could help me identify that group, in order to link to it from this note. Solved: the ~~group~~ task force is called R Forwards. Thanks to Olivia Brode-Roger for finding it.
Update (July 26, 2017): shortly after I published this note, Julia Silge released a set of slides that she presented, with co-authors, at useR! 2017. The presentation, titled "Navigating the R Package Universe," was initially titled "Navigating the R Package Jungle" – which fits well with my arguments above.
Historically, the R Project for Statistical Computing has been supported by the R Foundation since its inception in 2002. It has laid down some of the most important building blocks of the R ecosystem, including, of course, CRAN, as well as the R Journal and the R mailing-lists.
Fifteen years later, many other organizations have been set up to help develop R and its user base, at various levels and through various means:
This list does not cover the smaller organizations, such as the recently created r-spatial group, which help develop R packages for a myriad of different applications, often with very different audiences.
I would say that R has a pretty happy community right now. Getting help to use R is easier than ever, the quality of many new software releases is very high, and the user base is becoming more and more diverse, which is a huge (and indispensable) asset.
The next step might be to boost the job opportunities available to R users, and to better organise the ways that it is taught in universities, on online learning platforms like Coursera or DataCamp, or through private training firms.
Although there is no single way to keep track of everything going on in the R community, almost everything shows up on Twitter at some point, generally labelled with the #rstats hashtag.
Go and explore, and happy new year!
Update (January 18, 2017): this experimental app makes it easy to explore all #rstats tweets since 2015.
Update (January 23, 2017): thanks for the mention, R Weekly!
A few years ago, John D. Cook gave a presentation called “The R Language: The Good The Bad & The Ugly,” in which he made an excellent series of points about R as a domain-specific language for data analysis.
The entire presentation is worth watching. Below are a few highlights:
When I watched this presentation for the first time, I felt that it formulated all the important reasons for choosing R over another language to learn and then write up data-analytical code.
In hindsight, the only flaw that I find in John D. Cook's presentation is that it restricts the domain that R intends to cover as a domain-specific language. More specifically, Cook describes R as a statistical language, noting, for instance, that ‘books about R’ are generally books about statistics.
This characterisation of the R language is likely to have been adequate a few years ago, and it was certainly adequate to characterise the S language, from which R emerged. As of today, however, there are good reasons to consider that R has evolved to become a language about more than statistics.
I recently discovered the blog maintained by Joshua Ebner at Sharp Sight Labs, and as with the previous presentation, I got the feeling that the author was making all the important points about the R language.
Below is a selection of Sharp Sight Labs blog posts from the past two years, listed in reverse chronological order and summarised through an important quote:
Why R is the best data science language to learn today
… I want to explain all of the reasons why I’m very optimistic about R’s long term prospects, and why I think it’s perhaps the best data science language to learn today.
Why you should master R (even if it might eventually become obsolete)
… data science is changing very fast, and any tool that you learn will eventually become obsolete.
How much data science do you actually remember?
... if you want to get a good data science job, you need to really know your stuff. You need to remember how to write the code, from memory, on command.
Stop trying to jump to the sexy stuff first
[Top performers] don’t demand to start with the cool stuff. Top performers diligently learn and master the foundations.
The real prerequisite for machine learning isn’t math, it’s data analysis
The reality is that in industry, data scientists just don’t do much higher level math.
But most data scientists do spend a huge amount of their time getting data, cleaning data, and exploring data. This applies both to data science generally, and machine learning specifically; and it particularly applies to beginners.
If you want to get started with machine learning, the real prerequisite skill that you need to learn is data analysis.
Why you should start by learning data visualization and manipulation
… learn data visualization first and then learn data manipulation.
How data visualizations are tools (and what you’re building with them)
… you should not approach learning data science from a software development point of view… You should start by learning how to find and deliver insights from data.
These blog posts do a brilliant job of explaining how to learn data science, and what to start with. The recommended language is R because its packages cover data collection, manipulation and visualization, machine learning, and statistics—all of which are building blocks of what one might want to call “data science” in the current context.
Why is it important to clarify the domain covered by the R language?
In the short term, doing so serves two purposes:
First, to avoid misunderstandings.
As John D. Cook explains, R is not a general programming language, but a domain-specific language that has escaped its initial niche to cover a larger range of interests. In my opinion, this renders many (but not all) comparisons between R and other languages essentially irrelevant.
Second, to put R on the skills map.
We need expressions of what R is, and what it is useful for, to make it academically as well as professionally relevant, well beyond the scope of statistical science, into the vaster field of practice that is gradually being identified as “data science.”
In the longer term, knowing what R does should also help us understand what R is unlikely to become in the future and, correlatively, help the developer communities that maintain other programming languages or runtime environments realise what they might achieve by interfacing with R, or by developing software inspired by it.
Update (January 5, 2017): for a somewhat different perspective, see this interview of Joe Cheng, published yesterday on the R Views blog. Key quote:
… people say that one of the differences between say Python or Julia and R is that R is a DSL for stats, whereas these other things are general purpose languages. R is not a DSL. It’s a language for writing DSLs, which is something that’s altogether more powerful.
Update (January 5, 2017): this other note cites additional technologies that are interesting to learn for data science.