R / Notes

Current views on generative AI

2026-03-14T00:00:00+01:00

This post contains my current views on generative artificial intelligence, and Large Language Models in particular. The context is mostly academia, which is about research and teaching.

Personal context

Generative AI is slowly creeping into my professional workflow, not because I am using it myself (I don't, although I guess that I will, at some point), but because everyone around me is.

My students use ChatGPT and other tools like, I believe, NotebookLM and Perplexity Comet. My RSS news feed (that's how old I am) recently had an article on Claude, and I use Google applications, so I keep getting passive-aggressively asked to use Gemini, which I might do one day via Scholar Labs.

My workplace, which is a university, has taken a very basic stance on generative AI: unless stated otherwise, students are to follow LSE Position 1 (no use of generative AI in graded work), which I suppose goes both ways (no use of generative AI in grading, either).

I do not know of any equivalent position on generative AI in research. It seems like everyone wants to discuss the topic and play around with whatever is available for free online, but no one wants to make hard decisions about it yet, possibly due to upcoming EU-level regulations.

Risks for teaching and learning

From a teaching perspective, generative AI is only useful to me if it helps students go through the following process:

1. Learn
2. Draft
3. Revise
4. Submit
5. Defend

Part of what I teach is code, and code is the topic of this blog. As it happens, generative AI is already very good with code, and I am confident that it can be put to good use to go through Steps 1 to 3 of the process above.

There are, however, at least four reasons why I am currently taking ‘LSE Position 1’ on using generative AI in graded work that relies on code:

Many students are using AI to bypass the learning process, rather than enhance it. This creates security risks, and violates academic ethics in the same way that hiring an external party would. This comes on top of other breaches of student ethics, such as plagiarism.
The two issues mentioned in the previous point cannot be defended against at my level, at least not with my current resources. I can spot security risks, but I cannot reliably detect AI-generated code, which is neither watermarked nor scannable through anti-plagiarism tools.
The software that I use in class is mostly open-source, and reproducibility is part of the core principles that I teach in class. As far as I understand, and unless proven otherwise, the kind of generative AI technology used by my students does not enforce these principles.
To make things worse, most generative AI also violates intellectual property (Ahmed et al. 2026), rather than reconfigure it around the ‘copyleft’ and ‘creative commons’ principles that many of us have spent years pushing within fields such as academic publishing.

I have not been exposed to any argument that makes any attempt at solving the ethical, logistical, moral and eventually legal issues that I have outlined above. Until I do, I will treat generative AI as a form of doping, and will keep banning it.

The analogy above with doping is not an innocent one. There is, in my view, a very real rhetorical arc that goes from generative AI to the Enhanced Games. Higher education does not approve of students taking Adderall, and neither do I.

Risks for scientific research

From a research perspective, generative AI is only useful to me if it helps me go through the following process:

Compile existing evidence
Collect meaningful data
Produce meaningful measures
Formulate correct interpretations
Enhance existing knowledge

There is no doubt that generative AI can help with every step above, especially perhaps at the level of data collection and, in the case of ‘big data’ or whatever people call it today, classification. I am also very interested in what it can contribute with regard to compiling scientific studies, in the same way that it is already helping with mathematical problems.

The risks that I have heard about so far when it comes to generative AI and social science research (which is what I do) are the following:

Generative AI can poison the evidence base (Bail 2024) through the mass production of low-quality academic output, or by compromising data such as online surveys (Westwood 2025, Westwood and Frederick 2026). This is already happening.
Generative AI does not yet produce reliable data annotations for the kind of data that I am interested in (Yang et al. 2025), and even if its coding reliability improves, it will require additional effort to mitigate related issues (Baumann et al. 2025).
Relatedly, generative AI cannot improve organically if it maintains its human bias towards evidence produced in the Global North (Ramirez-Ruiz and Senninger 2025), mostly by ‘WEIRD’ individuals (Atari et al. 2023). This will be hard and slow to solve.
Last but not least, generative AI will be used to erode scientific authority, to the benefit of those who are interested in attacking the contribution that scientific (and higher education) institutions make to society. This is of course far from a trivial issue.

The issues listed above are all real, hard to solve, and controversial insofar as some people have a vested interest in seeing them not addressed, at least not in the short term.

None of these issues will stop me from installing and trying out ellmer one day. However, I do expect this to happen within a scientific environment that will have acknowledged each issue in one way or another, and formulated guidelines to address them.

Are we there yet?

This post was inspired by the /ai ‘manifesto’, which I discovered thanks to Andrew Heiss. I obtained some of the cited references through Jessica Hullman's ‘New course on generative AI for behavioral science’ blog post.

Update (April 4, 2026): added the reference to Ahmed et al. (2026) after reading Natalie Hogg's blog post on LLMs, which I found via Minas Karamanis' blog post on the same topic (which I found via Cosma Shalizi and Kieran Healy). I also wrote a short introduction to this post on my other blog (in French).

Tutorials in Applied Statistics with R (and RStudio)

2026-02-27T00:00:00+01:00

This note documents my ongoing Tutorials in Applied Statistics with R (and RStudio), which are aimed at first-year undergraduate social scientists.

Three years ago, I published a Data Science with R course that has gone through a few iterations since then.

This year, I started teaching a short series of eight tutorials that cover more or less the same ground, although the audience is now first-year undergraduate (instead of postgraduate) students in political science.

The tutorials come with a lecture, taught by two colleagues (I teach two slightly different versions of the tutorials for each of them), which frees up time to focus on R + RStudio in class.

Teaching material

As usual, every bit of the course that I feel comfortable posting online has been published as a GitHub repository.

Some of the material repeats examples used in other courses (including a ‘failed’ course that I was forced to teach last year), but much if not most of the material is brand new.

I have also worked on three documents, which are linked to in the README file of the repository, and which are hosted on Google Docs:

What to learn and to revise for the tutorials
Survey research project instructions (if relevant)
Troubleshooting

The second document, in particular, is very much connected to how the course is assessed in one of the versions of the course. The other version uses class exams, which I cannot publish beyond the 'mock exam' that is already online.

I might post the tutorial slides one day, if I manage to purge them of links that lead to student information, and can send the exercise solutions to whoever wants them.

Possible improvements

This is very much version ‘0.x’ of the tutorials. I might try to improve a few things in the future. If the course could change its name to something more specific to social scientists, I would also welcome that.

As I wrote somewhere in the README file, the ultimate goal of the tutorials is to cover more or less the same content as Matthew Blackwell does in his Data Analysis and Politics course at Harvard University.

This year will be my 13th year teaching R and RStudio. Time flies!

sfReapportion

2025-12-26T00:00:00+01:00

This note documents the release of the sfReapportion package, which performs areal-weighted interpolation on spatial objects such as census tracts and voting districts.

A colleague of mine recently shared some code for a research project on the upcoming municipal elections in France, but the code required the spReapportion package, which has been hard to install and use for a few years, due to some of its dependencies, maptools and rgeos, having been retired in favour of the sf package.

The spReapportion package, which performs areal-weighted interpolation, was coded by a friend of mine. I decided to port his package in order for it to lose its retired dependencies, and to have it accept sf objects as well as sp ones. The result is available on CRAN and on GitHub as the sfReapportion package.

In parallel, I rewrote my other colleague's code in order to use that new package and to perform several other improvements. The first set of maps shown below comes from early results obtained with that code, which is also on GitHub.

Rationale

In France as in many if not most other countries, the census tracts, which are called IRIS, are spatially incongruous with voting districts. If one wants to use data collected at the tract level with voting data collected at the district level, then one first has to interpolate/reapportion that data to the spatial boundaries of voting districts.
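The reapportionment step can be sketched directly with the sf package. This is only a minimal illustration of extensive (count-based) areal weighting, not the sfReapportion interface: the square geometries, the `pop` variable and the `sq()` helper are made up for the example, and counts are assumed to be evenly spread within each tract.

```r
library(sf)

# Hypothetical helper: an axis-aligned rectangle as an sf polygon
sq <- function(xmin, ymin, xmax, ymax) {
  st_polygon(list(rbind(
    c(xmin, ymin), c(xmax, ymin), c(xmax, ymax), c(xmin, ymax), c(xmin, ymin)
  )))
}

# Two toy 'census tracts' with population counts (made-up numbers)
tracts <- st_sf(
  tract = c("A", "B"),
  pop = c(100, 200),
  geometry = st_sfc(sq(0, 0, 1, 1), sq(1, 0, 2, 1))
)

# One toy 'voting district' that straddles both tracts
districts <- st_sf(
  district = "X",
  geometry = st_sfc(sq(0.5, 0, 1.5, 1))
)

# Record the full area of each tract before cutting it into pieces
tracts$tract_area <- as.numeric(st_area(tracts))

# Cut tracts along district boundaries (sf warns that attributes are
# assumed constant over each geometry, which is the assumption made here)
pieces <- suppressWarnings(st_intersection(tracts, districts))

# Weight each piece by the share of its tract that it covers
pieces$weight <- as.numeric(st_area(pieces)) / pieces$tract_area
pieces$pop_share <- pieces$pop * pieces$weight

# Sum the weighted counts by district
result <- aggregate(pop_share ~ district, data = st_drop_geometry(pieces), FUN = sum)
result
```

In this toy case, the district covers half of each tract, so it receives half of each tract's population (50 + 100).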

The two maps below show the polling stations (or bureaux de vote, in French) of the city of Lille, the boundaries of which have been stable for several years. Each map shows the results of a distinct principal components analysis, followed by a hierarchical clustering of its principal components.

The map on the left is the interesting one. The data used for the underlying principal components analysis come from the French official statistics agency, Insee, which publishes that data at the tract-level. The data were reapportioned with sfReapportion in order to coincide with the boundaries of the voting districts.

Features

The sfReapportion package can reapportion three kinds of data:

counts, e.g. number of working-age adults in a given geographic area
proportions, e.g. percentage of pensioners in a given geographic area
weighted points, e.g. number of residents at a given set of coordinates

The latter case is the most complex one to illustrate. The two maps below show the 20th arrondissement of Paris. The map on the left shows the spatial incongruity between its polling stations and its census tracts, whereas the map on the right also shows where the voters of that arrondissement live, according to the Répertoire électoral unique (REU).

When interpolating from one (spatial) geometry to another, we want to take that information into account, in order to reapportion the data to the areas where actual observations are to be found. The results are starkly different once that correction is taken into account:

The example above is based on approximate data, as we are looking at voter addresses, rather than at the exact number of voters at a given address, but the corrective effect is still notable and possibly sufficient for our purposes.

Limitations

The sfReapportion package has only been lightly tested when it comes to its weighted modes. However, the main function, which uses unweighted population counts by default, has been thoroughly tested, and its results have been successfully reproduced with the areal package.

The sfReapportion package only performs extensive areal-weighted interpolation: for intensive or multiple (mixed) interpolation, users should turn to the areal package. Additional methods are also available from the populR package.

I do not plan to update the sfReapportion package much, as it was coded for reproducibility purposes, but users might open issues on its GitHub repository in order to ask questions or suggest improvements.

Update (March 28, 2026): version 0.2.0 of the package has been submitted to CRAN and should become available there soon. This post has been updated to document some of the new features. The code used to produce the last two plots is available from this Gist, which expands on the code provided in the README file of the package.

AI-generated code comes with security risks

2025-04-14T00:00:00+02:00

More and more people are using AI-generated code in their work, without necessarily understanding the security risks that come with that practice.

How AI-generated code happens

Generative AI services such as ChatGPT use Large Language Models to generate computer code. These models are ‘trained’ against a dataset of publicly available code.

Many users of generative AI do not seem fully aware of what ‘publicly available code’ might contain, and therefore do not really seem aware, in turn, of the security risks that come with executing AI-generated code.

Using generative AI services in a learning environment such as academia raises many concerns. Security is only one of them.

Where the security risk lies

Programming languages like R are not sandboxed. This means that these languages can execute malicious instructions such as ‘erase every image on the hard drive of that laptop,’ or ‘replace all occurrences of “Jewish” in that text with “kike”.’

Sandboxing the execution of R code is possible, but this is not how R runs by default.
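A harmless sketch of what ‘not sandboxed’ means in practice: any R code, whatever its origin, can create and delete files with the same permissions as the user who runs it. The file below is a throwaway temporary file created for the purpose of the example.

```r
# Create a throwaway file in the session's temporary directory
victim <- tempfile(fileext = ".txt")
writeLines("some important notes", victim)
existed <- file.exists(victim)

# Nothing in a default R session prevents code from deleting it again,
# without any prompt or confirmation
unlink(victim)
gone <- !file.exists(victim)

c(existed = existed, gone = gone)
```

The same calls would work on any file that the user is allowed to modify, which is why the origin of the code matters.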

The risk is real and already active

Just like human languages have already been ‘poisoned’ in various ways, some of the computer code that makes up the public codebase on which Large Language Models are trained has already been ‘poisoned’ in various ways.

One of the ways that this has happened is through software packages. It is very easy to bundle harmful or malicious code into a software package, and then to give it a name that resembles the name of a legitimate software package.

Executing R code that contains such a package will pose a security threat to the user, equivalent to that of opening emails or attachments sent by unknown sources. The consequences can be relatively innocuous, or extremely serious.

Both AI-generated code and inattentive users can be misled into referring to these harmful or malicious software packages in their own code. The vulnerability will be triggered when the code is executed.

This scenario is not a view of the future. It is already happening.

Real-world example threat

Even a user like myself, who has learnt how to code in R for research purposes, can very easily write up a malicious software package.

An example of such a software package might do the following:

Scan all text files on disk for credit card information
Hide that information in a website address
Automatically open a Web browser and point it to that address
Collect the credit card information server-side
Delete as many files as possible on the hard drive

The steps above can be executed without the user noticing at all, or might execute in part or in full before the user can stop them from happening.

Privacy and security breaches of the sort are very easy to implement, and have been implemented in virtually all programming languages.

The risk is of course not limited to AI-generated code. Executing computer code from any untrusted source can lead to the same issues.

How to minimise the risk

R users should always check where their packages come from.

R users who use AI-generated code should be even more careful, and should also warn other users that their code was at least in part AI-generated.
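One way to put that advice into practice is to list the packages that a piece of code refers to before running it, and to compare them against packages that one already trusts. The sketch below uses base R only; the `trusted` allow-list and the `code` snippet are hypothetical, and the text matching is deliberately crude (it will miss packages loaded in less common ways).

```r
# Hypothetical allow-list of packages that the user already trusts
trusted <- c("dplyr", "ggplot2", "igraph", "sf", "tidyr")

# Hypothetical code to inspect, read as text and NOT executed
code <- c(
  "library(dplyr)",
  "library(ggplot3)",
  "x <- igraph::make_ring(10)",
  "requireNamespace('sf')"
)

# Package names passed to library(), require() or requireNamespace()
calls <- unlist(regmatches(code, gregexpr(
  "(library|require|requireNamespace)\\([\"']?[A-Za-z][A-Za-z0-9.]*", code
)))
from_calls <- sub(".*\\([\"']?", "", calls)

# Package names used as pkg:: or pkg::: prefixes
prefixes <- unlist(regmatches(code, gregexpr(
  "[A-Za-z][A-Za-z0-9.]*(?=:::?)", code, perl = TRUE
)))

used <- unique(c(from_calls, prefixes))
unknown <- setdiff(used, trusted)

# For each unknown name, find the closest trusted name by edit distance:
# a very small distance suggests a look-alike (typosquatted) package name
d <- adist(unknown, trusted)
flag <- data.frame(
  package = unknown,
  looks_like = trusted[apply(d, 1, which.min)],
  distance = apply(d, 1, min)
)
flag
```

An unknown package is not necessarily malicious, but a name that sits one character away from a well-known package deserves a closer look before anything gets installed or executed.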

It goes without saying that I have never designed, and will never design, the kind of attack described in this note.

Making R work in government

2024-09-29T00:00:00+02:00

This year's Ihaka Lecture is about making R work in government. It was delivered by Peter Ellis, the Director of the Statistics for Development Division at the Pacific Community (SPC).

A lot of the talk is based on very direct comparisons between R and other software:

The purpose of these comparisons is often to assert that R can do many forms of government analytics better than other software such as a spreadsheet editor (typically, Microsoft Excel). In this case, R replaces the older solution.

However, the talk also provides multiple examples of situations where R will have to be articulated with other tools, such as SQL or JavaScript:

In these cases, R integrates with the solutions in place. This is a very likely scenario in most organizations, and one that should be given a lot of attention when teaching R to anyone who is already immersed in an analytics workflow such as government analytics.

igraph 2.0.0

2024-06-13T00:00:00+02:00

The igraph R package has reached version 2.0.0.

The igraph package is based on a C library, which is now fully available under the newer versions of the package:

This major release brings development in line with the igraph C library. Version 1.6.0 of the R package used version 0.9.10 of the C core. The changes in the 0.10 series of the C core are now taken up in version 2.0 of the R package.

There are a few breaking changes, but not that many.

A lot of work has been done upstream to have the authors of packages that use igraph update their own code if needed. Thanks to the help of Kirill Müller, I recently updated my own ggnetwork package to that effect.

The igraph package has arguably the same stature in the R software ecology as the networkx package has in the Python software ecology. It is great news for the R community that it has made such progress in recent months and years.

A security issue with R serialization

2024-05-24T00:00:00+02:00

A security issue has been found with how the R language serializes objects, and patched since.

The security issue is documented under CVE-2024-27322. It affects the serialization functions that were advertised in an earlier note.

The R Core Team recently reported that the issue has been fixed as of R 4.4.0, and that ‘any attack vector associated with it has been removed.’
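A quick way to check whether a given R installation includes that fix is to compare its version number to 4.4.0. The sketch below assumes nothing beyond base R; the `is_patched()` helper is a hypothetical name introduced for the example.

```r
# Hypothetical helper: TRUE if the version is R 4.4.0 or later
is_patched <- function(version = as.character(getRversion())) {
  numeric_version(version) >= numeric_version("4.4.0")
}

# Check the R session that is currently running
if (!is_patched()) {
  warning("R < 4.4.0: avoid reading serialized (RDS) files from untrusted sources")
}
```

Even on a patched version, serialized files from unknown sources deserve the same caution as any other untrusted code.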

This episode is a reminder that R is a programming language, and as such, that it raises the same security concerns as any other programming language.

Slightly over a decade ago, these concerns led Jeroen Ooms to develop the RAppArmor package, in order to enable users to restrict the execution environment of R through dynamic sandboxing.

Update (May 28, 2024): thanks to R Weekly for mentioning this note.

Data Science with R (and RStudio)

2023-08-23T00:00:00+02:00

This blog has been silent for a while, and the Covid-19 pandemic has forced me to ditch my R to-do list for 2021. I did, however, manage to assemble a few R-related things in the past couple of years. This note documents the main one, a Data Science with R (and RStudio) course aimed at social scientists.

Historical side note

Around two years ago, I was offered the opportunity to teach R again at Sciences Po, in Paris, in a spirit close to the Stata-based course that I have been teaching there for over ten years.

I first taught R to social scientists in 2013, but had not repeated the experience since then, except through various short and often focused workshops. I almost got to teach such a course in 2017, just as RStudio Desktop was turning 1.0, but that course failed to materialize.

Many things have changed since 2013, and there is now much higher demand to teach R (and RStudio) to social science audiences. R and RStudio have improved a lot, and the tidyverse, which recently turned 2.0 while still changing a lot, has become a core component of most courses, including mine.

Teaching material

My own attempt to teach R, RStudio and the tidyverse in 2023 has been online for a few months, in the form of a GitHub repository with a few wiki pages, including a long list of readings, videos and Web links, and another list of other R courses.

I have also uploaded a tentative syllabus for the course:

The course has only run once so far, and there are many issues with it that I will try to fix in the coming months. The repository is also missing some essential course items (the slides, and the solutions to the exercises), which I am however happy to share privately by email.

A cool aspect of the course is that another instructor, Kim Antunez, will be teaching her own fork of it in the next few weeks. Kim has invested a lot into turning the course into a full-fledged Quarto website, which I will share in a follow-up post once she is done building it.

My own way of teaching the course is more old-school, as I rely on weekly emails and a shared Google Drive folder. I will, however, put some effort into improving the slides and giving the course a Web page, in order to make it more fully and easily accessible online.

Going forward

I feel that I already have enough material to assemble a more advanced R course for social scientists, but first need to streamline this introductory course a bit more, in order to make the reading list, especially, a bit more focused and manageable.

I also feel that there will soon be more changes to the tidyverse that I will have to take into account. I still, for instance, use the %>% pipe for chain operations, whereas the current trend is to use the native |> pipe, introduced in R version 4.1.0, whenever possible.
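For simple chains, the two pipes are interchangeable, as the short sketch below illustrates. It assumes R 4.1.0 or later for the native pipe, and only runs the %>% version if the magrittr package happens to be installed; the input vector is arbitrary.

```r
x <- c(4, 9, 16)

# Nested calls, read from the inside out
nested <- sum(sqrt(x))

# Native pipe (R >= 4.1.0), read from left to right
native <- x |> sqrt() |> sum()

# magrittr pipe, only if the package is available
if (requireNamespace("magrittr", quietly = TRUE)) {
  `%>%` <- magrittr::`%>%`
  piped <- x %>% sqrt() %>% sum()
}

native
```

The two pipes do differ in some respects (for instance, in how they handle placeholders), so longer chains might need small adjustments when switching from one to the other.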

This note is tangentially related to my previous notes on teaching with RStudio, on R as a data science language and on other technologies for data science.

A personal R to-do list for 2021

2020-12-13T00:00:00+01:00

This note lists the main things that I will be doing with R next year.

I took some kind of a break from R over the past 18 months. I plan to change that this coming year, and have compiled the following list of things that I want to explore, or come back to.

R Markdown

While I have fully transitioned towards “tidy data” and its wonderful packages, I am still not the type of R user who works in R Markdown documents (notebooks) like David Robinson so brilliantly illustrates in his videos.

I hope to get there next year, because it makes sense from a reproducibility perspective.

Panel models

Most of my quantitative methods courses involve either basic frequentist regression models, or models used on panel data by political scientists, who have imported a lot of their modelling standards from econometrics and political economy. The models and replication material that come with published articles are mostly coded in Stata.

One thing that I need to do this coming year is to check how easy (or difficult) it is, in 2021, to replicate those Stata-coded panel data models in R. This will involve looking mostly at cross-section time-series (CSTS) data, usually measured at the country level, and fitting some of the regression models available in Stata.

Web scraping

I have been wanting to upgrade my Web scraping skills for some time, and while my initial plan was to learn enough Python to switch to that language (and its excellent libraries) for Web scraping, I keep coming back to R due to lack of proper learning time and too few professional incentives to code in Python.

The specific thing that I will be coming back to is headless browsing.

Network models

Network models, and exponential random graph models (ERGMs) in particular, have improved a lot in the past few years. I have followed the literature at a distance, and need to dive into it again, especially for the part that focuses either on generalizing ERGMs beyond binary responses, or on taking time (temporal dependence) into account.

Bayesian models

I received my copy of Gelman, Hill and Vehtari's Regression and Other Stories this summer, and do not want that book to end up with Harrell's Regression Modeling Strategies and McElreath's Statistical Rethinking on my list of books that I want to read but might never get around to. (The list is much longer than that, and also has everything by Hastie and Tibshirani on it.)

Let's call this to-do item my annual attempt at transitioning further towards Bayesianism, which is made easy in R thanks to the rstanarm package, to Bürkner's brms package, to Harrell's rmsb package, and McElreath's rethinking package.

Machine learning

The tidymodels framework and Julia Silge's videos offer a nice invitation to dive deeper into those things that many of us explored when "machine learning" was the keyword that any aspiring methodologist (or data scientist, or else) had to know something about.

Going back to learning about machine learning is something that I look forward to, and which I plan to do while looking again at Cosma Shalizi's course on data mining from 2019.

The to-do list above will have to find its place alongside coding in Stata for one of my oldest courses, plus reading about many other things that have little to do with R or statistics. Let's see if that will happen.

Update (December 14, 2020): corrected the authors of the brms package, with thanks to Dieter Menne, who spotted and reported the error.

Remember the change in the sample() function of R 3.6.0

2019-08-07T00:00:00+02:00

This note documents how the sample() function has changed since R 3.6.0, and how to reproduce its previous behaviour.

A recent blog post by Christian Robert reminded me that R had to fix its sample() function in R 3.6.0 and above.

The issue that used to affect the pseudo-random number generator (PRNG) at the core of the function is documented in a note by Kellie Ottoboni and Philip B. Stark, “Random problems with R,” which was extensively discussed on the R-devel mailing-list.

The note explains that (part of) the PRNG used by R 3.5.1 does not correct for the uneven spacing of binary floating-point numbers. The resulting quantization error produces biased selection probabilities, which are severe enough for the PRNG not to qualify as sufficiently pseudo-random.

The issue got patched in R 3.6.0, which also introduced a way to reproduce the former behaviour of the PRNG. The method is well documented on R blogs like Revolution Analytics or J. Kenneth Tay's Statistical Odds & Ends, and consists in setting the sample.kind argument of the RNGkind() function to "Rounding" before calling the sample() function.
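A minimal sketch of that adjustment, assuming R 3.6.0 or later; the seed and the sample size are arbitrary choices made for the example.

```r
# Current default sampler ("Rejection") in R >= 3.6.0
set.seed(2019)
new_draw <- sample(10, 3)

# Switch back to the pre-3.6.0 sampler; R warns that it is non-uniform,
# which is precisely the issue that got patched
suppressWarnings(RNGkind(sample.kind = "Rounding"))
set.seed(2019)
old_draw <- sample(10, 3)

# Restore the current default, so that later code is unaffected
RNGkind(sample.kind = "Rejection")

list(new = new_draw, old = old_draw)
```

With the "Rounding" sampler active, old_draw should match what the same seed and call produced before R 3.6.0, whereas new_draw reflects the corrected behaviour.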

This is of course only useful if one depends on a particular random number generator and seed number, as set through set.seed, to reproduce the behaviour and results of R code written and executed before R 3.6.0.

At that stage, you might be tempted to add generating random numbers to the short list of hard things in computer science—which would be, in my view, entirely correct.

This note is obviously of the ‘note to self’ kind. The previous one in that category (successfully) aimed at reminding me to use the RDS format.

Update (August 14, 2019): to change the local state of the PRNG without affecting its global state, see this note by Evgeni Chasnovski.

Update (August 14, 2019): thanks to R Weekly for mentioning this note.