Remember the change in the sample() function of R 3.6.0

This note documents how the sample() function changed in R 3.6.0, and how to reproduce its previous behaviour.

A recent blog post by Christian Robert reminded me that the sample() function had to be fixed in R 3.6.0.

The issue that used to affect the pseudo-random number generator (PRNG) at the core of the function is documented in a note by Kellie Ottoboni and Philip B. Stark, “Random problems with R,” which was extensively discussed on the R-devel mailing list.

The note explains that (part of) the PRNG used by R 3.5.1 does not correct for the uneven spacing of binary floating-point numbers. The resulting quantization error biases the selection probabilities, severely enough for the results not to qualify as adequately pseudo-random.
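The bias can be made visible in R 3.6.0 and above, where both the old ("Rounding") and the new ("Rejection") sampling algorithms are available through the sample.kind argument of RNGkind(). The sketch below assumes R >= 3.6.0 and the default Mersenne-Twister generator; the helper share_odd() and the variable names are mine, chosen for illustration. It draws a million integers between 1 and m, with m set to two fifths of 2^32, and measures the share of odd values, which a uniform sampler should put close to one half.

```r
# Assumes R >= 3.6.0 (sample.kind does not exist in earlier versions)
# and the default "Mersenne-Twister" generator.
old_kind <- RNGkind()        # c(kind, normal.kind, sample.kind)
m <- (2/5) * 2^32            # about 1.7e9, i.e. below 2^31
n_draws <- 1e6

# Illustrative helper: share of odd values among n_draws draws from 1..m
share_odd <- function(kind) {
  # RNGkind() warns when the non-uniform "Rounding" sampler is selected
  suppressWarnings(RNGkind(sample.kind = kind))
  set.seed(1)
  x <- sample(m, n_draws, replace = TRUE)
  mean(x %% 2 == 1)
}

odd_rounding  <- share_odd("Rounding")   # pre-3.6.0 algorithm
odd_rejection <- share_odd("Rejection")  # algorithm used since R 3.6.0

# Put the sample.kind setting recorded above back in place
suppressWarnings(RNGkind(sample.kind = old_kind[3]))

print(c(rounding = odd_rounding, rejection = odd_rejection))
```

With the "Rounding" sampler, the share of odd values should come out close to 0.6 instead of 0.5, because roughly 2^32 equally spaced floating-point values get mapped onto m integers, and they cannot be split evenly.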

The issue was patched in R 3.6.0, which also introduced a way to reproduce the former behaviour of the PRNG. The method is well documented on R blogs like Revolution Analytics or J. Kenneth Tay's Statistical Odds & Ends, and consists in setting the sample.kind argument of the RNGkind() function to "Rounding" before calling the sample() function.
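A minimal sketch of that method, assuming R >= 3.6.0 (the seed, the vector being sampled and the variable names are arbitrary, illustrative choices):

```r
# Assumes R >= 3.6.0, where RNGkind() accepts a sample.kind argument.

# Default behaviour since R 3.6.0 ("Rejection" sampling)
set.seed(42)
new_draw <- sample(1:10, 3)

# Switch to the pre-3.6.0 algorithm; R warns that the non-uniform
# "Rounding" sampler is in use, which is expected here.
suppressWarnings(RNGkind(sample.kind = "Rounding"))
set.seed(42)
old_draw <- sample(1:10, 3)   # meant to match what older R versions returned

# Switch back to the current default once the legacy code has run
RNGkind(sample.kind = "Rejection")
```

The same setting can also be passed directly to set.seed(), as in set.seed(42, sample.kind = "Rounding"). In both cases it stays in effect for the rest of the session until it is changed back.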

This is of course only useful if one depends on a particular random number generator and seed number, as set through set.seed, to reproduce the behaviour and results of R code written and executed before R 3.6.0.

At this point, you might be tempted to add generating random numbers to the short list of hard things in computer science, which would be, in my view, entirely correct.

This note is obviously of the ‘note to self’ kind. The previous one in that category (successfully) aimed at reminding me to use the RDS format.

Update (August 14, 2019): to change the local state of the PRNG without affecting its global state, see this note by Evgeni Chasnovski.
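As a rough illustration of the idea, and not necessarily the approach taken in that note, the sketch below wraps the legacy sampler in a function that restores the global PRNG state on exit. The helper with_legacy_sample() is hypothetical (it is not part of base R), and the code assumes R >= 3.6.0.

```r
# Hypothetical helper (not part of base R): evaluate `expr` with a fixed seed
# and the pre-3.6.0 "Rounding" sampler, then restore the global PRNG state.
# Assumes R >= 3.6.0.
with_legacy_sample <- function(seed, expr) {
  old_kind <- RNGkind()  # c(kind, normal.kind, sample.kind)
  has_seed <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)
  old_seed <- if (has_seed) get(".Random.seed", envir = globalenv()) else NULL

  on.exit({
    # Restore the generator settings first, then the exact seed state
    suppressWarnings(RNGkind(kind = old_kind[1],
                             normal.kind = old_kind[2],
                             sample.kind = old_kind[3]))
    if (has_seed) {
      assign(".Random.seed", old_seed, envir = globalenv())
    } else if (exists(".Random.seed", envir = globalenv(), inherits = FALSE)) {
      rm(".Random.seed", envir = globalenv())
    }
  })

  suppressWarnings(set.seed(seed, sample.kind = "Rounding"))
  force(expr)  # the lazily passed expression is evaluated here
}

# Usage: the draw uses the legacy sampler, the session's PRNG is left untouched
legacy_draw <- with_legacy_sample(42, sample(1:10, 3))
```

The design relies on R's lazy evaluation of function arguments: the expression is only evaluated once the legacy settings are in place, and on.exit() undoes them even if the expression fails.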

Update (August 14, 2019): thanks to R Weekly for mentioning this note.

  • First published on August 7th, 2019