Remember to use the RDS format

Note to self – Remember to serialize R objects as RDS files when it makes sense.

Importing Stata data into R

The European Social Survey recently announced that it had added Round 7 of its survey to its cumulative dataset, which can be downloaded in CSV, SPSS or Stata format.

My instinctive preference for storing data is CSV, but most survey variables come with detailed variable and value labels, which a CSV file cannot store.

Furthermore, as is the case in the European Social Survey, the missing values of survey data generally take several different values to code for different forms of nonresponse, depending on whether the respondents “did not know” what to answer, provided “no answer,” or “refused to answer” the question.

For these reasons, I downloaded the European Social Survey as a Stata dataset, only to realise later that the data had been produced with Stata 14, which means that it cannot be opened with older versions of Stata unless the data were saved with the saveold command and with the appropriate argument for my version of Stata.

Fortunately, I was able to read the data in R with haven. The package, which wraps around the ReadStat C library, can import SAS, SPSS and Stata files. Once imported, the data are available as a standard data frame, with value labels accessible via functions like print_labels and as_factor.
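As a minimal sketch of that workflow, the following writes a tiny labelled dataset to a temporary DTA file and reads it back with haven. The variable `trust` and its labels are made up for illustration and do not come from the European Social Survey.

```r
library(haven)

# Hypothetical toy data: one labelled integer variable, not actual ESS content
toy <- data.frame(trust = c(1L, 2L, 2L, 1L))
toy$trust <- labelled(toy$trust, labels = c(Low = 1L, High = 2L))

# Round-trip through a temporary Stata file
dta_path <- tempfile(fileext = ".dta")
write_dta(toy, dta_path)
d <- read_dta(dta_path)

# Inspect the value labels, then convert the variable to an R factor
print_labels(d$trust)
trust_factor <- as_factor(d$trust)
levels(trust_factor)
```

Converting with as_factor replaces the numeric codes with their labels, which is usually what is needed for tabulation or plotting.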

Saving the data as an RDS file

Another issue that I then faced with the European Social Survey dataset was its size: the Stata DTA file for the complete (all variables, all waves) version of the cumulative dataset is only 103.5 MB compressed, but 3.16 GB uncompressed.

In comparison, the CSV file for the same dataset, which does not contain labels or detailed missing values, is 58.1 MB compressed and 559.7 MB uncompressed.

Here again, R offers a superior alternative to both the CSV and Stata formats: by saving the file as an RDS file, which creates a serialized version of the dataset and then saves it with gzip compression, I was able to bring the size of the dataset down to 51.6 MB.
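To illustrate the idea on a small scale, here is a sketch that saves the same data frame as CSV and as RDS and compares the two files. The toy data are invented for the example, and the size difference on real survey data will depend on its content.

```r
# Toy data frame standing in for a much larger survey file
set.seed(1)
toy <- data.frame(
  id = 1:10000,
  country = sample(c("BE", "DE", "FR"), 10000, replace = TRUE),
  score = sample(0:10, 10000, replace = TRUE)
)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write.csv(toy, csv_path, row.names = FALSE)
saveRDS(toy, rds_path)  # compress = TRUE (gzip) is the default

# Compare on-disk sizes, in bytes
file.size(csv_path)
file.size(rds_path)

# Check that the object read back matches the original
restored <- readRDS(rds_path)
all.equal(toy, restored)
```

Unlike CSV, the RDS file stores the R object itself, so column types and attributes come back as they were saved.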

Note that, when loaded into R, the RDS object still takes around 3 GB of (live) memory.
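The gap between disk size and memory size can be checked directly. This is a small sketch on an invented, highly repetitive vector; object.size gives an estimate of the memory used by the object once it is loaded.

```r
# Toy data: a highly repetitive vector compresses well on disk
x <- rep(c(1.5, 2.5, 3.5), times = 100000)
rds_path <- tempfile(fileext = ".rds")
saveRDS(x, rds_path)

# Size on disk (bytes) versus estimated size of the object in memory
disk_bytes <- file.size(rds_path)
mem_bytes <- as.numeric(object.size(readRDS(rds_path)))

disk_bytes
mem_bytes
```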

The full code used to convert the European Social Survey data from the DTA (Stata) to the RDS (R) format follows. The code requires the haven package, which is part of Hadley Wickham's tidyverse package suite.
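The original listing is not reproduced here. What follows is only a minimal sketch of such a conversion: the helper name `dta_to_rds` and the file names in the usage comments are placeholders, not the ones used in the original post.

```r
library(haven)

# Hypothetical helper (not from the original post): read a Stata file
# and write it back out as a gzip-compressed RDS file
dta_to_rds <- function(dta_path, rds_path) {
  d <- read_dta(dta_path)
  saveRDS(d, rds_path)
  invisible(rds_path)
}

# Usage, with placeholder file names:
# dta_to_rds("ess_cumulative.dta", "ess_cumulative.rds")
# ess <- readRDS("ess_cumulative.rds")
```

When reading the RDS file back in a later session, it helps to load haven first, so that the labelled variables print and convert as expected.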


Update (December 14, 2016): following a discussion on Twitter, it appears that the data mentioned in this note can be compressed quite efficiently in Stata. That operation, however, requires Stata 14 or above, assuming Stata keeps its commitment to backward compatibility. There is currently no other way to load the file in earlier versions of Stata.

  • First published on December 12th, 2016