Sustainable code for social scientists

Over at his blog “One Tip per Day”, Xianjun Dong has produced an excellent list of “15 Practical Tips For a Bioinformatician”. This note is my own version of these tips, aimed at social scientists who need to write sustainable (i.e. reproducible) code for either individual or collective research projects.

Required software

This note is based on my experience as a user of R and Git-via-GitHub. Most recommendations can be applied to other languages, such as Python or Stata, and it is certainly possible to use a different version control system (VCS) than Git+GitHub, although I strongly believe that doing so will make things more difficult than they need to be.

Things that you should always do

1. Document your code

Xianjun Dong's tips recommend that you "always document your code properly" (#5), and that you document the code with a README file (#4). It is also good practice to include credits, either directly in the code or in its documentation, when you use someone else's work (#12).

The first answer to this StackOverflow question provides some good advice on writing an effective README. When the code is not bundled as a package that can be illustrated with a few minimal examples, I have also found it very useful to include a HOWTO file with detailed replication instructions.

2. Back up your work

Xianjun Dong’s tip #6 is the very standard advice that you should follow as soon as you are using a computer: "Always backup your code timely." There are several options to back up code, from manual versioning to full-fledged versioning through Git+GitHub, which is what I would recommend for any serious project.

GitHub's release system is a great way to organise your code backups, as it also invites you to write a changelog of how your code has evolved since the last release. Using semantic versioning to number the releases properly is highly recommended, as it adds a layer of human-understandable logic to your workflow.

3. Use relative file paths

Xianjun Dong's tip #2 is about the use of temporary folders to dump the results of your code. A generalisation of his advice might be: always use relative file paths in your code, in order to make your code executable from any location. Using RStudio projects makes it very simple to work from any folder, so there is no excuse not to do this.

In R or Stata, this means that your code should not start with cd or setwd, as is commonly found in the replication material published by many journals. These lines are useless to anyone but the original author.
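The same habit carries over to Python, which this note lists among the languages the tips apply to. The sketch below is only an illustration: the helper project_path and the file names are hypothetical, and it assumes the script is run from the project folder.

```python
from pathlib import Path


def project_path(*parts: str) -> Path:
    """Build a path relative to the project folder, refusing absolute paths."""
    path = Path(*parts)
    if path.is_absolute():
        raise ValueError(f"use a relative path, not {path}")
    return path


# Hypothetical file names, resolved against the current working directory,
# which is the project folder when the script is run from there.
survey = project_path("data", "survey.csv")
figure = project_path("plots", "turnout.png")
print(survey.as_posix(), figure.as_posix())
```

Because nothing in the script names a user's home directory, the same file runs unchanged on a co-author's machine.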

4. Use valid variable names

Xianjun Dong's tip #3 is worth quoting in full (my emphasis):

Always use a valid name for your variables, column and rows of data frame. Otherwise, it can bring up unexpected problem, e.g. a ‘-’ in the name will be transferred to ‘.’ in R unless you specify check.names = FALSE.

The only special character that I use in variable names is the underscore. My recommendation is to ban all other non-[a-z0-9] characters, and to stick with lowercase characters to avoid CamelCasing, which is just plain annoying.

The same tip can be applied to function and argument names, although some people seem to have different and very strong preferences in that domain.
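A naming rule like this one is easy to enforce in code. Here is a minimal sketch in Python (the note itself is written from an R perspective); clean_name is a hypothetical helper and the column labels are made up.

```python
import re


def clean_name(name: str) -> str:
    """Turn an arbitrary column label into a lowercase, underscore-only name."""
    # Insert an underscore at lower-to-upper boundaries: "partyID" -> "party_ID".
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    # Replace every run of characters outside [a-z0-9] with a single underscore.
    name = re.sub(r"[^a-z0-9]+", "_", name.lower())
    return name.strip("_")


columns = ["Vote-Share", "GDP per capita", "partyID", "age.group"]
print([clean_name(c) for c in columns])
# prints ['vote_share', 'gdp_per_capita', 'party_id', 'age_group']
```

The function does not check for names that start with a digit, which some languages reject.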

5. Set a fixed seed number

Xianjun Dong's tip #1 is a modelling tip: whenever you find yourself using functions that include a randomised component, remember to set the seed number manually, in order for the results to be reproducible at later stages. This tip is highly recommended to anyone who works a lot with MCMC estimators.
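To make the point concrete, here is a small Python sketch (in R the equivalent step is a call to set.seed). The function simulate_mean and the seed value 2016 are arbitrary choices for illustration.

```python
import random


def simulate_mean(seed: int, n: int = 1000) -> float:
    """Average n standard-normal draws from a generator seeded with `seed`."""
    rng = random.Random(seed)  # local generator: no hidden global state
    return sum(rng.gauss(0, 1) for _ in range(n)) / n


# The seed value is arbitrary; what matters is that it is written down.
first = simulate_mean(seed=2016)
second = simulate_mean(seed=2016)
print(first == second)  # prints True: same seed, same draws
```

Passing the seed as an argument, rather than setting it once at the top of a script, also makes it harder to forget which seed produced which result.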

Things that you generally want to do

6. Archive intermediary files

Xianjun Dong's tip #7 recommends that you clean up your work before sharing it, and indeed, it is generally a good idea to delete intermediary files before sharing your code. Tip #10 recommends also saving all intermediary files before deleting them, just in case they might be useful in the future.

If you are using Git+GitHub, setting up a .gitignore file is extremely simple and will keep intermediary files out of your repositories. If those files take a lot of disk space, software like Dropbox or Google Drive will let you dump them into any cloud folder of your choice, which you can then un-sync to save disk space.

7. Use open file formats

Xianjun Dong's tip #8 is about file compression. Let's extend it to discuss file formats: whenever possible, use open file formats that do not depend on proprietary software. In the social sciences, this usually translates to: use CSV as your dataset format, and something like ZIP as your compression format.

One issue with the CSV format is that it does not handle metadata by default: as a result, you will usually have to write the codebook of your dataset as a separate text file. It is possible, however, to insert metadata into the CSV format as a YAML header, although reading the file now requires coding a small parser.
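The "small parser" does not need to be elaborate. The Python sketch below assumes a header delimited by two "---" lines and made of flat "key: value" pairs only; it is not a full YAML parser, and the function name and the example data are made up for illustration.

```python
import csv
import io


def read_csv_with_header(text: str):
    """Split a '---'-delimited metadata header from the CSV rows that follow it.

    Only flat 'key: value' lines are understood; this is not a full YAML parser.
    """
    lines = text.splitlines()
    metadata = {}
    start = 0
    if lines and lines[0].strip() == "---":
        try:
            # Position of the closing '---' line in the whole file.
            end = [line.strip() for line in lines[1:]].index("---") + 1
        except ValueError:
            raise ValueError("metadata header is never closed") from None
        for line in lines[1:end]:
            if line.strip():
                key, _, value = line.partition(":")
                metadata[key.strip()] = value.strip()
        start = end + 1
    rows = list(csv.DictReader(io.StringIO("\n".join(lines[start:]))))
    return metadata, rows


# Made-up example file: a two-field header followed by two data rows.
example = """---
title: Hypothetical turnout survey
source: made-up values for illustration
---
country,turnout
FR,0.75
DE,0.76
"""
metadata, rows = read_csv_with_header(example)
print(metadata["title"], len(rows))
```

A file without a header is read as plain CSV, so the function can be used on both kinds of files.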

If you are using Mac OS X, make sure that you do not save Mac resource forks in your ZIP archives, as the default OS X compression tool does. The best solution is to use additional software like Keka, which will also enable you to do much more when zipping files (such as adding password protection).
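If you would rather script the archive than install a tool, the following Python sketch is one possible approach. It assumes that the unwanted entries are the usual macOS ones (a __MACOSX folder, .DS_Store files and files whose names start with "._"), and zip_folder is a hypothetical helper rather than part of any library.

```python
import zipfile
from pathlib import Path


def is_mac_metadata(path: Path) -> bool:
    """True for the files and folders macOS adds alongside real content."""
    return any(
        part == "__MACOSX" or part == ".DS_Store" or part.startswith("._")
        for part in path.parts
    )


def zip_folder(folder: str, archive: str) -> list:
    """Zip `folder` into `archive`, skipping macOS metadata; return the names stored."""
    root = Path(folder)
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(root.rglob("*")):
            relative = path.relative_to(root)
            if path.is_file() and not is_mac_metadata(relative):
                zf.write(path, relative.as_posix())
        return zf.namelist()
```

Write the archive outside the folder being zipped, otherwise the archive would try to include itself.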

8. Write functions and objects

Again, Xianjun Dong's tip #11 is worth quoting in full (my emphasis):

Make your code sustainable as possible as you can. Remember the 3 major features of OOP: Inheritance, Encapsulation, Polymorphism. (URL)

This tip can translate to many different things, but if you are coding in R with the aim of writing the kind of scripts that are needed for much social science analysis, you want this tip to translate to the following:

Learn the basics of functional programming and get to write functions that use condition-signalling functions like stop, stopifnot and warning to handle exceptions.
Learn enough about R objects to be able to manipulate any kind of S3 or S4 class without getting put off. The str function will soon become your best card in the game.
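The first of these two points can be illustrated outside R as well. The Python sketch below is a loose analogy, not a translation: raise plays the role of stop and stopifnot, and warnings.warn plays the role of warning. The function vote_share and its checks are made up for the example.

```python
import warnings


def vote_share(votes, total):
    """Return votes / total as a proportion, checking the inputs first."""
    # Rough counterpart of stopifnot(): refuse inputs of the wrong type.
    if not isinstance(votes, (int, float)) or not isinstance(total, (int, float)):
        raise TypeError("votes and total must be numbers")
    # Rough counterpart of stop(): halt with an informative message.
    if total <= 0:
        raise ValueError("total must be positive")
    # Rough counterpart of warning(): report a suspicious value but carry on.
    if votes > total:
        warnings.warn("votes exceed total; share is above 1")
    return votes / total


print(vote_share(120, 400))
```

Failing early with a clear message is usually kinder to whoever replicates the analysis than a silent NaN several steps later.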

9. Engage in continuous learning

Xianjun Dong's tip #13 is very important: "keep learning." This part is the coolest part of coding: in a highly dynamic environment like the R community, you get to constantly learn new things, discover new packages, try new examples and run new models all the time. The instant gratification of new plots and results is always very welcome.

10. Engage in community feedback

Xianjun Dong's tip #12 is a complement to the previous one: whenever you learn something interesting or succeed at coding something difficult on your own, share your discoveries with others. Even if it takes the form of an obscure tweet, it will always be helpful to someone if you contribute to the documentation pool.

If you have the time to do so, contributing to StackOverflow is also a great way to improve your own coding and computing skills while helping others in the process. I did it for some time, and only regret that I do not have the time to do it now. If you do not know StackOverflow, add it right now to your bookmarks!

Things that you might need to do

11. Use parallelisation

Xianjun Dong's tip #9 recommends using parallel computing whenever possible. In the social sciences, a lot of stuff can be accomplished without ever using parallelisation. However, as soon as you start using Bayesian and/or multilevel methods, for instance, it might become necessary to use it.

Read this funny guide to learn the basics of parallelism, and then turn to packages like this one if you need to parallelise MCMC estimators. If you need more references to get started with parallelisation in R, read the CRAN Task View on High-Performance and Parallel Computing with R.
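The general pattern is the same in most languages: split the work into independent pieces, give each piece its own seed, and collect the results in order. The Python sketch below shows that pattern with the standard library; run_chain is only a stand-in for a real estimator, not an MCMC sampler, and the seeds are arbitrary.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def run_chain(seed: int, n_draws: int = 2000) -> float:
    """Stand-in for one chain: the mean of n_draws seeded normal draws."""
    rng = random.Random(seed)
    return sum(rng.gauss(0, 1) for _ in range(n_draws)) / n_draws


def run_chains(seeds, max_workers: int = 4) -> list:
    """Run one chain per seed concurrently; results come back in seed order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_chain, seeds))


results = run_chains([1, 2, 3, 4])
print(len(results))
```

Threads keep the sketch portable, but they do not speed up CPU-bound Python code. For real gains, ProcessPoolExecutor offers the same map interface, provided the calls are placed under an if __name__ == "__main__": guard.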

12. Review your ergonomics

Xianjun Dong's last tip is about making sure that you "stand up and move around after sitting for 1-2 hours." The more general point here is that your workspace needs to be ergonomically configured, and that you need to take basic health advice into account when structuring your workday.

Just as with every other tip on this list, my own example is not the best to follow, but my usual workday does include at least 30 minutes of physical exercise, so that I do not turn sedentary. I have also recently begun to limit my coffee consumption to two cups in the morning and one in the afternoon, which I recommend to everyone.

This last tip will not make your code more sustainable, but it will make you last longer despite your coding addiction.

First published on February 19th, 2016
