Scraping legislative data with R: A progress report

This note reports on the progress of parlnet, a project that collects legislative data from several European parliaments (plus Israel). The project is coded in R, a choice that has had consequences for its development.

The project

In a nutshell, the parlnet project scrapes private bills from 20 national parliaments, and then converts the sponsorship information of these bills into legislative cosponsorship networks, a form of analysis that James Fowler pioneered ten years ago. I was also inspired by a blog post written by my sociologist friend Baptiste Coulmont.

The scrapers brought together in the parlnet project rely predominantly on the XML and rvest packages, while the network construction routines rely on the network package. I have discussed working with network objects in R in another note, and further discuss scraping with R below.
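To give a flavour of how these pieces fit together, here is a minimal sketch, not actual parlnet code: a hypothetical rvest scraper for sponsor names (both the URL it takes and the CSS selector are placeholders), followed by the construction of a small directed cosponsorship network with the network package, in which each cosponsor is tied to the first sponsor of a bill.

    library(rvest)     # Web scraping (parlnet also uses the XML package)
    library(network)   # network construction

    # Hypothetical scraper: extract sponsor names from a bill page.
    # Both the URL it takes and the CSS selector are placeholders.
    get_sponsors <- function(url) {
      page <- read_html(url)
      html_text(html_nodes(page, ".sponsor-name"))
    }

    # Toy sponsor lists for two bills: the first name on each bill is its
    # first sponsor, and the following names are its cosponsors.
    bills <- list(c("A", "B", "C"), c("B", "C"))

    # Directed edge list: each cosponsor is tied to the first sponsor.
    edges <- do.call(rbind, lapply(bills, function(s) {
      if (length(s) < 2) NULL else cbind(s[-1], s[1])
    }))

    # Convert names to integer ids and build the network object.
    people <- sort(unique(c(edges)))
    el <- cbind(match(edges[, 1], people), match(edges[, 2], people))
    net <- network(el, matrix.type = "edgelist", directed = TRUE)
    network.vertex.names(net) <- people
    net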

You can learn more about parlnet by reading this short presentation note, which comes with a detailed technical appendix. Both are forthcoming in the journal Network Science. I have also released static plots and interactive visualizations of the networks built by parlnet. The code for this last part of the project relies predominantly on the Sigma JavaScript library.

The language

R was a good candidate to code the parlnet project because its packages can handle both Web scraping and network construction operations. However, some of the scraping performed by parlnet requires manual input, or requires running some scripts several times in a row to get around the network errors that inevitably arise when scraping thousands of files, even from reliable sources like official parliamentary websites.
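The retry logic that this calls for is simple enough. Here is a minimal sketch around base R's download.file(); the helper name is hypothetical, and skipping files that already exist on disk is what makes re-running the script cheap.

    # Minimal retry pattern for flaky downloads; try_download() is a
    # hypothetical helper, not a function from parlnet.
    try_download <- function(url, file, attempts = 3) {
      if (file.exists(file)) return(invisible(TRUE))  # already downloaded
      for (i in seq_len(attempts)) {
        status <- try(download.file(url, file, quiet = TRUE), silent = TRUE)
        if (!inherits(status, "try-error") && status == 0)
          return(invisible(TRUE))
        Sys.sleep(2)  # wait a little before retrying
      }
      warning("failed to download ", url)
      invisible(FALSE)
    }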

As a result, the scrapers included in parlnet are not “autonomous” scripts that might be fed to a service like Morph.io, which does not currently support R code anyway. If there were less data to scrape, and if manual input were not required, a programming language like Python or Ruby would certainly have been a better fit for this project.

Since the R code for parlnet is unlikely to ever produce self-updating datasets, improvements have to focus on other aspects of the code, such as fixing existing scraper limitations, improving the measurement of existing variables, extracting more variables from the data, or facilitating access to the raw data, which are currently released as large zipped archives hosted on Zenodo.
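Until access improves, getting at the raw data means downloading and unpacking one of those archives by hand, which takes only a few lines of R; the URL below is a placeholder, not the actual Zenodo record.

    # Placeholder URL: substitute the actual Zenodo record for the archive.
    archive_url <- "https://zenodo.org/record/0000000/files/parlnet-data.zip"
    download.file(archive_url, "parlnet-data.zip", mode = "wb")
    unzip("parlnet-data.zip", exdir = "parlnet-data")
    list.files("parlnet-data", recursive = TRUE)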

To some extent, the list of improvements above derives from what the R language can and cannot do, as well as from my (necessarily limited) experience with other social science datasets and with other relevant technology, such as Git.

The future

The parlnet project is currently approaching version 3.0. Over time, it has gone from covering a single country to its current count of 20 countries and 27 parliamentary chambers. Coding the entire project with R and GitHub has led me to document all changes as I went from managing just a few scripts to a much larger array of code.

Many possible improvements to parlnet depend on external data sources, such as IPU-PARLINE, ParlGov, Open Civic Data or Wikidata. Any improvement to these sources will also indirectly improve parlnet. There is also room to improve parlnet by adding more countries to it, or by turning to forms of legislation other than bills, which I have already started exploring.

Some of these improvements would require better data sources, such as official parliamentary websites that do not lock cosponsorship data into PDF files, which are very hard to parse reliably; some technology does exist to extract text data from PDF files, but it is not yet advanced enough to be operated on large batches of files, especially files that contain handwritten text.
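For the subset of PDF files that are born-digital rather than scanned, a command-line tool like pdftotext (from the poppler utilities) can be called from R, as in the sketch below; the file name is a placeholder.

    # "bill.pdf" is a placeholder; pdftotext (poppler) must be installed.
    # This copes with born-digital PDFs only: scanned or handwritten
    # documents would require OCR, which is far less reliable at scale.
    system2("pdftotext", args = c("-layout", "bill.pdf", "bill.txt"))
    text <- readLines("bill.txt", warn = FALSE)
    head(text)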

Other improvements might come from a better understanding, on my end, of how to scrape search forms with R, a topic that I have briefly documented in another note, using an example taken directly from parlnet.
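For reference, the generic rvest workflow for search forms looks roughly like the sketch below; the URL, form field and CSS selector are all placeholders.

    library(rvest)
    # All values below are placeholders, not an actual parliamentary site.
    session <- html_session("http://www.parliament.example/search")
    form <- html_form(session)[[1]]                # grab the first form
    form <- set_values(form, keywords = "bill")    # fill in a search field
    results <- submit_form(session, form)          # submit and follow
    html_text(html_nodes(results, ".result-title"))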

  • First published on February 7th, 2016