Introduction to Data Analysis
For our final session(s), we'll have a political scientist, a journalist and a member of the Open Knowledge Foundation come and talk to us about what they do with data, how they do it, and why this concerns you. Below are a few more ideas on the topic.
Reproducible science is science that you can reenact at home to see whether the analysis checks out, and to what extent the data are reliable. It's also the idea that scientists would benefit from spending more time versioning their research and working on each other's stuff.
It outsmarts junk science. It works with R.
Let's consider data and its uses in scientific research, and let's actually include scientific publications as a special form of data. Data is not just a tool of the trade for scientists: sharing it is also part of the solution to the corruption of the scientific field by fraud and corporate control. It carries both research and ethical value for the scientific field, even though many of its members hardly care.
It is still quite difficult for scientists to share scientific writing or data. There are few open access repositories like arXiv, and open access publishing is not (yet) a dominant value of academic research. Very few scientists are actively engaged in building a more open science, probably for lack of incentives.
Calls for higher standards in reproducibility and access, as well as higher computer proficiency and higher-quality tools for code and data sharing, might be driving an interesting trend for all computational and empirical sciences. Right now, however, publishing a lot is more important than actually sharing your work with the public.
Data-driven journalism (DDJ) is more than one thing, and several of its aspects can also be called 'computer-assisted reporting'. The general principles of DDJ are well embodied in Geoff McGhee's video presentation of the topic, called “Journalism in the Age of Data”. The video covers the DDJ work done at the New York Times, which published a great recap of its “Year in Graphics” at the end of 2012.
Here's a data journalist writing about software and his job.
If you wish to reuse some of the data put together by data-driven journalists, the Guardian Data Blog, a leading DDJ initiative, has published an impressive list of all its datasets so far. The data are accessible as Google Spreadsheets that are easy to import into R.
From the perspective of a social scientist, let's add a few words. The virtual meetup of journalism, stats and hacking is a rather interesting one. The print media is not just any profession: it's one involved in mass communication and in shaping political representations.
Any long-lasting effect of 'data people' on the rules of the journalistic field and of news reporting would deserve attention. Typically, this is not an easy world to subvert: it contains hard-coded rules that might not interact well with the slowly legitimizing capital of geeks and über-geeks.
Furthermore, your future line of work might include press releases and/or press reviews and/or reports and research publications. It is not trivial to observe that what is happening to journalism is likely to affect other fields of public communication: this is already happening within IGOs, NGOs and think tanks, and is probably spilling over to many other kinds of organizations, including private sector ones.
In the words of Aaron Swartz, a political activist who died while being prosecuted by the United States federal authorities for downloading data from the JSTOR academic journal database:
The open data movement is a hammer which has gathered the support of many nails. There are the curious taxpayers, who feel their annual checks mean they deserve a peek at the interesting facts the government has collected. There are the ambitious business owners, who see an opportunity to privatize profits from work with socialized costs. And there are the self-styled activists, who believe that if we reveal the data on what the government is really doing, we will arrest corruption by exposing it to sunlight.
Below are a few initiatives. There are many more listed at Blog about Stats by Armin Grossenbacher, who works at Statistics Switzerland as Head of the Dissemination and Publication unit. For background reading, see the seminar on open data published by the Crooked Timber academic blog.
That's it! Go back to the bottom of the index page to read a few final words on the course.