Peer-reviewing software for data analysis

Categories: Data science

Duncan Garmonsway working on a laptop

At GDS we believe that making things open makes things better, which is why I recently participated in an open peer review of open code written by Public Health England for analysing their open data.

Public Health England has been publishing open data on their Fingertips API for a while, and they’ve also made it easier for people inside and outside government to analyse their data using R, via their package fingertipsR. This makes it possible to carry out data analysis in code and openly, which in turn makes analysis and decision-making more transparent.

Why use open source code for data analysis?

To summarise many other articles on the benefits of using open source code for data analysis:

  • Analyses and findings can be made reproducible by writing them in code because, unlike a point-and-click tool, the code can be re-run to confirm a result or to apply the same method to new data.
  • The choices made by the analysts can be made open, because they are embodied in the code that can be made open. This includes details that might not be mentioned in written reports, such as how missing data is handled, or how different levels of a factor were encoded.
  • Anyone inside or outside government could run an open analysis themselves and suggest improvements, because fingertipsR uses a freely available programming language and data.
  • Mistakes can be avoided by not having to reinvent the wheel – commonly used code is wrapped up in a package, to be used in lots of different data analyses.
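
As a minimal sketch of the second point, the R snippet below shows how choices about missing data and factor encoding end up written down in the code itself. The numbers and category labels are invented for illustration and are not Fingertips data, and the variable names are hypothetical.

```r
# Hypothetical indicator values for five areas; one value is missing.
# These numbers are invented for illustration, not Fingertips data.
values <- c(12.1, 9.8, NA, 11.4, 10.3)

# The analyst's choice is explicit: missing values are dropped
# before averaging, rather than treated as zero or imputed.
mean_value <- mean(values, na.rm = TRUE)

# Factor levels are encoded in a stated order instead of the
# default alphabetical one.
deprivation <- factor(
  c("low", "high", "medium", "low", "high"),
  levels = c("low", "medium", "high")
)
level_codes <- as.integer(deprivation)

print(mean_value)
print(level_codes)
```

A reader who disagrees with either choice, for example preferring to impute the missing value, can see exactly where it was made, change that one line and re-run the analysis.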

Open doesn’t mean perfect

As anyone who uses open source software knows, just because something is open doesn’t mean it is perfect. An outsider will often notice potential improvements to be made, such as making the software easy to install on their particular machine, making the documentation clear to people without the same domain expertise as the authors, and taking opportunities to use new methods and techniques that they have encountered in their own work.

One way to get an outsider’s perspective is to arrange a formal peer review, similar to the ones academic journals do before they publish a paper. Doing things formally means the people involved expect it to happen and can set aside enough time to do it well. It can also be easier to give and receive suggestions formally than it would be via unsolicited GitHub issues. A formal peer review process should find a good reviewer too – someone who is a ‘peer’ in that they use and develop similar software.

For R packages, there is even an organisation to coordinate the peer review process for you. It’s called rOpenSci, and, as its name suggests, it promotes the development of open software for science. One of the ways it does this is by arranging peer reviews of R packages.

Anyone can submit an R package for peer review by rOpenSci's expert volunteers. If the package fits rOpenSci’s scope (data retrieval from APIs is in scope), then an editor will find two appropriate people from the rOpenSci community to review it. Since the whole process is open, you can see the calibre of the reviewers by looking up their profiles from their reviews.

Getting an outsider’s perspective

Once the review is complete and any suggested improvements have been made, the package will hopefully be accepted for publication on rOpenSci's package repository, its blog (if the author writes a post), and social media. And it doesn’t stop at publication – the people in the rOpenSci community continue to help each other maintain the high quality of their packages. The net effect is felt downstream, as thousands of R users benefit from using these high-quality packages.

The peer review in practice

You can see how this worked in practice by reading the GitHub issue. Things I mentioned in the review included a fix for a failing test, removing some redundant code, updating a function to use a new feature of the API, and clarifying some points in the documentation. I also learned from the other reviewer’s neat suggestions for avoiding heavy dependencies, and from the author’s clever method for presenting a table in the docs. The author implemented all the fixes, so users are now benefiting from a couple more worked examples, a more logical flow to the documentation, reduced installation times and faster performance.

I was delighted to be asked to review this package because:

  • it connected me with other R developers in government
  • I learned from their code and from the other reviewer’s comments, and I learned something about public health
  • it bore out the GDS design principle in real life, by making something better because something was open

If you're an R developer, I encourage you to contribute to the common good by submitting your packages for review and volunteering as a reviewer. Not only will you learn and share knowledge, but you will make it easier for everyone to use and understand open government data and data analysis.
