https://dataingovernment.blog.gov.uk/2019/05/02/a-discovery-into-data-publishing-formats/

A discovery into data publishing formats

Members of the Race Disparity Unit viewing notes on a wallchart

At the Race Disparity Unit, we produce Ethnicity facts and figures. This is a service which presents government data showing the different experiences of people from a variety of ethnic backgrounds.

We show the statistics using words, charts and tables, and we also make all the data available as CSV download files. Our hope is that this enables some of our users, such as analysts or data journalists, to reuse the data in their own work.

However, simply publishing lots of separate CSV files may not meet all the user needs. It can make comparing different datasets or building tools that automatically process the data harder to do.

To address this, we recently completed a short discovery which looked at how datasets are published across government, focusing on CSV datasets. We spoke to colleagues in the Office for National Statistics (ONS), Department for Work and Pensions and the Government Digital Service. We also spoke to the Open Data Institute, and spent some time exploring different data platforms (like CKAN and Swirrl) and publishers (like data.gov.uk and London DataStore).

Current data landscape

Our research suggested that open data is currently in the ‘trough of disillusionment’, a phase of the technology hype cycle that usually follows a peak of expectation and early adoption.

It also suggested that government open data platforms are not keeping pace with that expectation. One of the interviewees felt that historically there has been a focus on quantity of datasets over quality. Publishing more and more datasets can make it hard to find the ones that are useful to you.

Work should be done to ensure consistency across large numbers of datasets and to keep them up to date. This work supports the Government Transformation Strategy which commits to making “better use of data”.

This has already begun. For instance, the team at data.gov.uk are publishing guidance to help data publishers and the ONS are exploring how to change their data publishing approach - from publishing spreadsheets to publishing data. Others have written about wanting to start a data revolution in government.

Moving to consistent formats

When assessing the quality of published open data, it’s clear that a variety of formats are used in different ways. The statistics section of GOV.UK contains a mixture of spreadsheets in ODS, Excel or CSV format – and sometimes the data is only available embedded within a PDF file.

Whilst CSV files are often assumed to be the easiest to use and most interoperable format, we realised that they can be problematic too. In a 2017 research paper, the Centrum Wiskunde & Informatica in the Netherlands highlighted some of the issues caused by things like multiple header rows, different character encodings and blank rows.
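As a rough illustration of the kind of problems described above, here is a minimal Python sketch that flags a non-UTF-8 encoding, blank rows and rows whose width differs from the first row. The function name `find_csv_issues` and the sample data are made up for this example, and it does not try to detect multiple header rows, which usually needs a human eye or a dataset-specific rule.

```python
import csv
import io


def find_csv_issues(raw: bytes) -> list[str]:
    """Return notes on some common CSV problems found in the raw bytes."""
    issues = []

    # Encoding: try UTF-8 first, then fall back to Latin-1, which accepts any byte.
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        issues.append("not valid UTF-8")
        text = raw.decode("latin-1")

    rows = list(csv.reader(io.StringIO(text, newline="")))
    if not rows:
        return issues + ["file is empty"]

    # Assumption: the first row is the single header row and sets the width.
    width = len(rows[0])
    for number, row in enumerate(rows, start=1):
        if not any(cell.strip() for cell in row):
            issues.append(f"row {number} is blank")
        elif len(row) != width:
            issues.append(f"row {number} has {len(row)} cells, expected {width}")
    return issues


# Made-up sample: one blank row and one row with an extra cell.
sample = b"ethnicity,value\r\nAsian,10\r\n\r\nBlack,12,extra\r\n"
print(find_csv_issues(sample))
```

A check like this only reports problems. Deciding how to repair them is still a judgement for the publisher.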

There is a proposal for government to adopt the standard RFC 4180, which aims to solve some of these issues by being more specific about the CSV format, for example how fields are separated, quoted and terminated. Government has not yet formally adopted it, but its use seems to be widespread, and there are tools available, like CSVLint, to check that files meet this format.
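To show what an RFC 4180-style file looks like in practice, here is a small sketch using Python's standard `csv` module. The settings are spelled out to mirror the RFC's main rules (comma separator, CRLF line endings, double quotes around fields that need them, and embedded quotes doubled). The helper name `to_rfc4180` is hypothetical, and this is an illustration rather than a full conformance check.

```python
import csv
import io


def to_rfc4180(header: list[str], rows: list[list[str]]) -> str:
    """Serialise a header and rows as CSV text in the style of RFC 4180."""
    # The RFC says every record should have the same number of fields.
    for row in rows:
        if len(row) != len(header):
            raise ValueError(f"expected {len(header)} fields, got {len(row)}")

    buffer = io.StringIO(newline="")
    writer = csv.writer(
        buffer,
        delimiter=",",               # fields are separated by commas
        quotechar='"',               # fields may be wrapped in double quotes
        doublequote=True,            # an embedded quote is written as ""
        lineterminator="\r\n",       # records end with CRLF
        quoting=csv.QUOTE_MINIMAL,   # quote only when a field needs it
    )
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up example with a comma and quotes inside a field.
print(to_rfc4180(["measure", "value"], [['Say "hello", then wave', "1"]]))
```

These settings happen to match the module's default `excel` dialect. Writing them out makes the intent visible to whoever maintains the publishing script.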

On their own though, CSV files do not do much to describe the data within them. This is often the job of the web pages which link to the files, but there are also proposals for formats to make this metadata ‘machine-readable’. There are at least 2 different, incompatible standards for this: CSV on the Web and Frictionless Data. Both have some tools available and some examples of use, but it feels too early to say which is most likely to have long-term value.
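To give a feel for how the two approaches differ, the sketch below builds a deliberately simplified metadata document in each style for the same CSV file. The file name, package name and column names are hypothetical, only a few properties are shown, and the specifications themselves remain the authority on what a valid document must contain.

```python
import json

# Hypothetical columns for an example file: (name, type).
COLUMNS = [("ethnicity", "string"), ("geography_code", "string"), ("value", "number")]


def csvw_metadata(csv_url: str, columns: list[tuple[str, str]]) -> dict:
    """Simplified CSV on the Web metadata, usually saved as <file>.csv-metadata.json."""
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": csv_url,
        "tableSchema": {
            "columns": [{"name": name, "datatype": kind} for name, kind in columns]
        },
    }


def frictionless_descriptor(name: str, csv_path: str, columns: list[tuple[str, str]]) -> dict:
    """Simplified Frictionless Data Package descriptor, usually saved as datapackage.json."""
    return {
        "name": name,
        "resources": [
            {
                "name": name,
                "path": csv_path,
                "schema": {
                    "fields": [{"name": col, "type": kind} for col, kind in columns]
                },
            }
        ],
    }


print(json.dumps(csvw_metadata("example.csv", COLUMNS), indent=2))
print(json.dumps(frictionless_descriptor("example", "example.csv", COLUMNS), indent=2))
```

Both documents carry much the same information about column names and types, but the structure and vocabulary differ, so tools written for one will not read the other without conversion.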

Get involved

We’re interested in how others across government are publishing their data, and which flavours of CSV they have chosen and why. If you have any thoughts, please leave a comment below.

Sharing and comments

8 comments

  1. Comment by John Wilkins posted on

    Thanks for this Frankie. And keep pushing forward.

    We're in touch by email, but it's worth me separately flagging that the Government Statistical Service (GSS) is currently working across Government on projects that try to grapple with very similar problems. Although there are many schools of thought on what approaches to take and audiences to target, the GSS largely tries to cater separately for users who want to learn the full story from a single formatted and footnoted spreadsheet via statistical tables, and for the open data community by publishing machine-readable raw data on portals like data.gov.uk.

    As it happens, the standards that Government statisticians are expected to meet when following both these approaches are being reviewed at present by different strands of the GSS Presentation and Dissemination Committee (https://gss.civilservice.gov.uk/about-us/governance/presentation-and-dissemination-committee/). I've been leading work on the formatted statistical tables side via the Web Dissemination Sub-Group (which Terence who commented above knows well), whilst unsurprisingly it is the Open Data Sub-Group who are leading on creating new guidance for GSS open data.

    Whilst these projects are targeted at statisticians, you may wish to at least keep them in mind - it is easy to forget that communities like ours often have our own standards that we're expected to meet, and that we are held publicly accountable when we don't.

  2. Comment by Fajer Qasem posted on

    Great post Frankie... We're keen to hear a lot more about the practices in play by other departments and hoping to align views on how to manage data. Get in touch with us at GOV.UK Registers if you'd like to hear more and/or share what you're doing.

  3. Comment by Gary Davis posted on

    Lots of interesting points, but please be very wary of listening to your interviewees' predictions about what will help. No doubt you know this, but this article shows a few opinions already, like data quality not good enough and that "publishing lots of separate CSV files...can make comparing different datasets or building tools that automatically process the data harder to do".

    Years of people's lives have been lost trying to work out how to express metadata for intricate edge-cases. Actually most of it no-one cares about, or certainly doesn't need standardizing and making machine-readable. And everyone will tell you data should all be available on APIs, but if you watch what data journalists, analysts and data scientists actually do in most cases, they simply download the whole lot as CSV. Then they load it into a database or dataframe, so knowing the column types might be useful. Common identifiers get used, but no-one ever asked for a URI or RDF, despite what the Linked Data fanatics will try to tell you. You will get people banging on about the license being key, but the proportion of users who actually need to know the data license is tiny.

    > Our hope is that this enables some of our users, such as analysts or data journalists, to reuse the data in their own work.

    Great, so no doubt you'll want to know the typical user journeys that end up with a data journalist writing an article. Or what about a researcher for an industry publication, or writer for mainstream magazine, who can be encouraged to use more effective stats? What about activists or general public using social media who can be encouraged to use data - what are their needs, to drive this agenda?

    Anyway, keep up the good work, but please be ruthless about working with user needs, not user opinions.

    • Replies to Gary Davis

      Comment by Frankie Roberto posted on

      Thanks for your comment. You’ll be pleased to know that we’ve been talking to data users, such as journalists, researchers and policy professionals, as well as data publishers.

      Based on this, I broadly agree that making the CSVs more usable (and with fewer errors) will satisfy most of the user needs. Including keys, such as ONS geography codes, has also been repeatedly mentioned as a useful feature.

      As to CSVW and Frictionless Data: both seem like they could be useful for adding a teeny bit of extra metadata for each CSV file – but neither seems to be widely adopted yet.

  4. Comment by Terence Eden posted on

    Thanks for this post. It is frustrating that data are still locked away in PDF.
    I hope your research leads to a renewed interest in pushing forward a fully open data format. I encourage readers to leave comments on the Open Standards GitHub page - https://github.com/alphagov/open-standards/issues/58

  5. Comment by Benjy Stanton posted on

    Hi Frankie, thanks for sharing! It's great to hear about other user-centred design teams tackling this subject. Be great to have a more in depth chat some time. For what it's worth, the stakeholders I've spoken to tend to prefer CSV on the Web.

    Related: this recent blog post seems relevant too: https://www.nesta.org.uk/blog/you-can-lead-person-data-you-cant-make-them-use-it/

  6. Comment by Jamie A posted on

    Great work all, fascinating read.

  7. Comment by BEIS Transparency and Data Services Team posted on

    Hi - BEIS would be very interested to hear about your project. Team here is willing to organise & hold a x-Whitehall & ALBs Data Publishers forum. Happy to reach out via Basecamp & Slack.
    Please contact BEIS Data Services (Transparency) team.