We show the statistics using words, charts and tables, and we also make all the data available as CSV download files. Our hope is that this enables some of our users, such as analysts or data journalists, to reuse the data in their own work.
However, simply publishing lots of separate CSV files may not meet all user needs. It can make it harder to compare different datasets or to build tools that process the data automatically.
To address this, we recently completed a short discovery which looked at how datasets are published across government, focusing on CSV datasets. We spoke to colleagues in the Office for National Statistics (ONS), Department for Work and Pensions and the Government Digital Service. We also spoke to the Open Data Institute, and spent some time exploring different data platforms (like CKAN and Swirrl) and publishers (like data.gov.uk and London DataStore).
Current data landscape
Our research suggested that open data is currently in the ‘trough of disillusionment’, a phase of the technology hype cycle that usually follows a peak of expectation and early adoption.
It also suggested that government open data platforms are not keeping pace with that expectation. One interviewee felt that, historically, the focus has been on the quantity of datasets rather than their quality. Publishing more and more datasets can make it hard to find the ones that are useful to you.
Work is needed to make large numbers of datasets consistent and to keep them up to date. Doing so supports the Government Transformation Strategy, which commits to making “better use of data”.
This has already begun. For instance, the team at data.gov.uk are publishing guidance to help data publishers and the ONS are exploring how to change their data publishing approach - from publishing spreadsheets to publishing data. Others have written about wanting to start a data revolution in government.
Moving to consistent formats
When assessing the quality of published open data, it’s clear that a variety of formats are used, and in different ways. The statistics section of GOV.UK contains a mixture of spreadsheets in ODS, Excel or CSV format – and sometimes the data is only available embedded within a PDF file.
Whilst CSV files are often assumed to be the easiest to use and most interoperable format, we realised that they can be problematic too. In a 2017 research paper, the Centrum Wiskunde & Informatica in the Netherlands highlighted some of the issues caused by things like multiple header rows, different character encodings and blank rows.
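To make these issues concrete, here is a minimal Python sketch of the kind of check a reuser might run before trusting a CSV file. It is not taken from the research paper: the function name `find_csv_problems` and the choice of UTF-8 as the expected encoding are our own assumptions for illustration. It flags blank rows and rows whose field count differs from the first non-blank row, which is often a sign of extra header or footnote rows.

```python
import csv
import io


def find_csv_problems(raw: bytes, encoding: str = "utf-8") -> list[str]:
    """Describe some common problems in a CSV file (hypothetical helper)."""
    # Assumption: the publisher intended this encoding. A decode failure
    # suggests the file was saved in something else, such as Latin-1.
    try:
        text = raw.decode(encoding)
    except UnicodeDecodeError:
        return [f"file is not valid {encoding}"]

    problems = []
    expected_width = None
    reader = csv.reader(io.StringIO(text, newline=""))
    for record_number, row in enumerate(reader, start=1):
        # Treat a row as blank if it is empty or every cell is whitespace.
        if not any(cell.strip() for cell in row):
            problems.append(f"row {record_number}: blank row")
            continue
        # Use the first non-blank row as the reference width.
        if expected_width is None:
            expected_width = len(row)
        elif len(row) != expected_width:
            problems.append(
                f"row {record_number}: {len(row)} fields, "
                f"expected {expected_width}"
            )
    return problems
```

Run against the bytes of a downloaded file, the function returns an empty list when none of these problems are found. It cannot tell whether a first row is really a header, so multiple header rows only show up when their field counts differ.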
There is a proposal for government to adopt the standard RFC 4180, which aims to solve some of these issues by defining the CSV format more precisely. It has not yet been formally adopted, but seems to be in widespread use, and tools such as CSVLint are available to check that files meet this format.
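As an illustration of what "more specific" means in practice, RFC 4180 says that records are separated by a CRLF line break, and that a field containing line breaks, commas or double quotes should be enclosed in double quotes. The Python sketch below checks just those two points. It is a simplified example of our own, not how CSVLint works, and the function name `check_rfc4180_line_breaks` is hypothetical.

```python
def check_rfc4180_line_breaks(text: str) -> list[str]:
    """Check record separators and quote balance (illustrative sketch only)."""
    issues = []
    in_quotes = False
    record = 1
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == '"':
            # An escaped quote ("") toggles twice, so the state stays correct.
            in_quotes = not in_quotes
        elif not in_quotes and ch in "\r\n":
            if text[i:i + 2] == "\r\n":
                i += 1  # skip the LF of a valid CRLF pair
            else:
                name = "LF" if ch == "\n" else "CR"
                issues.append(
                    f"record {record} ends with a bare {name}, not CRLF"
                )
            record += 1
        i += 1
    if in_quotes:
        issues.append("file ends inside an unclosed quoted field")
    return issues
```

When reading a file to pass to this function, open it with `newline=""` so that Python does not translate the line endings before they can be checked.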
On their own though, CSV files do not do much to describe the data within them. This is often the job of the web pages which link to the files, but there are also proposals for formats to make this metadata ‘machine-readable’. There are at least 2 different, incompatible standards for this: CSV on the Web and Frictionless Data. Both have some tools available and some examples of use, but it feels too early to say which is most likely to have long-term value.
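To show why machine-readable metadata matters, the sketch below checks a CSV file against a simple description of its columns. The description is a plain Python dictionary that we have made up for illustration. It does not follow the syntax of either CSV on the Web or Frictionless Data, and the names `COLUMN_TYPES` and `check_against_metadata` are hypothetical.

```python
import csv
import io

# Hypothetical column descriptions; not the syntax of either standard.
COLUMN_TYPES = {"area": str, "year": int, "value": float}


def check_against_metadata(csv_text: str, column_types: dict) -> list[str]:
    """Report missing columns and cells that do not match the declared type."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    header = reader.fieldnames or []
    missing = [name for name in column_types if name not in header]
    if missing:
        return [f"missing column: {name}" for name in missing]

    errors = []
    # The header is record 1, so data records are numbered from 2.
    for record_number, row in enumerate(reader, start=2):
        for name, convert in column_types.items():
            try:
                convert(row[name])
            except (TypeError, ValueError):
                errors.append(
                    f"record {record_number}, column {name!r}: "
                    f"{row[name]!r} is not {convert.__name__}"
                )
    return errors
```

With a description like this published alongside each file, a tool could check a dataset automatically rather than relying on someone reading the accompanying web page.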
We’re interested in how others across government are publishing their data, and which flavours of CSV they have chosen and why. If you have any thoughts, please leave a comment below.