GDS has been looking at how government can save money in business processes through the use of data standards and the interoperability they bring.
Through improving how government processes talk to each other, we can deliver users simple, seamless and connected services. We can also minimise the amount of data shared to achieve a specific purpose.
As government, we can improve productivity by reusing non-personal data that is already available, ensuring our infrastructure and services contain consistent information, and reducing unnecessary demands for data storage.
For those who work in government to analyse data sets, we can improve efficiencies and encourage more data analysis to take place, making us more aware of the impact of new policies or strategies.
Business process relies heavily on spreadsheets
People involved in the business processes put data into spreadsheets and pass them around between each other. One government department told us that their people managed data using spreadsheets around 60% of the time.
On GOV.UK alone, we found 31,121 files with spreadsheet extensions (like .ods, .xlsx or .xls) and 19,251 CSV files, many of which were not in the suggested tidy, tabular data format.
The GDS Registers team previously found users do not consider the things they put into spreadsheets to be data and we found that they treat spreadsheets just like any other office productivity tool.
Spreadsheets give their users autonomy and agency
Spreadsheets seem to give users a sense of control and allow them the flexibility they need when executing a predefined business process. Affordances such as whitespace and formatting make them easy to use.
We have recognised that people are an essential part of the business process rather than just being the owners and implementers of it. People adjust how they work and are continuously making the process safer. For example, they are able to spot potential fraud or mistakes and know how to take steps to correct things. As found by Richard Cook, the human role in the process is often as producers of information and defenders against poor data management and analysis.
Efforts to replace spreadsheets frequently fail
With this in mind, it becomes easier to understand why efforts to replace humans and spreadsheets with a more rigidly defined piece of software frequently fail.
There are often cases where people receive a digital service from their developers and all goes well for a while. Eventually they would like to make a small change to the system. They hear that this change will cost some money or perhaps take a small number of days to implement. In the meantime they go back to their spreadsheets. Because they're not in direct control of their own tools they gradually revert back to their spreadsheet-based workflows.
Therefore, not only is the most valuable government data currently locked up in spreadsheets, it looks like it will remain so going forward.
Extracting structured data from spreadsheets
Since spreadsheets are probably here to stay, We've been looking at how to extract structured data from them.
Spreadsheet programs such as OpenOffice and Microsoft Excel have a ready-made, powerful user interface and a rich data model. Users can annotate cells with information type, such as whether something is a currency and how many decimal places of accuracy they require. But when this data is exported as a CSV, much of this information is lost.
Therefore, we looked at how to extract the data directly from the original source spreadsheet files stored in open standards compliant formats.
We built a tool that takes a simple but messy spreadsheet, extracts tabular data from it, validates the correctness of each cell and then outputs the data as a JSON file that can be easily consumed by the rest of the extraction and analysis process. Once the data has been extracted and validated it could also be output as a CSV containing tabular data, along with a CSV on the web metadata file that describes it.
This allows humans to stay in control of the process while exploiting the value in data.
Extracted data becomes more useful
Data that has been extracted, becomes interoperable and more useful, and can more easily be joined and linked together with data from other sources.
Without interoperability, analysis can be limited. Metcalfe's law tells us that the value of an information network grows with the square of the number of datasets connected to it. By making a spreadsheet interoperable the data becomes more valuable and opens up opportunities for multidimensional analysis and reuse.
Defining a simple metadata language
We have defined a simple metadata language that is used to describe the data and how it can be extracted from a spreadsheet. The language itself is structured so it can be written into the first page of a spreadsheet.
Metadata files written in this language are easy to write - they can be written either by the original author or the following users of the data. They can be shared, which makes their adoption easy and keeps the barriers to entry low.
We also did some mockups of a simple service that would allow a user to upload a spreadsheet and then describe the data within it, see figure below.
Our next steps
The GDS Data Standards team is currently talking to departments about their use of spreadsheets and how consistency in extracting information can help improve interoperability.
Please get in touch if you would like to share your thoughts!