We’ve written a lot recently about how we’re building the GOV.UK of the future. This means making things easier for users, for example by exploring voice assistants and building step by step navigation.
In a recent podcast, outgoing GOV.UK head Neil Williams talked about these things as evidence the site is ‘iterating wildly’ again.
None of this exciting new work would be possible if we had not first sorted out the fundamentals, like organising all the content on GOV.UK. This was a huge task and doing it manually could have taken years.
So we - the data scientists at the Government Digital Service - tried to help, by using supervised machine learning to tag all our GOV.UK content.
Why we used supervised machine learning
Supervised machine learning is an approach in which an algorithm learns patterns from a sample of data that has already been classified, so it can use those patterns to classify new content. We realised we could use it to build on the manual work that was being done to set up a GOV.UK taxonomy.
A year or so earlier, the Finding Things team started developing a new subject-matter taxonomy for GOV.UK. The aim was to set up 19 large themes, working with departments and agencies to get everything organised.
The team started with the Education topic and built a tree-structure taxonomy, with a number of ‘sub-branches’ or sub-topics. Things like ‘Education of disadvantaged children’ and ‘Funding and finance for students’.
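As an illustration only, a tree-structure taxonomy like this could be represented as nodes that each hold a list of child nodes. The sketch below is not the GOV.UK implementation: the `TaxonNode` class and its methods are hypothetical names, and placing both example sub-topics directly under Education is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TaxonNode:
    """One node in a hypothetical taxonomy tree (a theme or a sub-branch)."""

    title: str
    children: list[TaxonNode] = field(default_factory=list)

    def add(self, title: str) -> TaxonNode:
        """Create a child node, attach it to this node, and return it."""
        child = TaxonNode(title)
        self.children.append(child)
        return child

    def paths(self, prefix: tuple[str, ...] = ()) -> list[tuple[str, ...]]:
        """Return the path from the root to this node and to every node below it."""
        here = prefix + (self.title,)
        result = [here]
        for child in self.children:
            result.extend(child.paths(here))
        return result


# Assumption: both sub-topics sit directly under the Education theme.
education = TaxonNode("Education")
education.add("Education of disadvantaged children")
education.add("Funding and finance for students")

all_paths = education.paths()
for path in all_paths:
    print(" > ".join(path))
```

Keeping the full path to each node makes it easy to see which theme a sub-branch belongs to when the tree is reorganised.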
Although this work was going well, it was taking quite a long time and was very resource and engagement heavy - both for us at GDS and for the departments we were working with.
Supervised machine learning could dramatically speed up this approach and, we hoped, help get a topic taxonomy with all existing content tagged to it in a matter of months rather than years.
What we did
As data science is a team sport, we embedded ourselves in the existing team that was building the taxonomy and worked with them to build a robust model. We used these 3 data sources:
- the taxonomy tree structure and all the sub-branch levels
- a sample of GOV.UK pages that had already been tagged and organised in sub-branches
- a lot more pages that were not tagged or organised at all
We could use the pages that had already been tagged to train an algorithm to recognise patterns in the page contents. We could then use this algorithm to predict the correct tags for the pages that had not yet been organised.
We used Natural Language Processing to make the text content on the page machine readable. We used this and the page metadata (like date published and department) to learn patterns that could be used to predict which sub-branches an untagged page should be organised in.
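To give a feel for what 'machine readable' means here, the sketch below turns page text into vectors of word counts (a bag-of-words representation). This is a deliberately simple stand-in, not necessarily the representation the model used. The function names and the two example page texts are made up for illustration.

```python
from __future__ import annotations

import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents: list[str]) -> dict[str, int]:
    """Map every distinct token to a column index, in alphabetical order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def vectorize(text: str, vocabulary: dict[str, int]) -> list[int]:
    """Count how often each vocabulary token occurs; unknown tokens are ignored."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for token, count in counts.items():
        if token in vocabulary:
            vector[vocabulary[token]] = count
    return vector


# Made-up page texts, for illustration only.
pages = [
    "Funding and finance for students",
    "Finance guidance for schools and students",
]
vocab = build_vocabulary(pages)
vectors = [vectorize(page, vocab) for page in pages]

for page, vector in zip(pages, vectors):
    print(vector, page)
```

Each page becomes a list of numbers of the same length, which is the kind of input a learning algorithm can work with. Metadata such as the publishing department could be added as extra columns.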
In some cases, a GOV.UK page is tagged to more than one sub-branch, so we implemented a multilabel model, which can assign several tags to the same page.
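A common way to set up a multilabel problem, sketched here under assumptions, is to give every sub-branch its own yes/no slot. A page's tags become a list of 0s and 1s, and the model's per-tag scores are turned back into tags with a cut-off. The function names, the 0.5 threshold and the example scores are all hypothetical.

```python
from __future__ import annotations


def encode_labels(tags: set[str], all_tags: list[str]) -> list[int]:
    """Turn a page's set of tags into one 0/1 slot per known tag."""
    return [1 if tag in tags else 0 for tag in all_tags]


def decode_scores(
    scores: list[float], all_tags: list[str], threshold: float = 0.5
) -> list[str]:
    """Keep every tag whose score reaches the threshold (zero, one or many)."""
    return [tag for tag, score in zip(all_tags, scores) if score >= threshold]


all_tags = [
    "Education of disadvantaged children",
    "Funding and finance for students",
]

# A page tagged to both sub-branches gets a 1 in both slots.
target = encode_labels(set(all_tags), all_tags)

# Made-up model scores for an untagged page: only the first tag passes 0.5.
suggested = decode_scores([0.8, 0.3], all_tags)

print(target)
print(suggested)
```

Because each tag is decided independently, a page can end up with several tags or with none at all.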
In addition, as the taxonomy tree is constantly being curated and improved by content designers, our model needed to be easily adaptable to changes.
We had about 100,000 untagged pages to tag and we aimed to tag them with around 210 sub-branches.
How the work went
To choose a supervised learning algorithm for the job, we first used TPOT. This is a Python tool that automatically explores thousands of possible combinations of pre-processing steps and algorithms to find the best pipeline for your data.
But we found a Convolutional Neural Network - a type of deep learning algorithm - outperformed all the classical algorithms that TPOT explored. So we started with a simple network architecture, but we had to do quite a bit of iteration to get it right.
For example, the first time we ran the algorithm, it took a page with the title ‘Government launches its first ever national bowel cancer campaign’ and suggested the tag ‘counter-terrorism’. So clearly we needed to tweak it a bit.
We found the algorithm was overfitting - in other words it had learned the training pages too closely and needed to forget a bit so it could generalise to pages it had not seen. So we changed the settings and fine-tuned our parameters and then tried again.
This time, we got the suggested tag National Health Service and Public Health. Success!
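One common way to spot and limit the overfitting described above is to track the loss on a held-out validation set and stop training once it stops improving (early stopping). This is not necessarily how the problem was fixed here, which involved changing settings and parameters. The function name, the `patience` value and the loss figures below are made up for illustration.

```python
from __future__ import annotations


def best_epoch(validation_losses: list[float], patience: int = 2) -> int:
    """Return the index of the epoch with the lowest validation loss seen
    before stopping. Training is treated as stopped once the loss has
    failed to improve for `patience` epochs in a row."""
    best_index = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < validation_losses[best_index]:
            best_index = epoch
            epochs_without_improvement = 0
        elif epoch > 0:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_index


# Made-up figures: training loss keeps falling, validation loss turns upward.
training_losses = [0.90, 0.60, 0.40, 0.20, 0.10]
validation_losses = [0.95, 0.70, 0.65, 0.72, 0.80]

stop_at = best_epoch(validation_losses)
print("keep the model from epoch", stop_at)
```

A training loss that keeps falling while the validation loss rises is the typical sign that a model is memorising its training data rather than learning patterns that carry over.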
Rolling the work out
We put our results in front of user researchers and content designers on GOV.UK and they were very happy. In a couple of sub-branches with few training examples, or very diverse content, the model did not perform sufficiently well. So we ignored the predictions for those sub-branches and started to roll the model out everywhere else.
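Ignoring predictions for weak sub-branches can be sketched as a two-step filter: score each tag on pages whose correct tags are known, then drop any predicted tag whose score is too low. The sketch assumes F1 as the score and 0.5 as the cut-off, neither of which is stated above. The function names, tag names and example data are hypothetical.

```python
from __future__ import annotations


def f1_per_label(
    true_tags: list[set[str]], predicted_tags: list[set[str]], all_tags: list[str]
) -> dict[str, float]:
    """Compute an F1 score for each tag over a set of held-out pages."""
    scores = {}
    pairs = list(zip(true_tags, predicted_tags))
    for tag in all_tags:
        tp = sum(1 for t, p in pairs if tag in t and tag in p)
        fp = sum(1 for t, p in pairs if tag not in t and tag in p)
        fn = sum(1 for t, p in pairs if tag in t and tag not in p)
        denominator = 2 * tp + fp + fn
        scores[tag] = 2 * tp / denominator if denominator else 0.0
    return scores


def drop_unreliable(
    predicted: set[str], scores: dict[str, float], minimum: float = 0.5
) -> set[str]:
    """Keep only predicted tags whose held-out score reaches the minimum."""
    return {tag for tag in predicted if scores.get(tag, 0.0) >= minimum}


# Made-up held-out pages: the model is good at branch-a and poor at branch-b.
all_tags = ["branch-a", "branch-b"]
true_tags = [{"branch-a"}, {"branch-a", "branch-b"}, {"branch-b"}, {"branch-a"}]
predicted_tags = [{"branch-a"}, {"branch-a"}, {"branch-a"}, {"branch-a", "branch-b"}]

scores = f1_per_label(true_tags, predicted_tags, all_tags)
kept = drop_unreliable({"branch-a", "branch-b"}, scores)
print(scores)
print(kept)
```

Filtering per tag means a handful of difficult sub-branches can be left for manual tagging without holding back the rest.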
Using machine learning we have managed to tag 96% of the untagged pages on GOV.UK - that’s roughly 100,000 pages. And we are now able to predict a sub-branch for new content coming in.
Unlocking more work
We now have a draft taxonomy of everything and nearly all the content on GOV.UK is organised within it. The full taxonomy has been released to publishers to use and at GDS we are putting governance in place to allow for constant improvement and iteration.
We are also running experiments on how auto suggestions might work in the publishing applications. Our ambition is that the whole site can be powered by a reliable and continually improving taxonomy - and we can suggest tags for new content to publishers as it is created.
Having this fundamental structure in place and the ability to evolve it has unlocked all of the other work we’re doing on GOV.UK - such as step by step journeys and voice activation. And by using machine learning we have made this process quick and easy both for us and the publishers across government we work with.