On 19 February 2015, GDS hosted the 4th Data Science Show & Tell. The topics of discussion ranged from galaxy classification to data journalism.
David Bonfield, an ex-astronomer and now a data scientist at BIS, demonstrated a technique called SMOTE (Synthetic Minority Oversampling Technique). SMOTE is particularly useful when one category is under-represented in a dataset. When categories are imbalanced, a classifier tends to favour the majority category and struggles to predict the under-represented one accurately. To compensate, SMOTE generates synthetic data points for the minority category, so that the categories are balanced before the classifier is trained. This method has been successfully applied in the field of galaxy classification.
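To make the idea concrete, here is a minimal sketch of SMOTE-style oversampling using only the Python standard library. It is not the code David showed: the function name `smote`, its parameters and the sample points are all made up for illustration. Each synthetic point is placed at a random position on the line between a minority sample and one of its nearest minority neighbours.

```python
import math
import random


def smote(minority, n_synthetic, k=3, seed=0):
    """Return n_synthetic new points interpolated between minority samples.

    Illustrative sketch only: `minority` is a list of equal-length numeric
    tuples, and `k` is the number of nearest minority neighbours considered.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        # Pick a minority sample at random.
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        # Rank the other minority samples by Euclidean distance to it.
        others = [p for i, p in enumerate(minority) if i != base_idx]
        others.sort(key=lambda p: math.dist(base, p))
        # Choose one of the k nearest neighbours and interpolate towards it.
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # fraction of the way from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical data: 4 minority points to be balanced against 10 majority points.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
majority_count = 10
new_points = smote(minority, majority_count - len(minority))
balanced_minority = minority + new_points
```

Because every synthetic point lies between two real minority points, the new data stays inside the region the minority category already occupies. Only the training data should be oversampled in this way; the test data should be left untouched.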
Matthew Hodgkiss, from HSL, neatly segued the discussion from improving classifier accuracy to improving the efficiency of databases. Relational databases, which often form the backbone of big data analytics, can become inefficient as the size and complexity of the data grow and more joins between datasets are needed. To remedy this problem, graph databases have been proposed as an efficient way of querying large amounts of data with complex relationships. HSL has successfully used a graph database to run its FIND-IT tool, which allows the user to easily query a rich set of information about any business establishment in the U.K.
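The difference between joining and traversing can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how FIND-IT or any real graph database is implemented: the business names, the `links` table and both function names are hypothetical. The first function mimics repeated self-joins by rescanning the whole table of links for every hop; the second builds an adjacency list once and then only follows the links of the nodes it actually visits.

```python
from collections import defaultdict, deque

# Hypothetical "links" table: (source business, related business).
links = [
    ("A Ltd", "B Ltd"),
    ("B Ltd", "C Ltd"),
    ("C Ltd", "D Ltd"),
    ("A Ltd", "E Ltd"),
]


def reachable_by_joins(links, start, hops):
    """Relational style: one full scan of the table per hop (a self-join)."""
    seen = {start}
    frontier = {start}
    for _ in range(hops):
        frontier = {dst for src, dst in links if src in frontier} - seen
        seen |= frontier
    return seen - {start}


def build_graph(links):
    """Graph style: store each node's outgoing links next to the node."""
    graph = defaultdict(list)
    for src, dst in links:
        graph[src].append(dst)
    return graph


def reachable_by_traversal(graph, start, hops):
    """Breadth-first traversal that only touches the visited nodes' links."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == hops:
            continue  # do not expand beyond the requested number of hops
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return seen - {start}


graph = build_graph(links)
within_two_joins = reachable_by_joins(links, "A Ltd", 2)
within_two_traversal = reachable_by_traversal(graph, "A Ltd", 2)
```

Both approaches return the same set of businesses, but the cost of the join version grows with the size of the whole table at every hop, whereas the traversal's cost depends only on the part of the graph it explores.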
We were also fortunate to have Ransome Mpini and John Walton from the Digital Journalism team at the BBC. In their master class on responsible data journalism, Ransome and John took us through their process of building popular calculators such as “Which sport are you made for?” and “Care in the UK”.
It was a pleasure to hear about the effort the team put into liaising with subject matter experts to ensure that the results were meaningful for each individual user. The BBC team's philosophy has been “testing, testing, testing”. I am sure this is one mantra that will resonate with all data scientists.
There were further discussions on sentiment analysis, which will be the focus of part two of our blog on the 4th Data Science Show & Tell.