Data science combines statistics, programming, machine learning, automatic processing of unstructured data (including text mining) and visualisation. This can seem bewildering, and many people ask where to get started if they want to learn. This blog post lists free resources that the GDS Data Science team like (there are also many other good resources online).
Introducing Data Science
For a general introduction to data science, we recommend Wikipedia, the Data Scientist’s Toolbox and Data Science Central. Drew Conway's diagram is an introduction in itself...
Choose a project
This is perhaps the hardest part, but it’s much easier to learn if you have a real problem to wrestle with. Try to find something where the data is easily or publicly available and where decisions based on your analysis could improve your product or save money. Examples of projects we’ve already done are on our prototypes page. You may also want to look at the projects on Kaggle.
Setting up your computer to do Data Science
We have found that Python and R are the most useful programming languages for data science. We recommend the IPython notebook and RStudio as good applications for writing code. A good text editor is also extremely helpful (we recommend TextWrangler).
If your work firewall blocks you from downloading these tools, you may want to ask your employer to sponsor a standalone laptop to develop your data science skills (this may require a business case and should not be used for confidential data). If your employer is unwilling to do this, there’s nothing to stop you downloading the tools on your own computer (providing that you’re only using publicly available data and your project doesn’t raise ethical concerns).
Programming
Hadley Wickham is probably the world’s leading R educator, and his excellent Advanced R is a brilliantly clear online textbook. There is also a good Codecademy Python course.
Machine Learning
Data scientists often use a range of modern statistical techniques called machine learning. It’s extremely important for you to understand the principles underlying these methods; otherwise you may produce work that looks impressive but isn’t statistically valid. We recommend this free course from Andrew Ng, one of the world’s top machine learning experts.
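To make the point about statistical validity concrete, here is a minimal sketch (not taken from the course above) that uses only the Python standard library. It fits a 1-nearest-neighbour classifier to pure noise: the features and labels are random and unrelated, so there is nothing to learn. The function names and the sizes of the made-up datasets are our own assumptions for illustration.

```python
import random


def nearest_neighbour_predict(train_points, train_labels, point):
    """Predict the label of the closest training point (1-nearest neighbour)."""
    distances = [
        sum((a - b) ** 2 for a, b in zip(p, point)) for p in train_points
    ]
    return train_labels[distances.index(min(distances))]


def accuracy(train_points, train_labels, points, labels):
    """Fraction of `points` whose predicted label matches `labels`."""
    hits = sum(
        nearest_neighbour_predict(train_points, train_labels, p) == y
        for p, y in zip(points, labels)
    )
    return hits / len(points)


def noise_experiment(n_train=200, n_test=200, n_features=5, seed=0):
    """Train on random data, then score on the training set and on fresh data."""
    rng = random.Random(seed)

    def make(n):
        # Illustrative data only: labels are coin flips, unrelated to features.
        points = [[rng.random() for _ in range(n_features)] for _ in range(n)]
        labels = [rng.randint(0, 1) for _ in range(n)]
        return points, labels

    train_x, train_y = make(n_train)
    test_x, test_y = make(n_test)
    train_acc = accuracy(train_x, train_y, train_x, train_y)
    test_acc = accuracy(train_x, train_y, test_x, test_y)
    return train_acc, test_acc


if __name__ == "__main__":
    train_acc, test_acc = noise_experiment()
    # Each training point is its own nearest neighbour, so this is 1.0.
    print("accuracy on training data:", train_acc)
    # On unseen data the model can only guess, so expect roughly 0.5.
    print("accuracy on held-out data:", test_acc)
```

The score on the training data looks perfect even though the model has learned nothing, which is why results should be checked on data the model has not seen.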
Visualise your results
Your first project will have more impact if you can visualise your results on the internet. Perhaps the simplest way is to host it on Google Drive (see this guide); for more advanced work, you can deploy your visualisation to Heroku using a framework like Python Flask.
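As a rough sketch of the simple end of that range, the snippet below builds a standalone HTML page containing a bar chart drawn as inline SVG, using only the Python standard library. The function name `bar_chart_html`, the layout constants and the example figures are all hypothetical; the resulting page is a single static file, and the same string could equally be returned from a view in a web framework such as Flask.

```python
from html import escape


def bar_chart_html(title, counts, bar_height=24, max_width=400):
    """Return a standalone HTML page with an inline SVG horizontal bar chart.

    `counts` maps a label to a non-negative number.
    """
    label_width = 120  # horizontal space reserved for the text labels
    gap = 6            # vertical gap between bars
    largest = max(counts.values(), default=0) or 1  # avoid dividing by zero

    rows = []
    for i, (label, value) in enumerate(counts.items()):
        y = i * (bar_height + gap)
        width = max_width * value / largest  # scale bars to the largest value
        rows.append(
            f'<text x="0" y="{y + bar_height * 0.7:.0f}">{escape(label)}</text>'
            f'<rect x="{label_width}" y="{y}" width="{width:.0f}" '
            f'height="{bar_height}" fill="steelblue"/>'
        )

    body = "".join(rows)
    svg_height = len(counts) * (bar_height + gap)
    svg = (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{label_width + max_width}" height="{svg_height}">{body}</svg>'
    )
    return (
        "<!DOCTYPE html><html><head><meta charset=\"utf-8\">"
        f"<title>{escape(title)}</title></head>"
        f"<body><h1>{escape(title)}</h1>{svg}</body></html>"
    )


if __name__ == "__main__":
    # Made-up numbers, purely to show the output format.
    example = {"Monday": 12, "Tuesday": 30, "Wednesday": 18}
    print(bar_chart_html("Example visits per day", example))
```

Redirecting the printed output to a file (for example `index.html`) gives a page that opens in any browser without further dependencies.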
Show the Thing!
Once you’ve done your project, demonstrate it as much as possible. That way you will influence decision makers and get colleagues interested in data science. Why not come and show us what you’ve done at our Data Science Show & Tells, or add your own favourite data science resources in the comments below?