News:  Data Science (10/07/2017)

Data Science

IT industry marketing buzzwords come and go, sometimes with little actual impact. One such buzzword is ‘data science’. What is it? Here’s a definition for starters: data science is the exploration and analysis of structured and unstructured data to develop understanding, extract knowledge and formulate actionable results.

Hmm. That sounds like my Ph.D., in which I applied Artificial Neural Networks (ANNs) to a medical regression problem and spent half my time investigating the data and how best to represent it to the ANN. In fact, it does fall under this definition, and it is a fairly typical example of the problem domain and approach of a data science project, i.e. via machine learning techniques. I should also add at this point that black-box machine learning techniques can be considered equivalent to a variety of statistical techniques. There is nothing magic going on, but they do provide an alternative path to generating results, one that requires less mathematical/statistical expertise from the analyst.

Given this, could it be said that ‘data scientist’ = ‘statistician’? Yes, it could, but the detail is a little more nuanced. Other phrases used in this space are ‘data mining’, ‘knowledge discovery’ and ‘big data’. To a large extent these are just different buzzwords that have been used at different times, but they share the same underlying processes.

Here are the steps of the Cross Industry Standard Process for Data Mining (CRISP-DM) from 2000, for example:
  1. Business understanding – identify project objective
  2. Data understanding – collect and review data
  3. Data preparation – select and cleanse data
  4. Modeling – manipulate data and draw conclusions
  5. Evaluation – evaluate model and conclusions
  6. Deployment – apply conclusions to business
Note that the process is iterative rather than linear. For example, model evaluation feeds back into the understanding of the data, which may in turn affect data preparation and lead to refinement of the model.

As you may have surmised, a very similar process applies to today’s Data Scientists.

What kinds of problems are common targets? Very briefly: typically these involve predictive modelling using machine learning or statistical modelling and, again typically, they fall into one of the following types:
  • Yes/no classification, e.g. is this an animal?
  • Value estimation (regression), a typical supervised learning problem, e.g. what will the weight of an unborn baby be?
  • Grouping of observations (clustering), a typical unsupervised learning problem
  • Recommender systems, e.g. recommending a film based on previous viewing habits
Key to evaluating the model in a supervised learning system is testing its performance against a test data set that is distinct from the training data set. You must also be careful not to overfit the model by overtraining against the training data.
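To make the train/test idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions rather than a recipe: the data is a made-up toy regression problem (y = 2x + 1 plus noise), and the helper names (make_data, train_test_split, fit_linear, fit_nearest_neighbour, mse) are hypothetical ones written for this example, not taken from any library. It compares a simple straight-line fit with a model that just memorises the training data, which is about as overfitted as a model can get.

```python
import random


def make_data(n, rng):
    """Toy data (an assumption for illustration): y = 2x + 1 plus Gaussian noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + 1 + rng.gauss(0, 2)) for x in xs]


def train_test_split(data, test_fraction, rng):
    """Shuffle a copy of the data and hold back a fraction as the test set."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_linear(train):
    """Ordinary least-squares straight line, fitted on the training data only."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_nearest_neighbour(train):
    """Memorise the training set: predict the y of the closest training x."""
    return lambda x: min(train, key=lambda point: abs(point[0] - x))[1]


def mse(model, data):
    """Mean squared error of a model's predictions over a data set."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


rng = random.Random(0)  # fixed seed so the run is repeatable
data = make_data(200, rng)
train, test = train_test_split(data, 0.25, rng)

linear = fit_linear(train)
memoriser = fit_nearest_neighbour(train)

for name, model in [("linear", linear), ("memoriser", memoriser)]:
    print(f"{name}: train MSE = {mse(model, train):.2f}, "
          f"test MSE = {mse(model, test):.2f}")
```

The memoriser reproduces every training point exactly, so its training error is zero, yet that tells you nothing about how it will behave on new observations. Only the error on the held-back test set does, which is why the two data sets must be kept distinct.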
Do you have a data set in need of analysis? Drop us an email!


Data Science and Machine Learning Essentials  
Advanced | Published: 02 November 2015
Instructor(s): Steve Elston and Cynthia Rudin