Workshop | Holistic machine learning: From finding your data to deploying your model

Mark Whitehorn

This is a workshop for people who are about to start using ML and need to both understand and see the entire process from start to finish.

Machine Learning (ML) is not an isolated, single step; it is a process and all of the steps in that process are vital. In brief the steps are often:

Select the data
Understand it
Clean it
Repeat
    Select an ML algorithm
    Prepare the data
    Repeat (with model)
        Train
        Test
        Evaluate
    Until best model for this algorithm found
Until best overall model found
Possibly combine models into an ensemble
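The nested repeat/until structure of these steps can be sketched in a few lines of Python. This is only an illustrative sketch, not the workshop's own code: the toy data, the two stand-in "algorithms" (`fit_stump`, `fit_knn`), their candidate settings and the `accuracy` helper are all assumptions made up for the example, and the data steps (select, understand, clean) are assumed to be already done.

```python
# Toy data, invented for illustration: (feature, label) pairs, already cleaned.
train = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (8.0, 1)]
held_out = [(2.5, 0), (4.0, 0), (5.5, 1), (7.5, 1)]


def fit_stump(data, threshold):
    """Hypothetical algorithm 1: predict 1 when x exceeds the given threshold.

    The threshold is supplied as the setting; nothing is learned from `data`.
    """
    return lambda x: 1 if x > threshold else 0


def fit_knn(data, k):
    """Hypothetical algorithm 2: majority label among the k nearest points."""
    def predict(x):
        nearest = sorted(data, key=lambda p: abs(p[0] - x))[:k]
        votes = sum(label for _, label in nearest)
        return 1 if 2 * votes > k else 0
    return predict


def accuracy(model, data):
    """Fraction of examples the model labels correctly."""
    return sum(model(x) == y for x, y in data) / len(data)


# Each algorithm is paired with the settings to try for it (assumed values).
algorithms = {
    "stump": (fit_stump, [2.0, 4.5, 7.0]),
    "knn": (fit_knn, [1, 3]),
}

best_overall = None  # (score, algorithm name, setting)
for name, (fit, settings) in algorithms.items():   # Repeat: select an algorithm
    best_for_algorithm = None
    for setting in settings:                       # Repeat (with model)
        model = fit(train, setting)                # Train
        score = accuracy(model, held_out)          # Test and evaluate
        if best_for_algorithm is None or score > best_for_algorithm[0]:
            best_for_algorithm = (score, name, setting)
    # Until best model for this algorithm found
    if best_overall is None or best_for_algorithm[0] > best_overall[0]:
        best_overall = best_for_algorithm
# Until best overall model found

print(best_overall)
```

Ties keep the first candidate found. Note that choosing a model on the same held-out data that is later used to report its quality gives an optimistic estimate, so in practice a further, untouched split is usually kept back for the final check.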

 

Some of these steps sound really exciting (training the model) and some painfully boring (preparing data). But all are equally important; neglect any and you are wasting your time.

This workshop will start with some data. We will discuss it in order to understand it and then work through the entire process. For most of the workshop this will be much more discussion and demonstration than hands-on work. During the last two hours, once you have seen the entire process, you will be given all the code and data so that you can practice any (or all) of the steps that you have seen. But the main idea of the workshop is to walk you through the process and to show you WHY all these steps are necessary and how you can complete them.

Of course, we intend to make it highly interactive. For example, we won’t just explain why data has to be prepared for certain algorithms; we will show you examples. Then we will show you more and get you to suggest changes. As another example, we often use ROC curves in the evaluation stage. So we will first explain what ROC curves are, show you how to create them and then apply them to our sample models.
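As a rough idea of what creating a ROC curve involves, the sketch below builds one by hand: it ranks examples by model score, lowers the decision threshold one step at a time, and records the false positive rate and true positive rate at each step. The function names (`roc_points`, `auc`) and the six labels and scores are invented for illustration and are not taken from the workshop material.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs for a ROC curve.

    Assumes binary labels (0 or 1) with at least one example of each class.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lower the threshold one example at a time, highest score first.
    ranked = sorted(zip(scores, labels), reverse=True)
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after all examples tied at this score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve, computed with the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Made-up true labels and model scores for six examples.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(labels, scores)
print(curve)
print(auc(curve))
```

A curve that hugs the top-left corner (area close to 1) means the model ranks positives above negatives almost all of the time; an area near 0.5 is no better than guessing.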

So, this workshop is decidedly not suitable for people who are already doing ML well and are happy with the results. However, we hope it will be incredibly useful for people who are about to start using ML and need to both understand and see the entire process from start to finish.

Since this workshop is essentially about good practice, it is largely tool-agnostic, but the demonstrations will be run in both R and Python. We will also demonstrate the use of decision trees, clustering and neural nets.

Date: October 17, 2018

Kate Kilgour, University of Dundee
Prof. Mark Whitehorn