Machine Learning with PySpark

Kaya Kupferschmidt

Apache Spark has established itself as a standard in the big data ecosystem. Thanks to its growing collection of machine learning methods, it is well suited to building valid models even from very large data sets. Moreover, the integration of Python, a modern and universal scripting language, has helped lower the barrier to entry for (soon-to-be) data scientists.

The workshop is a crash course in both Spark and machine learning. Through examples and exercises, participants gain a foundation of hands-on experience they can quickly build upon afterwards.

The course uses resources in the Amazon cloud (AWS), which ensures that participants have sufficient computing capacity at their disposal regardless of their own hardware. Examples and exercises are supplied as Jupyter notebooks, so most of the work takes place in the browser.


Agenda

Spark introduction

Spark DataFrame API (hands-on training)

• Loading Data from S3
• Simple DataFrame Operations (Selects, …)
• SparkSQL
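
To give an impression of this part, a minimal sketch of the kind of DataFrame code used in the exercises follows; the S3 path, file format and column names are illustrative placeholders, not the actual workshop data.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as f

    spark = SparkSession.builder.appName("pyspark-ml-workshop").getOrCreate()

    # Loading data from S3 (bucket, path and schema are placeholders)
    df = spark.read.csv("s3://some-bucket/weather/2018.csv",
                        header=True, inferSchema=True)

    # Simple DataFrame operations: select, filter, aggregate
    df.select("station", "temperature") \
      .filter(f.col("temperature") > 20.0) \
      .groupBy("station") \
      .agg(f.avg("temperature").alias("avg_temperature")) \
      .show()

    # The same query expressed in SparkSQL
    df.createOrReplaceTempView("weather")
    spark.sql("""
        SELECT station, AVG(temperature) AS avg_temperature
        FROM weather
        WHERE temperature > 20.0
        GROUP BY station
    """).show()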

Spark ML architecture

Simple linear regression (hands-on training)

• Feature extraction
• Train a model
• Prediction using the model
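
A rough sketch of these three steps with Spark ML could look as follows; the feature and label columns are hypothetical and build on the weather DataFrame from the sketch above.

    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    # df is assumed to be the weather DataFrame from the sketch above
    train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)

    # Feature extraction: assemble numeric columns into a single feature vector
    assembler = VectorAssembler(inputCols=["humidity", "wind_speed"],
                                outputCol="features")

    # Train a model on the assembled features
    lr = LinearRegression(featuresCol="features", labelCol="temperature")
    model = lr.fit(assembler.transform(train_df))

    # Prediction using the model
    predictions = model.transform(assembler.transform(test_df))
    predictions.select("temperature", "prediction").show()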

Building Spark ML pipelines (hands-on training)

• Building pipelines
• Training pipelines
• Prediction using pipelines
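
The same regression example, rewritten as a Spark ML pipeline, might look roughly like this (same hypothetical columns and train/test split as above):

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    # Building a pipeline: feature extraction and estimator chained as stages
    assembler = VectorAssembler(inputCols=["humidity", "wind_speed"],
                                outputCol="features")
    lr = LinearRegression(featuresCol="features", labelCol="temperature")
    pipeline = Pipeline(stages=[assembler, lr])

    # Training the pipeline fits all stages in a single step
    pipeline_model = pipeline.fit(train_df)

    # Prediction: the fitted pipeline applies every stage automatically
    predictions = pipeline_model.transform(test_df)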

Building a simple Sentiment Classifier (hands-on training)

• Bag of words model
• Feature extraction
• Training a classifier
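
A possible skeleton for such a classifier is sketched below; the DataFrame and its "review" and "sentiment" columns are assumptions, not the actual workshop data.

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, CountVectorizer
    from pyspark.ml.classification import LogisticRegression

    # reviews_df is assumed to contain a text column "review"
    # and a numeric label column "sentiment"

    # Bag-of-words model: split the text into tokens and count term occurrences
    tokenizer = Tokenizer(inputCol="review", outputCol="words")
    vectorizer = CountVectorizer(inputCol="words", outputCol="features")

    # Training a classifier on the extracted features
    classifier = LogisticRegression(featuresCol="features", labelCol="sentiment")
    pipeline = Pipeline(stages=[tokenizer, vectorizer, classifier])
    sentiment_model = pipeline.fit(reviews_df)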

Improving the sentiment classifier (hands-on training)

• Removing stop-words
• TF-IDF model
• Model evaluation
• Parameter selection / cross-validation
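
These improvements could be combined roughly as follows; the parameter values, column names and choice of evaluator are illustrative assumptions.

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    # Removing stop words and building a TF-IDF representation of the text
    tokenizer = Tokenizer(inputCol="review", outputCol="words")
    remover = StopWordsRemover(inputCol="words", outputCol="filtered")
    tf = HashingTF(inputCol="filtered", outputCol="raw_features")
    idf = IDF(inputCol="raw_features", outputCol="features")
    classifier = LogisticRegression(featuresCol="features", labelCol="sentiment")
    pipeline = Pipeline(stages=[tokenizer, remover, tf, idf, classifier])

    # Model evaluation via the area under the ROC curve
    evaluator = BinaryClassificationEvaluator(labelCol="sentiment")

    # Parameter selection: grid search combined with cross-validation
    grid = ParamGridBuilder() \
        .addGrid(tf.numFeatures, [1000, 10000]) \
        .addGrid(classifier.regParam, [0.01, 0.1]) \
        .build()
    cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                        evaluator=evaluator, numFolds=3)

    # train_reviews_df / test_reviews_df are assumed splits of the review data
    cv_model = cv.fit(train_reviews_df)
    print(evaluator.evaluate(cv_model.transform(test_reviews_df)))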

Date: October 17, 2018
Speaker: Kaya Kupferschmidt, dimajix