In the Apache Spark 2.x releases, Machine Learning (ML) is focusing on DataFrame-based APIs. This webinar is aimed at helping users take full advantage of the new APIs. Topics will include migrating workloads from RDDs to DataFrames, ML persistence for saving and loading models, and the roadmap ahead.

Migrating ML workloads to use Spark DataFrames and Datasets allows users to benefit from simpler APIs, plus speed and scalability improvements.  As the DataFrame/Dataset API becomes the primary API for data in Spark, this migration will become increasingly important to MLlib users, especially for integrating ML with the rest of Spark data processing workloads.  We will give a tutorial covering best practices and some of the immediate and future benefits to expect.

ML persistence is one of the biggest improvements in the DataFrame-based API.  With Spark 2.0, almost all ML algorithms can be saved and loaded, even across languages. ML persistence dramatically simplifies collaborating across teams and moving ML models to production. We will demonstrate how to use persistence, and we will discuss a few existing issues and workarounds.

At the end of the webinar, we will discuss major roadmap items. These include API coverage, major speed and scalability improvements to certain algorithms, and integration with structured streaming.

Presenters

Joseph Bradley

Software Engineer - Databricks

Joseph Bradley is a Software Engineer and Spark Committer working on MLlib at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon U. in 2013. His research included probabilistic graphical models, parallel sparse regression, and aggregation mechanisms for peer grading in MOOCs.

Jules S. Damji

Jules S. Damji

Spark Community Evangelist - Databricks

Jules S. Damji is a Apache Spark Community Evangelist with Databricks. He is a hands-on developer with over 15 years of experience and has worked at leading companies building large-scale distributed systems.