FOSSY 2024 | Presentation: Getting ML Right in a Complex Data World

Presented by

Oz Katz
@ozkatz100
https://lakefs.io

Oz Katz is the co-creator of lakeFS, an open source platform that delivers resilience and manageability to object-storage-based data lakes, and the CTO and co-founder of Treeverse, the company behind lakeFS. Oz engineered and maintained petabyte-scale data infrastructure at analytics giant SimilarWeb, which he joined after the acquisition of Swayy.

Abstract

Machine learning workflows are iterative: arriving at an accurate model means cycling repeatedly through data labeling, data cleaning, preprocessing, and feature selection during model training. Quality ML at scale is only possible when we can reproduce a specific iteration of an ML experiment, and this is where data is key. Capturing the version of the training data, ML code, and model artifacts at each iteration is therefore mandatory. Versioning experiments efficiently, without duplicating code, data, and models, requires data versioning tools. Open source tools like lakeFS make it possible to version all components of ML experiments without keeping multiple copies, which also saves storage costs. In this talk, you will learn how to use a data versioning engine to version your ML experiments intuitively and easily, and to reproduce any specific iteration of an experiment.

The talk will walk through a live code example:

• Creating a basic ML experimentation framework with lakeFS (in a Jupyter notebook)
• Reproducing ML components from a specific iteration of an experiment
• Building intuitive, zero-maintenance experiment infrastructure

All with common OSS data engineering stacks and open source tooling.