Ray for Practitioners
Learn how to leverage Ray with hands-on, instructor-led virtual training
Description
AI, ML, and data professionals from all disciplines will benefit from this comprehensive introduction to the components of the Anyscale and Ray platforms that directly support deploying scalable AI and data processing applications into production. You'll use Python and Ray to define and schedule AI tasks that efficiently handle complex computations and data transformations, powering analytics applications and AI-driven solutions. The course offers hands-on instruction in Anyscale's Unified AI Platform, the Ray AI Libraries, Ray Serve, and techniques for effective resource management and fault tolerance.
Audience
This course is designed for anyone seeking a holistic approach to ML/AI and data processing, including:
- Machine learning practitioners
- Platform engineers
- Software engineers
Prerequisites
- Familiarity with AI use cases.
- Familiarity with basic ML concepts and workflows.
- Familiarity with Python, JupyterLab notebooks, and VSCode.
- Ability to perform basic code development tasks in Python.
Course Outline
Part 1: Why Ray? Why Anyscale? Anyscale Overview
Module 1: Why Ray and Why Anyscale
Introduction to Anyscale and Ray
The AI Complexity Wall and How Ray Helps
Overview of the Anyscale Unified AI Platform
Introduction to Ray Turbo
Module 2: Anyscale Overview
Understanding Anyscale Workspaces and compute resources
Monitoring and debugging Ray applications
Configurations
Production Jobs
Production Services
Part 2: Ray AI Libraries Overview
Module 1: Introduction to the Ray AI Libraries
Overview of the Ray AI Libraries
Quick end-to-end example with XGBoost
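To give a flavor of what this module covers, here is a minimal sketch of distributed training with Ray's XGBoostTrainer, assuming the Ray 2.x API. The dataset path and the "target" label column mirror a public example dataset used in the Ray docs; substitute your own data.

```python
import ray
from ray.train import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

# Read a tabular dataset into a Ray Dataset and split it for validation.
dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")
train_ds, valid_ds = dataset.train_test_split(test_size=0.3)

trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
    label_column="target",
    num_boost_round=20,
    params={"objective": "binary:logistic", "eval_metric": ["logloss", "error"]},
    datasets={"train": train_ds, "valid": valid_ds},
)
result = trainer.fit()
print(result.metrics)
```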
Module 2: Introduction to Ray Train
Single GPU PyTorch
Overview of the training loop in Ray Train
Migrating the model to Ray Train
Migrating the dataset to Ray Train
Reporting metrics and checkpoints
Launching the distributed training job
Accessing training results
Ray Train in production
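The migration steps above roughly correspond to this hedged sketch of a PyTorch loop moved onto Ray Train (Ray 2.x API). The toy model and random tensors stand in for the course's real model and dataset.

```python
import torch
import ray.train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # prepare_model moves the model to the right device and wraps it in DDP.
    model = ray.train.torch.prepare_model(torch.nn.Linear(8, 2))
    # prepare_data_loader adds a DistributedSampler and handles device placement.
    data = torch.utils.data.TensorDataset(torch.randn(256, 8), torch.randint(0, 2, (256,)))
    loader = ray.train.torch.prepare_data_loader(
        torch.utils.data.DataLoader(data, batch_size=config["batch_size"]))
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(config["epochs"]):
        for features, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(features), labels)
            loss.backward()
            optimizer.step()
        # Report metrics (and optionally a checkpoint) back to Ray Train.
        ray.train.report({"epoch": epoch, "loss": loss.item()})

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"lr": 1e-3, "batch_size": 32, "epochs": 2},
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
)
result = trainer.fit()
```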
Module 3: Introduction to Ray Tune
Loading the data
Starting out with vanilla PyTorch
Hyperparameter tuning with Ray Tune
Ray Tune in production
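As an illustration of the workflow this module builds toward, here is a minimal Ray Tune sketch (Ray 2.x). The objective function is a stand-in for a real PyTorch training run; the search space and sample count are arbitrary.

```python
from ray import tune

def objective(config):
    # Stand-in for a training run: compute a score from the hyperparameters.
    score = (config["lr"] - 0.01) ** 2 + config["batch_size"] / 1000
    return {"score": score}

tuner = tune.Tuner(
    objective,
    param_space={
        "lr": tune.loguniform(1e-4, 1e-1),
        "batch_size": tune.choice([16, 32, 64]),
    },
    tune_config=tune.TuneConfig(metric="score", mode="min", num_samples=20),
)
results = tuner.fit()
print(results.get_best_result().config)
```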
Module 4: Introduction to Ray Data
When to use Ray Data
Loading Data
Transforming Data
Materializing Data
Data Operations: Grouping, Aggregation, and Shuffling
Persisting Data
Ray Data in Production
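A compact sketch of the load / transform / aggregate / persist flow covered in this module, assuming Ray 2.x. The S3 paths and column names (price, quantity, user_id) are placeholders.

```python
import ray

# Load: lazily read a Parquet dataset into a Ray Dataset.
ds = ray.data.read_parquet("s3://my-bucket/events/")  # placeholder path

# Transform: apply a vectorized function to batches of rows.
def add_revenue(batch):
    batch["revenue"] = batch["price"] * batch["quantity"]
    return batch

ds = ds.map_batches(add_revenue)

# Group and aggregate, then persist the result.
per_user = ds.groupby("user_id").sum("revenue")
per_user.write_parquet("s3://my-bucket/revenue_per_user/")

# Materializing pulls the (lazy) dataset into the Ray object store.
materialized = ds.materialize()
print(materialized.count())
```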
Module 5: Introduction to Ray Serve
Overview of Ray Serve
Implement an MNISTClassifier service
Advanced features of Ray Serve
Ray Serve in Production
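A minimal sketch of the kind of MNISTClassifier service built in this module, using Ray Serve's deployment API (Ray 2.x). The untrained linear model is a placeholder for a real trained checkpoint.

```python
from starlette.requests import Request
import torch
from ray import serve

@serve.deployment(num_replicas=2, ray_actor_options={"num_cpus": 1})
class MNISTClassifier:
    def __init__(self):
        # Placeholder model; the course would load trained weights here.
        self.model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(784, 10))

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        image = torch.tensor(payload["image"], dtype=torch.float32).reshape(1, 1, 28, 28)
        logits = self.model(image)
        return {"prediction": int(logits.argmax(dim=1).item())}

app = MNISTClassifier.bind()
# serve.run deploys the application and exposes it over HTTP (default port 8000).
serve.run(app)
```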
Module 6: Introduction to Ray Core
Ray Core overview
@ray.remote and ray.get()
Ray tasks can launch other tasks
Ray Actors
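The Ray Core primitives listed above fit in a few lines; the sketch below shows a task, a task launching other tasks, and an actor.

```python
import ray

ray.init()

# A task: a stateless function scheduled on the cluster.
@ray.remote
def square(x):
    return x * x

# Tasks can launch other tasks and collect their results with ray.get().
@ray.remote
def sum_of_squares(n):
    return sum(ray.get([square.remote(i) for i in range(n)]))

print(ray.get(sum_of_squares.remote(5)))  # 30

# An actor: a stateful worker process.
@ray.remote
class Counter:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

counter = Counter.remote()
print(ray.get([counter.increment.remote() for _ in range(3)]))  # [1, 2, 3]
```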
Part 3: Ray Train for Distributed Model Training
Module 1: Ray Train Deep Dive
Recap from the previous day
Integrating Ray Train with Ray Data
Fault tolerance in Ray Train
Integration with Lightning
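A hedged sketch of the Ray Data integration and fault-tolerance configuration discussed in this module (Ray 2.x): each worker reads its shard via ray.train.get_dataset_shard, and FailureConfig controls automatic retries. The toy dataset and the elided training step are placeholders.

```python
import ray
import ray.train
from ray.train import FailureConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # Each worker gets a streaming shard of the "train" dataset.
    shard = ray.train.get_dataset_shard("train")
    for epoch in range(2):
        for batch in shard.iter_batches(batch_size=64):
            ...  # forward/backward pass goes here
        ray.train.report({"epoch": epoch})

train_ds = ray.data.from_items([{"x": i, "y": i % 2} for i in range(1000)])

trainer = TorchTrainer(
    train_loop_per_worker,
    datasets={"train": train_ds},
    scaling_config=ScalingConfig(num_workers=2),
    # Retry the run up to 3 times if a worker fails (e.g. a spot node is preempted).
    run_config=RunConfig(failure_config=FailureConfig(max_failures=3)),
)
trainer.fit()
```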
Module 2: Ray Train Observability
Starting with a sample distributed training loop
Using the Ray dashboard
Profiling the training loop
Adding Ray Data
Module 3: Tuning Configs for Cost and Performance
Memory capacity vs speed trade-off
How to benchmark throughput
Importance of Empirical Benchmarks
Module 4: Debugging Ray Train common failures
Trainer hang
Data loading bottlenecks
Resource over-provisioning for data preprocessing
Throughput debugging and profiling
Part 4: Ray Data for ML Inference and Data Preprocessing
Module 1: Ray Data for Batch Inference
Recap from the previous day
Batch inference
Loading Data
Transforming Data
Generating Predictions
Materializing Data
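A minimal batch-inference sketch along the lines of this module, assuming a recent Ray 2.x release (which accepts the concurrency argument to map_batches). The random features and untrained model are placeholders; passing num_gpus=1 to map_batches would pin each model replica to a GPU.

```python
import numpy as np
import torch
import ray

# Load: a dataset of feature vectors (stand-in for real input data).
ds = ray.data.from_numpy(np.random.rand(10_000, 8).astype("float32"))

class Predictor:
    def __init__(self):
        # Heavyweight setup (e.g. loading model weights) happens once per worker.
        self.model = torch.nn.Linear(8, 2)
        self.model.eval()

    def __call__(self, batch: dict) -> dict:
        with torch.no_grad():
            logits = self.model(torch.as_tensor(batch["data"]))
        batch["prediction"] = logits.argmax(dim=1).numpy()
        return batch

# Run the stateful predictor over the dataset; `concurrency` controls the
# number of model replicas kept alive in the actor pool.
predictions = ds.map_batches(Predictor, batch_size=256, concurrency=2)

# Materialize / persist the predictions.
predictions.write_parquet("/tmp/predictions")
```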
Module 2: Ray Data Architecture
Streaming execution
Dataset and blocks
Ray memory model
Operators and planning
Streaming topology
Data flow within an operator
Streaming executor's scheduling loop
Resource management and allocation
Module 3: Diagnosing Ray Data
Overview of the workload
General Diagnostics
Scenario 1: Under-utilization of GPUs
Scenario 2: Disk Spilling
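A quick way to get per-operator timing, memory, and throughput information for a Ray Data workload is Dataset.stats(); a minimal sketch (the pipeline itself is a trivial placeholder):

```python
import ray

# Build and execute a trivial pipeline, then print its execution summary.
ds = ray.data.range(100_000).map_batches(lambda batch: batch)
ds = ds.materialize()
print(ds.stats())
```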
Part 5: Applications and Ray Reference Architecture - Stable Diffusion Pre-training
Module 1: Stable Diffusion and Ray
A simple data pipeline
Introduction to Ray Data
Batch Inference with Stable Diffusion
Stable Diffusion under the hood
Module 2: Primer on Stable Diffusion
Pre-training of a Stable Diffusion Model
Data pre-processing in more detail
Compute requirements for pre-processing and training
Module 3: Pre-processing for Stable Diffusion
High-level overview of the preprocessing pipeline
Reading in the data
Transforming images and captions
Encoding of images and captions
Writing out the preprocessed data
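A heavily stubbed sketch of the preprocessing pipeline shape described above, using Ray Data (Ray 2.x): read raw image/caption pairs, transform them, encode them with a stateful per-worker encoder, and write the result. Paths, column names, and the encoder internals are all placeholders.

```python
import numpy as np
import ray

# Read the raw image/caption pairs; path and column names are illustrative.
ds = ray.data.read_parquet("s3://my-bucket/image-caption-pairs/")

def crop_and_normalize(batch: dict) -> dict:
    # Placeholder transform, assuming fixed-size numeric image arrays.
    batch["image"] = batch["image"] / 255.0
    return batch

class LatentAndTextEncoder:
    def __init__(self):
        # In the course this would load the image (VAE) and text encoders
        # once per worker; here it is a stub.
        pass

    def __call__(self, batch: dict) -> dict:
        # Stub "encoding": attach fixed-size latent and embedding arrays.
        n = len(batch["caption"])
        batch["image_latents"] = np.zeros((n, 4, 32, 32), dtype="float32")
        batch["text_embeddings"] = np.zeros((n, 77, 768), dtype="float32")
        return batch

ds = ds.map_batches(crop_and_normalize)
ds = ds.map_batches(LatentAndTextEncoder, batch_size=64, concurrency=4)

# Persist the preprocessed latents/embeddings for the training job to consume.
ds.write_parquet("s3://my-bucket/preprocessed/")
```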
Module 4: Distributed Training for Stable Diffusion
Load the preprocessed data into a Ray Dataset
Define a stable diffusion model
Define a PyTorch Lightning training loop
Migrate the training loop to Ray Train
Create and fit a Ray Train TorchTrainer
Fault Tolerance in Ray Train
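A hedged sketch of the Lightning-on-Ray-Train pattern this module walks through, assuming Ray 2.x's ray.train.lightning utilities (RayDDPStrategy, RayLightningEnvironment, RayTrainReportCallback, prepare_trainer). The toy LightningModule and dataset stand in for the real diffusion model and preprocessed data.

```python
import lightning.pytorch as pl
import torch
import ray
import ray.train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.train.lightning import (
    RayDDPStrategy, RayLightningEnvironment, RayTrainReportCallback, prepare_trainer)

class ToyModule(pl.LightningModule):
    """Stand-in for the Stable Diffusion LightningModule defined in the course."""
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(1, 1)

    def training_step(self, batch, batch_idx):
        x = batch["x"].float().unsqueeze(1)
        y = batch["y"].float().unsqueeze(1)
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        self.log("loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

def train_loop_per_worker(config):
    # Each worker iterates over its shard of the (preprocessed) Ray Dataset.
    shard = ray.train.get_dataset_shard("train")
    dataloader = shard.iter_torch_batches(batch_size=32)
    trainer = pl.Trainer(
        max_epochs=1,
        accelerator="cpu",
        strategy=RayDDPStrategy(),             # Ray-aware DDP strategy
        plugins=[RayLightningEnvironment()],   # hook Lightning into the Ray Train workers
        callbacks=[RayTrainReportCallback()],  # report metrics/checkpoints to Ray Train
    )
    trainer = prepare_trainer(trainer)
    trainer.fit(ToyModule(), train_dataloaders=dataloader)

train_ds = ray.data.from_items([{"x": float(i), "y": float(i % 2)} for i in range(512)])

ray_trainer = TorchTrainer(
    train_loop_per_worker,
    datasets={"train": train_ds},
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
)
ray_trainer.fit()
```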
Module 5: Distributed Training Optimizations for Stable Diffusion
Using Fully Sharded Data Parallel (FSDP)
Online (end-to-end) preprocessing and training