Ray for Practitioners

Learn how to leverage Ray with hands-on instructor-led virtual training

Description

AI, ML, and data professionals from all disciplines will benefit from this comprehensive introduction to the components of the Anyscale and Ray platforms that directly support deploying scalable AI and data-processing applications to production. You'll use Python and Ray to define and schedule AI tasks that efficiently handle complex computations and data transformations, powering analytics applications and AI-driven solutions. The course offers hands-on instruction in the Anyscale Unified AI Platform, the Ray AI Libraries, Ray Serve, and techniques for effective resource management and fault tolerance.

Audience

This course is designed for anyone looking for a holistic ML/AI and data processing solution, including:

  • Machine learning practitioners

  • Platform engineers

  • Software engineers

Prerequisites

  • Familiarity with AI use cases.
  • Familiarity with basic ML concepts and workflows.
  • Familiarity with Python, JupyterLab notebooks, VSCode.
  • Ability to perform basic code development tasks in Python.

Course Outline

Part 1: Why Ray? Why Anyscale? Anyscale Overview

Module 1: Why Ray and Why Anyscale

  • Introduction to Anyscale and Ray

  • The AI Complexity Wall and How Ray Helps

  • Overview of the Anyscale Unified AI Platform

  • Introduction to Ray Turbo

Module 2: Anyscale Overview

  • Understanding Anyscale Workspaces and compute resources

  • Monitoring and debugging Ray applications

  • Configurations

  • Production Jobs

  • Production Services

Part 2: Ray AI Libraries Overview

Module 1: Introduction to the Ray AI Libraries

  • Overview of the Ray AI Libraries

  • Quick end-to-end example with XGBoost

Module 2: Introduction to Ray Train

  • Single GPU PyTorch

  • Overview of the training loop in Ray Train

  • Migrating the model to Ray Train

  • Migrating the dataset to Ray Train

  • Reporting metrics and checkpoints

  • Launching the distributed training job

  • Accessing training results

  • Ray Train in production

Module 3: Introduction to Ray Tune

  • Loading the data

  • Starting out with vanilla PyTorch

  • Hyperparameter tuning with Ray Tune

  • Ray Tune in production

Module 4: Introduction to Ray Data

  • When to use Ray Data

  • Loading Data

  • Transforming Data

  • Materializing Data

  • Data Operations: Grouping, Aggregation, and Shuffling

  • Persisting Data

  • Ray Data in Production

Module 5: Introduction to Ray Serve

  • Overview of Ray Serve

  • Implement an MNISTClassifier service

  • Advanced features of Ray Serve

  • Ray Serve in Production

Module 6: Introduction to Ray Core

  • Ray Core overview

  • @ray.remote and ray.get()

  • Ray tasks can launch other tasks

  • Ray Actors

Part 3: Ray Train for distributed model training

Module 1: Ray Train Deep Dive

  • Recap from the previous day

  • Integrating Ray Train with Ray Data

  • Fault tolerance in Ray Train

  • Integration with Lightning

Module 2: Ray Train Observability

  • Starting with a sample distributed training loop

  • Using the Ray dashboard

  • Profiling the training loop

  • Adding Ray Data

Module 3: Tuning Configs for Cost and Performance

  • Memory capacity vs speed trade-off

  • How to benchmark throughput

  • Importance of Empirical Benchmarks

Module 4: Debugging Ray Train common failures

  • Trainer hang

  • Data loading bottlenecks

  • Resource over-provisioning for data preprocessing

  • Throughput debugging and profiling

Part 4: Ray Data for ML Inference and Data Preprocessing

Module 1: Ray Data for Batch Inference

  • Recap from the previous day

  • Batch inference

  • Loading Data

  • Transforming Data

  • Generating Predictions

  • Materializing Data

Module 2: Ray Data Architecture

  • Streaming execution

  • Dataset and blocks

  • Ray memory model

  • Operators and planning

  • Streaming topology

  • Data flow within an operator

  • Streaming executor's scheduling loop

  • Resource management and allocation

Module 3: Diagnosing Ray Data

  • Overview of the workload

  • General Diagnostics

  • Scenario 1: Under-utilization of GPUs

  • Scenario 2: Disk Spilling

Part 5: Applications and Ray Reference Architecture - Stable Diffusion Pre-Training

Module 1: Stable Diffusion and Ray

  • A simple data pipeline

  • Introduction to Ray Data

  • Batch Inference with Stable Diffusion

  • Stable Diffusion under the hood

Module 2: Primer on Stable Diffusion

  • Pre-training of a Stable Diffusion Model

  • Data pre-processing in more detail

  • Compute requirements for pre-processing and training

Module 3: Pre-processing for Stable Diffusion

  • High-level overview of the preprocessing pipeline

  • Reading in the data

  • Transforming images and captions

  • Encoding of images and captions

  • Writing out the preprocessed data

Module 4: Distributed Training for Stable Diffusion

  • Load the preprocessed data into a Ray Dataset

  • Define a stable diffusion model

  • Define a PyTorch Lightning training loop

  • Migrate the training loop to Ray Train

  • Create and fit a Ray Train TorchTrainer

  • Fault Tolerance in Ray Train

Module 5: Distributed Training Optimizations for Stable Diffusion

  • Using Fully Sharded Data Parallel (FSDP)

  • Online (end-to-end) preprocessing and training


    $1,500

    • 14 hours
    • 5 half-day virtual instructor-led sessions

    This course also includes:
    • Live virtual instructor
    • Hands-on exercises
    • Access to a workspace environment
    • Certificate of completion (upon request)