Ray Train

ML library for distributed model training. Anyscale supports and further optimizes Ray Train for improved performance, reliability, and scale.

  • 12x faster iteration for companies like Canva
  • 60s node startup and autoscaling
  • Up to 60% lower costs on many workloads (vs. open source Ray) through spot instance and elastic training support
  • 50% reduction in cloud costs for companies like Canva

What is Ray Train?

Ray Train is an open source machine learning library built on top of Ray, a best-in-class distributed compute platform for AI/ML workloads.

Ray Train integrates with your preferred training frameworks, including PyTorch, Hugging Face, TensorFlow, XGBoost, and more, so you can develop with your preferred tech stack, then scale to the cloud with just one line of code.
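As a concrete sketch of that pattern (the model, data, and worker counts below are illustrative stand-ins, and `use_gpu=True` assumes GPU nodes are available), a PyTorch loop wrapped in Ray Train's `TorchTrainer` looks like this; the `ScalingConfig` argument is the one-line change that takes the same loop from a laptop to a multi-GPU cluster:

```python
import torch
from torch import nn

import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    # Toy model and data; swap in your own training code.
    device = ray.train.torch.get_device()
    model = ray.train.torch.prepare_model(nn.Linear(10, 1))  # DDP + device placement
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    loss_fn = nn.MSELoss()

    for epoch in range(config["epochs"]):
        X = torch.randn(32, 10, device=device)
        y = torch.randn(32, 1, device=device)
        loss = loss_fn(model(X), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        ray.train.report({"epoch": epoch, "loss": loss.item()})


# Scaling out is a matter of this one configuration line.
trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"lr": 1e-2, "epochs": 3},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
result = trainer.fit()
```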


Use Cases

Distributed Training

Increase training iteration speed without increasing cost by implementing distributed training on Anyscale. Easily scale from your laptop to any number of GPUs with just one line of code.
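To make the "one line of code" claim concrete: assuming the `TorchTrainer` pattern shown earlier, the training loop itself is untouched and only the `ScalingConfig` changes between a laptop run and a cluster run (worker counts here are illustrative):

```python
from ray.train import ScalingConfig

# Local development, e.g. on a laptop: a couple of CPU workers.
dev_scaling = ScalingConfig(num_workers=2, use_gpu=False)

# The same job on a cluster: raise num_workers and enable GPUs.
cluster_scaling = ScalingConfig(num_workers=16, use_gpu=True)
```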


Benefits

Set-it-and-Forget-it Training

Ray Train includes built-in checkpointing, so failures don't waste compute: recover from system failures and resume training from the most recent checkpoint.
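Here is a minimal sketch of that loop using Ray Train's public checkpoint API (the model and the `model.pt` file name are stand-ins): each epoch reports a checkpoint, and on restart the loop picks up from the latest one.

```python
import os
import tempfile

import torch
from torch import nn

import ray.train
from ray.train import Checkpoint


def train_loop_per_worker(config):
    model = nn.Linear(10, 1)  # stand-in model
    start_epoch = 0

    # On restart (e.g. after a node failure), resume from the latest checkpoint.
    checkpoint = ray.train.get_checkpoint()
    if checkpoint:
        with checkpoint.as_directory() as ckpt_dir:
            state = torch.load(os.path.join(ckpt_dir, "model.pt"))
            model.load_state_dict(state["model"])
            start_epoch = state["epoch"] + 1

    for epoch in range(start_epoch, config["epochs"]):
        ...  # one epoch of training goes here

        # Save and report a checkpoint so this epoch's work is never lost.
        with tempfile.TemporaryDirectory() as tmp_dir:
            torch.save(
                {"model": model.state_dict(), "epoch": epoch},
                os.path.join(tmp_dir, "model.pt"),
            )
            ray.train.report(
                {"epoch": epoch},
                checkpoint=Checkpoint.from_directory(tmp_dir),
            )
```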

Faster Iteration, Same Cost

Train with parallelized compute to complete training jobs faster. Increase iteration speed with the ability to scale across nodes during development.

Maximize GPU and CPU Utilization

Leverage CPUs and GPUs in the same pipeline to increase GPU utilization and decrease costs.

Compatible with Any Training Framework

Integrate with training frameworks like PyTorch, Hugging Face, TensorFlow, and more. Develop with your preferred tech stack, then scale to the cloud with just one line of code.
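The pattern is the same regardless of framework. As a sketch using the Ray 2.x `XGBoostTrainer` API (the in-memory toy dataset is a stand-in; in practice it might be read from S3, GCS, or Parquet):

```python
import ray
from ray.train import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

# Toy dataset; Ray Data can also read from S3, GCS, Parquet, Databricks, etc.
train_ds = ray.data.from_items([{"x": float(i), "y": i % 2} for i in range(1000)])

trainer = XGBoostTrainer(
    label_column="y",
    params={"objective": "binary:logistic"},
    datasets={"train": train_ds},
    scaling_config=ScalingConfig(num_workers=4),  # illustrative worker count
)
result = trainer.fit()
```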

Supercharge Ray Train with Anyscale

Reduce Costs with Spot Instances

Anyscale’s Ray Train automatically recovers with minimal interruption from spot instance preemption and node failure, while spot instances reduce compute costs by up to 90%.
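Automatic spot re-provisioning and elastic scaling are Anyscale-side enhancements, but the open source knob they build on can be sketched: a `FailureConfig` tells Ray Train to restart after failures and resume from the latest reported checkpoint (the retry count and run name are illustrative, and `train_loop_per_worker` is the checkpointing loop sketched earlier):

```python
from ray.train import FailureConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

trainer = TorchTrainer(
    train_loop_per_worker,  # the checkpointing loop sketched above
    train_loop_config={"epochs": 10},
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
    run_config=RunConfig(
        name="spot-tolerant-run",
        # Restart up to 3 times on worker or node failure (e.g. spot
        # preemption), resuming from the most recent reported checkpoint.
        failure_config=FailureConfig(max_failures=3),
    ),
)
result = trainer.fit()
```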


Easily Get Started with Distributed Training at Scale

Elastic Training & Spot Instance Support

Apache Spark
-
Ray
-
anyscale blue

Job Retries & Fault Tolerance Support

Apache Spark
Ray
-
anyscale blue

Fast Node Launching and Autoscaling

Apache Spark
-
Ray
-
anyscale blue
60 sec

Fractional Heterogeneous Resource Allocation

Apache Spark
-
Ray
anyscale blue

Detailed Training Dashboard

Apache Spark
-
Ray
-
anyscale blue

Last-Mile Data Preprocessing

Apache Spark
-
Ray
anyscale blue

Autoscaling Development Environment

Apache Spark
-
Ray
-
anyscale blue

Distributed Debugger

Apache Spark
-
Ray
-
anyscale blue

Data Integrations (Databricks, Snowflake, S3, GCS, etc)

Apache Spark
Ray
anyscale blue

Framework Support (Pytorch, Huggingface, Tensorflow, XGBoost, etc)

Apache Spark
Ray
anyscale blue

Experiment Tracking Integrations (Weights and Biases, MLflow, etc)

Apache Spark
Ray
anyscale blue

Orchestration Integrations (Prefect, Apache Airflow, etc)

Apache Spark
Ray
anyscale blue

Alerting

Apache Spark
Ray
-
anyscale blue

Resumable Jobs

Apache Spark
Ray
anyscale blue

Priority Scheduling

Apache Spark
-
Ray
anyscale blue

Job Queues

Apache Spark
Ray
-
anyscale blue

EFA Support

Apache Spark
Ray
Custom
anyscale blue
Apache Spark
Ray
anyscale blue

Elastic Training & Spot Instance Support

Apache Spark
-
Ray
-
anyscale blue

Job Retries & Fault Tolerance Support

Apache Spark
Ray
-
anyscale blue

Fast Node Launching and Autoscaling

Apache Spark
-
Ray
-
anyscale blue
60 sec

Fractional Heterogeneous Resource Allocation

Apache Spark
-
Ray
anyscale blue

Detailed Training Dashboard

Apache Spark
-
Ray
-
anyscale blue

Last-Mile Data Preprocessing

Apache Spark
-
Ray
anyscale blue

Autoscaling Development Environment

Apache Spark
-
Ray
-
anyscale blue

Distributed Debugger

Apache Spark
-
Ray
-
anyscale blue

Data Integrations (Databricks, Snowflake, S3, GCS, etc)

Apache Spark
Ray
anyscale blue

Framework Support (Pytorch, Huggingface, Tensorflow, XGBoost, etc)

Apache Spark
Ray
anyscale blue

Experiment Tracking Integrations (Weights and Biases, MLflow, etc)

Apache Spark
Ray
anyscale blue

Orchestration Integrations (Prefect, Apache Airflow, etc)

Apache Spark
Ray
anyscale blue

Alerting

Apache Spark
Ray
-
anyscale blue

Resumable Jobs

Apache Spark
Ray
anyscale blue

Priority Scheduling

Apache Spark
-
Ray
anyscale blue

Job Queues

Apache Spark
Ray
-
anyscale blue

EFA Support

Apache Spark
Ray
Custom
anyscale blue

Out-of-the-Box Templates & App Accelerators

Jumpstart your development process with custom-made templates, only available on Anyscale.

End-to-End LLM Workflows

Execute end-to-end LLM workflows to develop and productionize LLMs at scale.

Pre-Train Stable Diffusion

Pre-train a Stable Diffusion V2 model with Ray Train and Ray Data.

Fine-Tune Stable Diffusion

Fine-tune a personalized Stable Diffusion XL model with Ray Train.


“We have no ceiling on scale, and an incredible opportunity to bring AI features and value to our 170 million users.”

Greg Roodt
ML Lead, Canva

FAQs

How is Ray Train on Anyscale different from open source Ray Train?

Ray Train is an open source machine learning library built on top of Ray, a best-in-class distributed compute platform for AI/ML workloads. Anyscale, built by the creators of Ray, offers additional proprietary enhancements on top of open source Ray Train, like:

  • Elastic training
  • Spot instance support
  • Enhanced fault tolerance
  • Resumable jobs
  • And much more

Distributed AI Model Training at Scale

Enable simple, fast, and affordable distributed model training with Anyscale. Learn more, or get started today.