Ray Train

ML library for distributed model training. Anyscale supports and further optimizes Ray Train for improved performance, reliability, and scale.

  • 12x faster iteration for companies like Canva
  • 60s node startup and autoscaling
  • Up to 60% lower costs on many workloads (vs. open source Ray) through spot instance and elastic training support
  • 50% reduction in cloud costs for companies like Canva

What is Ray Train?

Ray Train is an open source machine learning library built on top of Ray, a best-in-class distributed compute platform for AI/ML workloads.

Ray Train integrates with your preferred training frameworks, including PyTorch, Hugging Face, TensorFlow, XGBoost, and more, so you can develop with your preferred tech stack, then scale to the cloud with just one line of code.
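As a concrete sketch of that pattern (the model, data, and worker counts below are illustrative stand-ins, and `use_gpu=True` assumes GPU nodes are available), a PyTorch loop wrapped in Ray Train's `TorchTrainer` looks like this; the `ScalingConfig` argument is the one-line change that takes the same loop from a laptop to a multi-GPU cluster:

```python
import torch
from torch import nn

import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    # Toy model and data; swap in your own training code.
    device = ray.train.torch.get_device()
    model = ray.train.torch.prepare_model(nn.Linear(10, 1))  # DDP + device placement
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    loss_fn = nn.MSELoss()

    for epoch in range(config["epochs"]):
        X = torch.randn(32, 10, device=device)
        y = torch.randn(32, 1, device=device)
        loss = loss_fn(model(X), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        ray.train.report({"epoch": epoch, "loss": loss.item()})


# Scaling out is a matter of this one configuration line.
trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"lr": 1e-2, "epochs": 3},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
result = trainer.fit()
```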


Use Cases

Distributed Training

Increase training iteration speed without increasing cost by implementing distributed training on Anyscale. Easily scale from your laptop to any number of GPUs with just one line of code.
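To make the "one line of code" claim concrete: assuming the `TorchTrainer` pattern shown earlier, the training loop itself is untouched and only the `ScalingConfig` changes between a laptop run and a cluster run (worker counts here are illustrative):

```python
from ray.train import ScalingConfig

# Local development, e.g. on a laptop: a couple of CPU workers.
dev_scaling = ScalingConfig(num_workers=2, use_gpu=False)

# The same job on a cluster: raise num_workers and enable GPUs.
cluster_scaling = ScalingConfig(num_workers=16, use_gpu=True)
```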


Benefits

Set-it-and-Forget-it Training

Ray Train includes built-in checkpointing, so failures don't waste compute: recover from system failures and resume training from the most recent checkpoint.
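Here is a minimal sketch of that loop using Ray Train's public checkpoint API (the model and the `model.pt` file name are stand-ins): each epoch reports a checkpoint, and on restart the loop picks up from the latest one.

```python
import os
import tempfile

import torch
from torch import nn

import ray.train
from ray.train import Checkpoint


def train_loop_per_worker(config):
    model = nn.Linear(10, 1)  # stand-in model
    start_epoch = 0

    # On restart (e.g. after a node failure), resume from the latest checkpoint.
    checkpoint = ray.train.get_checkpoint()
    if checkpoint:
        with checkpoint.as_directory() as ckpt_dir:
            state = torch.load(os.path.join(ckpt_dir, "model.pt"))
            model.load_state_dict(state["model"])
            start_epoch = state["epoch"] + 1

    for epoch in range(start_epoch, config["epochs"]):
        ...  # one epoch of training goes here

        # Save and report a checkpoint so this epoch's work is never lost.
        with tempfile.TemporaryDirectory() as tmp_dir:
            torch.save(
                {"model": model.state_dict(), "epoch": epoch},
                os.path.join(tmp_dir, "model.pt"),
            )
            ray.train.report(
                {"epoch": epoch},
                checkpoint=Checkpoint.from_directory(tmp_dir),
            )
```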

Faster Iteration, Same Cost

Train with parallelized compute to complete training jobs faster. Increase iteration speed with the ability to scale across nodes during development.

Maximize GPU and CPU Utilization

Leverage CPUs and GPUs in the same pipeline to increase GPU utilization and decrease costs.

Compatible with Any Training Framework

Integrate with training frameworks like PyTorch, Hugging Face, TensorFlow, and more. Develop with your preferred tech stack, then scale to the cloud with just one line of code.
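The pattern is the same regardless of framework. As a sketch using the Ray 2.x `XGBoostTrainer` API (the in-memory toy dataset is a stand-in; in practice it might be read from S3, GCS, or Parquet):

```python
import ray
from ray.train import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

# Toy dataset; Ray Data can also read from S3, GCS, Parquet, Databricks, etc.
train_ds = ray.data.from_items([{"x": float(i), "y": i % 2} for i in range(1000)])

trainer = XGBoostTrainer(
    label_column="y",
    params={"objective": "binary:logistic"},
    datasets={"train": train_ds},
    scaling_config=ScalingConfig(num_workers=4),  # illustrative worker count
)
result = trainer.fit()
```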

Supercharge Ray Train with Anyscale

Reduce Costs with Spot Instances

Anyscale’s Ray Train automatically recovers with minimal interruption from spot instance preemption and node failure, while spot instances reduce compute costs by up to 90%.
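Automatic spot re-provisioning and elastic scaling are Anyscale-side enhancements, but the open source knob they build on can be sketched: a `FailureConfig` tells Ray Train to restart after failures and resume from the latest reported checkpoint (the retry count and run name are illustrative, and `train_loop_per_worker` is the checkpointing loop sketched earlier):

```python
from ray.train import FailureConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

trainer = TorchTrainer(
    train_loop_per_worker,  # the checkpointing loop sketched above
    train_loop_config={"epochs": 10},
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
    run_config=RunConfig(
        name="spot-tolerant-run",
        # Restart up to 3 times on worker or node failure (e.g. spot
        # preemption), resuming from the most recent reported checkpoint.
        failure_config=FailureConfig(max_failures=3),
    ),
)
result = trainer.fit()
```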


Easily Get Started with Distributed Training at Scale

Elastic Training & Spot Instance Support

Apache Spark
-
Ray
-
anyscale blue

Job Retries & Fault Tolerance Support

Apache Spark
Ray
-
anyscale blue

Fast Node Launching and Autoscaling

Apache Spark
-
Ray
-
anyscale blue
60 sec

Fractional Heterogeneous Resource Allocation

Apache Spark
-
Ray
anyscale blue

Detailed Training Dashboard

Apache Spark
-
Ray
-
anyscale blue

Last-Mile Data Preprocessing

Apache Spark
-
Ray
anyscale blue

Autoscaling Development Environment

Apache Spark
-
Ray
-
anyscale blue

Distributed Debugger

Apache Spark
-
Ray
-
anyscale blue

Data Integrations (Databricks, Snowflake, S3, GCS, etc)

Apache Spark
Ray
anyscale blue

Framework Support (Pytorch, Huggingface, Tensorflow, XGBoost, etc)

Apache Spark
Ray
anyscale blue

Experiment Tracking Integrations (Weights and Biases, MLflow, etc)

Apache Spark
Ray
anyscale blue

Orchestration Integrations (Prefect, Apache Airflow, etc)

Apache Spark
Ray
anyscale blue

Alerting

Apache Spark
Ray
-
anyscale blue

Resumable Jobs

Apache Spark
Ray
anyscale blue

Priority Scheduling

Apache Spark
-
Ray
anyscale blue

Job Queues

Apache Spark
Ray
-
anyscale blue

EFA Support

Apache Spark
Ray
Custom
anyscale blue
Apache Spark
Ray
anyscale blue

Elastic Training & Spot Instance Support

Apache Spark
-
Ray
-
anyscale blue

Job Retries & Fault Tolerance Support

Apache Spark
Ray
-
anyscale blue

Fast Node Launching and Autoscaling

Apache Spark
-
Ray
-
anyscale blue
60 sec

Fractional Heterogeneous Resource Allocation

Apache Spark
-
Ray
anyscale blue

Detailed Training Dashboard

Apache Spark
-
Ray
-
anyscale blue

Last-Mile Data Preprocessing

Apache Spark
-
Ray
anyscale blue

Autoscaling Development Environment

Apache Spark
-
Ray
-
anyscale blue

Distributed Debugger

Apache Spark
-
Ray
-
anyscale blue

Data Integrations (Databricks, Snowflake, S3, GCS, etc)

Apache Spark
Ray
anyscale blue

Framework Support (Pytorch, Huggingface, Tensorflow, XGBoost, etc)

Apache Spark
Ray
anyscale blue

Experiment Tracking Integrations (Weights and Biases, MLflow, etc)

Apache Spark
Ray
anyscale blue

Orchestration Integrations (Prefect, Apache Airflow, etc)

Apache Spark
Ray
anyscale blue

Alerting

Apache Spark
Ray
-
anyscale blue

Resumable Jobs

Apache Spark
Ray
anyscale blue

Priority Scheduling

Apache Spark
-
Ray
anyscale blue

Job Queues

Apache Spark
Ray
-
anyscale blue

EFA Support

Apache Spark
Ray
Custom
anyscale blue

Out-of-the-Box Templates & App Accelerators

Jumpstart your development process with custom-made templates, only available on Anyscale.

End-to-End LLM Workflows

Execute end-to-end LLM workflows to develop and productionize LLMs at scale.

Pre-Train Stable Diffusion

Pre-train a Stable Diffusion V2 model with Ray Train and Ray Data.

Fine-Tune Stable Diffusion

Fine-tune a personalized Stable Diffusion XL model with Ray Train.


“We have no ceiling on scale, and an incredible opportunity to bring AI features and value to our 170 million users.”

Greg Roodt
ML Lead, Canva

FAQs

How is Ray Train on Anyscale different from open source Ray Train?

Ray Train is an open source machine learning library built on top of Ray, a best-in-class distributed compute platform for AI/ML workloads. Anyscale, built by the creators of Ray, offers additional proprietary enhancements on top of open source Ray Train, like:

  • Elastic training
  • Spot instance support
  • Enhanced fault tolerance
  • Resumable jobs
  • And much more

Distributed AI Model Training at Scale

Enable simple, fast, and affordable distributed model training with Anyscale. Learn more, or get started today.