Your Solution to Distributed Model Training

Increase iteration speed—without increasing cost. Anyscale makes it easy to scale from a single machine to a cloud cluster with just one line of code.


Why Anyscale?

Reduce Costs with Spot Instances

With elastic training on Anyscale, train with minimal interruption from spot instance preemption and node failure—while reducing costs by up to 90%.
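
For example, at the Ray Train level a job can be configured to keep going when spot nodes are reclaimed. A minimal sketch using Ray's open source API (the training function and worker count are illustrative; Anyscale layers managed elastic training on top of this):

```python
from ray.train import FailureConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

def train_func():
    # Your PyTorch training loop; regular checkpointing keeps
    # restarts cheap when a spot node is preempted.
    ...

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
    # Retry on worker or node failure instead of failing the whole job;
    # training resumes from the latest reported checkpoint.
    run_config=RunConfig(failure_config=FailureConfig(max_failures=-1)),
)
trainer.fit()
```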

Faster Iteration, Same Cost

Get the same results—faster. Train with parallelized compute to complete training in minutes, rather than hours. Increase iteration speed with the ability to scale across nodes during development.
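
In Ray Train terms, moving from a single machine to a cluster is a one-line change to the scaling configuration. A minimal sketch (worker counts are illustrative):

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func():
    ...  # your existing PyTorch training loop

# Develop on a single machine...
TorchTrainer(train_func, scaling_config=ScalingConfig(num_workers=1)).fit()

# ...then scale out across the cluster by changing one line.
TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=32, use_gpu=True),
).fit()
```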

Improve Model Quality By Training on All Your Data

Train higher quality models by training on all of your data—not just a subset.
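
With Ray Data feeding Ray Train, each worker streams its shard of the full dataset from cloud storage instead of loading a sample into memory. A minimal sketch (the bucket path, batch size, and worker count are illustrative):

```python
import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# Read the full dataset, not a subset that fits on one machine.
ds = ray.data.read_parquet("s3://my-bucket/training-data/")

def train_func():
    # Each worker streams its own shard of the dataset.
    shard = ray.train.get_dataset_shard("train")
    for batch in shard.iter_batches(batch_size=1024):
        ...  # forward/backward pass over the batch

trainer = TorchTrainer(
    train_func,
    datasets={"train": ds},
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
)
trainer.fit()
```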

One Unified API

Easily scale training for any machine learning library, including XGBoost, TensorFlow, PyTorch, and more.
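
The same Trainer pattern applies across libraries. A minimal sketch with XGBoost (the dataset path, label column, and parameters are illustrative):

```python
import ray
from ray.train import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

ds = ray.data.read_csv("s3://my-bucket/tabular.csv")

trainer = XGBoostTrainer(
    label_column="target",
    params={"objective": "binary:logistic", "eval_metric": "logloss"},
    datasets={"train": ds},
    scaling_config=ScalingConfig(num_workers=4),
)
result = trainer.fit()
print(result.metrics)
```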


How DoorDash Built Reliable Delivery Forecasting with Ray

Balancing supply and demand is crucial for marketplace app health, and DoorDash relies on Ray to get it done. See how DoorDash achieved a 46% reduction in training time and a 42% reduction in cost with Ray.

Easily Get Started with Distributed Training at Scale

Elastic Training & Spot Instance Support

Apache Spark
-
Ray
-
anyscale blue

Job Retries & Fault Tolerance Support

Apache Spark
Ray
-
anyscale blue

Fast Node Launching and Autoscaling

Apache Spark
-
Ray
-
anyscale blue
60 sec

Fractional Heterogeneous Resource Allocation

Apache Spark
-
Ray
anyscale blue

Detailed Training Dashboard

Apache Spark
-
Ray
-
anyscale blue

Last-Mile Data Preprocessing

Apache Spark
-
Ray
anyscale blue

Autoscaling Development Environment

Apache Spark
-
Ray
-
anyscale blue

Distributed Debugger

Apache Spark
-
Ray
-
anyscale blue

Data Integrations (Databricks, Snowflake, S3, GCS, etc)

Apache Spark
Ray
anyscale blue

Framework Support (Pytorch, Huggingface, Tensorflow, XGBoost, etc)

Apache Spark
Ray
anyscale blue

Experiment Tracking Integrations (Weights and Biases, MLflow, etc)

Apache Spark
Ray
anyscale blue

Orchestration Integrations (Prefect, Apache Airflow, etc)

Apache Spark
Ray
anyscale blue

Alerting

Apache Spark
Ray
-
anyscale blue

Resumable Jobs

Apache Spark
Ray
anyscale blue

Priority Scheduling

Apache Spark
-
Ray
anyscale blue

Job Queues

Apache Spark
Ray
-
anyscale blue

EFA Support

Apache Spark
Ray
Custom
anyscale blue
Apache Spark
Ray
anyscale blue

Elastic Training & Spot Instance Support

Apache Spark
-
Ray
-
anyscale blue

Job Retries & Fault Tolerance Support

Apache Spark
Ray
-
anyscale blue

Fast Node Launching and Autoscaling

Apache Spark
-
Ray
-
anyscale blue
60 sec

Fractional Heterogeneous Resource Allocation

Apache Spark
-
Ray
anyscale blue

Detailed Training Dashboard

Apache Spark
-
Ray
-
anyscale blue

Last-Mile Data Preprocessing

Apache Spark
-
Ray
anyscale blue

Autoscaling Development Environment

Apache Spark
-
Ray
-
anyscale blue

Distributed Debugger

Apache Spark
-
Ray
-
anyscale blue

Data Integrations (Databricks, Snowflake, S3, GCS, etc)

Apache Spark
Ray
anyscale blue

Framework Support (Pytorch, Huggingface, Tensorflow, XGBoost, etc)

Apache Spark
Ray
anyscale blue

Experiment Tracking Integrations (Weights and Biases, MLflow, etc)

Apache Spark
Ray
anyscale blue

Orchestration Integrations (Prefect, Apache Airflow, etc)

Apache Spark
Ray
anyscale blue

Alerting

Apache Spark
Ray
-
anyscale blue

Resumable Jobs

Apache Spark
Ray
anyscale blue

Priority Scheduling

Apache Spark
-
Ray
anyscale blue

Job Queues

Apache Spark
Ray
-
anyscale blue

EFA Support

Apache Spark
Ray
Custom
anyscale blue

Set-it-and-Forget-it Model Training

With built-in fault tolerance and automatic job retries, Anyscale helps ensure your training job completes even when errors occur. Easily recover from system failures and resume training from a recent checkpoint.
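
Inside the training loop, this amounts to reporting checkpoints as you train and restoring from the latest one on restart. A minimal sketch using Ray Train's checkpoint API (the model and epoch count are illustrative):

```python
import os
import tempfile

import torch
import ray.train
from ray.train import Checkpoint

def train_func():
    model = torch.nn.Linear(10, 1)  # stand-in for your real model
    start_epoch = 0

    # After a failure and retry, resume from the latest checkpoint.
    checkpoint = ray.train.get_checkpoint()
    if checkpoint:
        with checkpoint.as_directory() as ckpt_dir:
            state = torch.load(os.path.join(ckpt_dir, "state.pt"))
            model.load_state_dict(state["model"])
            start_epoch = state["epoch"] + 1

    for epoch in range(start_epoch, 100):
        ...  # train one epoch
        with tempfile.TemporaryDirectory() as tmp:
            torch.save(
                {"model": model.state_dict(), "epoch": epoch},
                os.path.join(tmp, "state.pt"),
            )
            # Report progress and persist the checkpoint.
            ray.train.report(
                {"epoch": epoch},
                checkpoint=Checkpoint.from_directory(tmp),
            )
```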

Faster time to market

Training Utilization Dashboard

Gain insight into your distributed training job progress and track utilization to ensure you’re getting the most out of your compute resources.


Reduce Costs By Using GPUs and CPUs

Anyscale makes it easy to leverage heterogeneous compute. Use CPUs and GPUs in the same pipeline to increase utilization, fully saturate GPUs, and decrease costs.
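
With Ray Data, for example, CPU workers can handle preprocessing while model replicas share GPUs fractionally, keeping each GPU busy. A minimal sketch (the bucket path, column name, model, and resource numbers are illustrative):

```python
import ray

ds = ray.data.read_parquet("s3://my-bucket/features/")

def preprocess(batch):
    # CPU-bound work: decoding, normalization, feature engineering.
    batch["value"] = batch["value"] / 255.0
    return batch

class Scorer:
    def __init__(self):
        # Load the model onto the GPU once per actor.
        self.model = ...  # stand-in for your real model

    def __call__(self, batch):
        # The GPU-bound forward pass would go here.
        return batch

# Preprocessing runs on CPU workers...
ds = ds.map_batches(preprocess, num_cpus=1)
# ...while four scorer replicas share GPUs at 0.25 GPU each,
# so a single GPU stays fully saturated.
ds = ds.map_batches(Scorer, concurrency=4, num_gpus=0.25, batch_size=64)
```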

How Uber Did Distributed GPU Training with Ray

See why Uber transitioned from Spark-based XGBoost to Ray for better scalability and reliability for distributed GPU training and tuning.

Out-of-the-Box Templates & App Accelerators

Jumpstart your development process with custom-made templates, only available on Anyscale.

End-to-End LLM Workflows

Develop and productionize LLMs at scale with end-to-end workflows.

Fine-Tune Stable Diffusion

Fine-tune a personalized Stable Diffusion XL model with Ray Train

Pre-Train Stable Diffusion

Pre-train a Stable Diffusion V2 model with Ray Train and Ray Data


“We have no ceiling on scale, and an incredible opportunity to bring AI features and value to our 170 million users.”

Greg Roodt
ML Lead, Canva

FAQs

Which training frameworks does Anyscale support?

Anyscale's Ray Train integrates with your preferred training frameworks, including PyTorch, Hugging Face, TensorFlow, and more.

Distributed AI Model Training at Scale

Enable simple, fast, and affordable distributed model training with Anyscale. Learn more, or get started today.