Increase iteration speed—without increasing cost. Anyscale makes it easy to scale from a single machine to a cloud cluster with just one line of code.
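For example, an existing single-machine PyTorch loop can fan out across a cluster by adjusting one scaling parameter. A minimal sketch using the open-source Ray Train API that Anyscale runs on; the training function body and worker count are illustrative placeholders:

```python
# Sketch: scaling an existing training loop with Ray Train.
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker():
    # Your existing single-machine PyTorch training loop goes here,
    # unchanged; Ray Train wraps it for distributed execution.
    ...

trainer = TorchTrainer(
    train_loop_per_worker,
    # The "one line" that scales: raise num_workers to spread the same
    # loop across a cluster, and flip use_gpu for GPU training.
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
)
result = trainer.fit()
```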
With elastic training on Anyscale, train with minimal interruption from spot instance preemption and node failure—while reducing costs by up to 90%.
Get the same results—faster. Train with parallelized compute to complete training in minutes, rather than hours. Increase iteration speed with the ability to scale across nodes during development.
Train higher quality models by training on all of your data—not just a subset.
Easily scale your training for any machine learning library—from XGBoost to TensorFlow to PyTorch and more.
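The same trainer-plus-ScalingConfig pattern carries across libraries: swap in the trainer class for your framework and keep the scaling configuration. A sketch with Ray Train's XGBoost integration, assuming a hypothetical dataset path and label column:

```python
# Sketch: the same scaling pattern, applied to XGBoost.
import ray
from ray.train import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

# Hypothetical dataset; replace with your own source.
train_ds = ray.data.read_csv("s3://my-bucket/train.csv")

trainer = XGBoostTrainer(
    label_column="target",  # placeholder label column
    params={"objective": "binary:logistic", "eval_metric": "logloss"},
    datasets={"train": train_ds},
    scaling_config=ScalingConfig(num_workers=4),
)
result = trainer.fit()
```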
Balancing supply and demand is crucial for marketplace app health—and DoorDash relies on Ray to get it done. See how DoorDash achieved a 46% reduction in training time and a 42% reduction in cost with Ray.
| Capability | Ray Open Source | Anyscale |
| --- | --- | --- |
| Elastic Training & Spot Instance Support | - | ✓ |
| Job Retries & Fault Tolerance Support | - | ✓ |
| Fast Node Launching and Autoscaling | - | 60 sec |
| Fractional Heterogeneous Resource Allocation | - | ✓ |
| Detailed Training Dashboard | - | ✓ |
| Last-Mile Data Preprocessing | - | ✓ |
| Autoscaling Development Environment | - | ✓ |
| Distributed Debugger | - | ✓ |
| Data Integrations (Databricks, Snowflake, S3, GCS, etc.) | ✓ | ✓ |
| Framework Support (PyTorch, Hugging Face, TensorFlow, XGBoost, etc.) | ✓ | ✓ |
| Experiment Tracking Integrations (Weights & Biases, MLflow, etc.) | ✓ | ✓ |
| Orchestration Integrations (Prefect, Apache Airflow, etc.) | ✓ | ✓ |
| Alerting | - | ✓ |
| Resumable Jobs | ✓ | ✓ |
| Priority Scheduling | - | ✓ |
| Job Queues | - | ✓ |
| EFA Support | Custom | ✓ |
With built-in fault tolerance and automatic job retries, Anyscale helps ensure your training job runs to completion even when errors occur. Easily recover from system failures and resume training from the most recent checkpoint.
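In open-source Ray Train terms, this maps to a failure configuration plus persistent checkpoint storage. The sketch below assumes a hypothetical S3 bucket, run name, and retry budget:

```python
# Sketch: automatic retries and checkpoint-based recovery in Ray Train.
from ray.train import FailureConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

trainer = TorchTrainer(
    train_loop_per_worker,  # your training function, assumed defined above
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
    run_config=RunConfig(
        name="resilient-training-run",           # hypothetical run name
        storage_path="s3://my-bucket/checkpoints",  # hypothetical bucket
        # Automatically restart from the latest reported checkpoint,
        # up to three times, if a worker or node fails.
        failure_config=FailureConfig(max_failures=3),
    ),
)
trainer.fit()
```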
Gain insight into your distributed training job progress and track utilization to ensure you’re getting the most out of your compute resources.
Anyscale makes it easy to leverage heterogeneous compute. Use CPUs and GPUs in the same pipeline to increase utilization, fully saturate GPUs, and decrease costs.
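One common shape for this is streaming preprocessing with Ray Data on CPU nodes while GPU workers run the training loop. A sketch, with placeholder paths and transform logic:

```python
# Sketch: CPU preprocessing feeding GPU training in one pipeline.
import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def preprocess(batch):
    # CPU-heavy feature engineering; scheduled on CPU resources,
    # keeping GPUs free for the training loop.
    return batch

# Hypothetical dataset; preprocessing streams alongside training.
ds = ray.data.read_parquet("s3://my-bucket/data").map_batches(preprocess)

def train_loop_per_worker():
    # Each GPU worker streams its shard of the preprocessed data.
    shard = ray.train.get_dataset_shard("train")
    ...

trainer = TorchTrainer(
    train_loop_per_worker,
    datasets={"train": ds},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
trainer.fit()
```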
Jumpstart your development process with custom-made templates, only available on Anyscale.
Execute end-to-end LLM workflows to develop and productionize LLMs at scale
Fine-tune a personalized Stable Diffusion XL model with Ray Train
Pre-train a Stable Diffusion V2 model with Ray Train and Ray Data
Anyscale's Ray Train integrates with your preferred training frameworks, including PyTorch, Hugging Face, TensorFlow, and more.
Enable simple, fast, and affordable distributed model training with Anyscale. Learn more, or get started today.