Best-in-Class LLM Batch Inference

Run LLM batch inference jobs at lower cost than the competition. See why Anyscale is the platform of choice for LLM batch inference.

LLM Batch Inference

Optimized Performance

Throughput-Optimized Workloads

We know how important throughput is for your large scale, offline batch inference jobs—and we’ve optimized Anyscale accordingly.

Advanced vLLM Optimizations

The Anyscale inference team consists of many of the leading committers to the vLLM project. Our experts can help tune engine performance to reduce costs.

Long-Context Use Cases

Our custom optimizations for prefix caching enable significant performance improvements on long-context use cases compared to vLLM.
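For reference, automatic prefix caching can be switched on in open-source vLLM with a single flag; Anyscale's long-context optimizations build on top of that baseline. The sketch below is illustrative only: the model name, file path, and prompts are assumptions.

```python
from vllm import LLM, SamplingParams

# Minimal sketch: automatic prefix caching in open-source vLLM.
# Model name, document path, and prompts are illustrative assumptions.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,  # reuse KV-cache entries for shared prompt prefixes
)

shared_context = open("long_document.txt").read()  # hypothetical long shared prefix
questions = ["Summarize the document.", "List the key risks mentioned."]

outputs = llm.generate(
    [shared_context + "\n\n" + q for q in questions],
    SamplingParams(temperature=0.0, max_tokens=256),
)
for out in outputs:
    print(out.outputs[0].text)
```

When many requests share a long prefix, the cached KV blocks for that prefix are computed once and reused, which is where the long-context speedups come from.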

Reduce Costs By Using GPUs and CPUs

Anyscale makes it easy to leverage heterogeneous compute. Use CPUs and GPUs in the same pipeline to increase utilization, fully saturate GPUs, and decrease costs.
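As a rough sketch of what such a heterogeneous pipeline can look like with Ray Data and vLLM (the bucket paths, model name, replica count, and batch size below are illustrative assumptions, not a prescribed configuration):

```python
import ray
from vllm import LLM, SamplingParams

def clean_prompts(batch):
    # CPU stage: lightweight text cleanup runs as Ray tasks on CPU workers,
    # so the GPU replicas stay saturated with generation work.
    batch["text"] = [t.strip() for t in batch["text"]]
    return batch

class Generate:
    def __init__(self):
        # One vLLM engine per GPU actor; the model name is an assumption.
        self.llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")
        self.params = SamplingParams(temperature=0.0, max_tokens=256)

    def __call__(self, batch):
        outputs = self.llm.generate(list(batch["text"]), self.params)
        batch["response"] = [o.outputs[0].text for o in outputs]
        return batch

ds = (
    ray.data.read_text("s3://my-bucket/prompts/")  # hypothetical input location
    .map_batches(clean_prompts)                    # CPU preprocessing stage
    .map_batches(Generate, concurrency=4, num_gpus=1, batch_size=64)  # GPU stage
)
ds.write_parquet("s3://my-bucket/generations/")    # hypothetical output location
```

Because the CPU and GPU stages are scheduled independently, preprocessing never blocks generation, which is what keeps the GPUs saturated.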

Looking for LLM Online Inference?

Check out our dedicated LLM online inference page to see how Anyscale supports LLM serving at scale.

Optimized LLM Batch Inference at Any Scale

| Capability | Amazon Bedrock | Apache Spark | Ray + vLLM | Anyscale |
| --- | --- | --- | --- | --- |
| Automated Throughput Tuner | N/A | - | - | ✓ |
| Support for Different GPUs/Accelerators | N/A | - | ✓ | ✓ |
| Support for Large Model Parallelism | ✓ | - | ✓ | ✓ |
| Spot Instance Support | N/A | ✓ | ✓ | ✓ |
| Accelerated Long-Context Inference | - | - | - | ✓ |
| Custom Optimized Kernels | N/A | - | - | ✓ |
| Multi-Modal Support | - | ✓ | ✓ | ✓ |


Best-Price Performance

We’ve optimized our inference engine so you don’t have to.

  • 6.1x cost savings compared to Amazon Bedrock
  • 90% cost savings on select instance types using spot instances and fault-tolerant continuous batching

Scale Your Datasets and Models

Anyscale supports tensor parallelism, data parallelism, and pipeline parallelism, so you can use any GPU and any model for your workload, including widely available, cost-efficient accelerators like the A10 and L4 and models from Llama-3.1-8B up to Llama-3.1-405B.
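As a minimal sketch of how model parallelism looks in open-source vLLM, assuming an 8-GPU node and an illustrative Llama 70B checkpoint (the parallelism degrees are assumptions you would tune to your hardware):

```python
from vllm import LLM

# Minimal sketch: sharding a large model across GPUs with vLLM.
# Model name and parallelism degrees are assumptions.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,              # split each layer across 4 GPUs
    pipeline_parallel_size=2,            # split the layer stack into 2 stages (8 GPUs total)
    distributed_executor_backend="ray",  # pipeline/multi-node setups typically use the Ray backend
)
```

Data parallelism then comes from running several such replicas at once, for example as separate map_batches actors in a Ray Data pipeline.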

Canva

“We have no ceiling on scale, and an incredible opportunity to bring AI features and value to our 170 million users.”

Greg Roodt
ML Lead, Canva

Out-of-the-Box Templates & App Accelerators

Jumpstart your development process with custom-made templates, only available on Anyscale.

Batch Inference with LLMs

Run LLM offline inference on large scale input data with Ray Data

End-to-End LLM Workflows

Execute end-to-end LLM workflows to develop and productionize LLMs at scale

FAQs

What advanced batch inference capabilities does Anyscale offer?

At Anyscale, we know how important it is to stay competitive in the AI space. That’s why we’re constantly updating and iterating on our product to make sure it’s the fastest, cheapest, and most performant option for AI/ML workloads. For offline batch inference, we’ve invested in a number of advanced capabilities to enhance your inference process, including the following (a brief illustrative sketch follows the list):

  • Cascade inference
  • FP8 support
  • Batch size tuning
  • Pipeline parallelism
  • Continuous batching
  • And more!
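As an illustration of two of these knobs in open-source vLLM (the model name and values below are assumptions, and continuous batching is vLLM's default scheduling behavior rather than something you switch on):

```python
from vllm import LLM, SamplingParams

# Illustrative sketch: FP8 weight quantization plus memory/batch tuning.
# Model name and values are assumptions; FP8 needs supporting hardware
# (typically Hopper- or Ada-class GPUs).
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    quantization="fp8",           # FP8 quantized weights
    gpu_memory_utilization=0.90,  # maximize KV-cache space while leaving headroom
    max_num_seqs=256,             # upper bound on concurrently batched sequences
)
outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```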

Offline Batch Inference at Scale

See why Anyscale is the best option for offline batch inference.