Announcing RayTurbo

By Akshay Malik, Praveen Gorthy and Richard Liaw   

We’re excited to announce Anyscale RayTurbo, an optimized Ray runtime available exclusively on the Anyscale Platform.


RayTurbo aims to provide the best price-performance and developer capabilities for AI workloads compared with other solutions, including running open-source Ray. Among other improvements, RayTurbo includes optimizations that:

  1. Reduce runtime duration of read-intensive data workloads by up to 4.5x compared to open source Ray on certain workloads

  2. Accelerate end-to-end scale-up time for Llama-3-70B by up to 4.5x compared to open-source Ray on certain workloads

  3. Reduce LLM batch inference costs by up to 6x compared to repurposed online inference providers such as AWS Bedrock and OpenAI

RayTurbo focuses on four broad workloads in the AI development lifecycle, each of which has different characteristics and requires tailored optimizations.

Data processing workloads: Modern AI workloads tend to interact with significant amounts of unstructured data. Similar to other data processing systems, these workloads tend to stress the underlying system’s memory management, fault tolerance, and scalability capabilities. However, unlike other data processing systems, AI workloads tend to require high utilization of both CPU and GPU resources and heavily interact with numerical data types like arrays and tensors.

Ray Data, a distributed data system built with mixed GPU and CPU support, is now one of the most widely adopted solutions across the industry for unstructured data processing. In RayTurbo, we’ve focused on providing optimizations to Ray Data that improve both the performance and the production reliability of these workloads in comparison to the open source solution.
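As a rough illustration of the kind of mixed CPU/GPU pipeline Ray Data is built for, here is a minimal sketch using open-source Ray Data APIs; the S3 paths, batch sizes, and the toy model are placeholders, not a RayTurbo-specific API.

```python
import numpy as np
import ray

# Stream unstructured image data (placeholder S3 path) through a CPU
# preprocessing stage and then a GPU inference stage.
ds = ray.data.read_images("s3://my-bucket/images/")  # hypothetical path

def preprocess(batch: dict) -> dict:
    # CPU-bound step: normalize pixel values to [0, 1].
    batch["image"] = batch["image"].astype(np.float32) / 255.0
    return batch

class Classifier:
    def __init__(self):
        import torch
        # Placeholder model; a real pipeline would load trained weights here.
        self.device = "cuda"
        self.model = torch.nn.Identity().to(self.device)

    def __call__(self, batch: dict) -> dict:
        import torch
        images = torch.as_tensor(batch["image"], device=self.device)
        with torch.no_grad():
            batch["pred"] = self.model(images).cpu().numpy()
        return batch

preds = (
    ds.map_batches(preprocess, batch_size=64)  # runs on CPU workers
    .map_batches(Classifier, concurrency=4, num_gpus=1, batch_size=64)  # GPU actor pool
)
preds.write_parquet("s3://my-bucket/predictions/")  # hypothetical output path
```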

Training workloads: Anyscale training infrastructure, centered around Ray Train, targets users who are starting to leverage AI and are looking to train models on their own customer data. For these users, developer experience and iteration speed are most critical to improving model quality and delivering business value.

On Anyscale, we’ve focused heavily on simplifying the experience around distributed training, with the goal of making ML developers more productive. Specifically in RayTurbo, we’ve focused on providing features that improve the price-performance ratio, production monitoring, and developer experience around distributed training.

Serving workloads: For serving, Anyscale enables developers to build end-to-end ML applications with high performance, high hardware utilization, and complete production reliability. Built on Ray Serve, Anyscale offers flexible, simple APIs for turning complex ML applications into production services with ease. Anyscale’s serving solution offers full flexibility over the choice of hardware, cloud, modeling frameworks, and inference engines.

RayTurbo comes with improved production readiness and developer experience, along with performance optimizations for large-scale workloads and cost savings through replica compaction and spot support.

LLM workloads: RayTurbo also offers a suite of features for LLMs that span the entire lifecycle of the model. These features include cost-optimized batch inference, efficient fine-tuning, and optimized model serving. RayTurbo’s LLM features enable LLM users to seamlessly develop LLM workflows and integrate them into the rest of their ML processes.

Workload Optimizations

Below we detail the various optimizations and features for each individual workload.

Data

- Accelerated Metadata Fetching: Optimizations that accelerate read-intensive data workloads. On certain workloads, the measured speedup is up to 4.5x compared to open-source Ray.

- Resumable Jobs: Ray Data jobs can be checkpointed, stopped, and resumed without rerunning the entire pipeline from scratch. This is most useful when a cluster head node fails or when user code needs to change mid-run.

- Streaming Aggregation: Lets users implement key-based aggregation steps in a streaming fashion, which is especially useful for video batch inference (a sketch of a key-based aggregation follows this list).

- Improved Autoscaling: Clusters and actor pools autoscale, so users don’t need to wait for an entire large cluster to launch before kicking off a job, and jobs can scale down and keep running through node preemption.

- Audio and Video Readers: Purpose-built connector modules for efficiently loading and decoding video and audio data.
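The streaming implementation itself is a RayTurbo capability, but the shape of a key-based aggregation step can be sketched with open-source Ray Data’s groupby API; the column names and toy records below are illustrative only.

```python
import numpy as np
import ray

# Toy records standing in for per-frame inference results keyed by video_id.
ds = ray.data.from_items(
    [{"video_id": i % 3, "score": float(i)} for i in range(30)]
)

def summarize(group: dict) -> dict:
    # Aggregate all rows that share the same video_id.
    return {
        "video_id": np.array([group["video_id"][0]]),
        "mean_score": np.array([group["score"].mean()]),
        "num_frames": np.array([len(group["score"])]),
    }

per_video = ds.groupby("video_id").map_groups(summarize, batch_format="numpy")
print(per_video.take_all())
```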

Benchmarks

[Benchmark chart: Ray Data read-intensive workload performance, RayTurbo vs. open-source Ray]

Training

- Distributed Elastic Training: Training continues even under hardware failure, so you can run training workloads on spot instances with minimal interruption and significantly reduce training costs (see the fault-tolerance sketch after this list).

- Improved Training Observability: A purpose-built dashboard designed to streamline debugging of Ray Train workloads, giving users deeper insight into individual workers’ progress, pinpointing stragglers, and eliminating bottlenecks for faster, more efficient training.
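The elastic behavior is the RayTurbo feature, but as a point of reference, here is a minimal sketch of how a fault-tolerant run is configured with open-source Ray Train APIs; the training-loop body, worker count, and retry budget are placeholders.

```python
import ray.train
from ray.train import FailureConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # Placeholder loop: a real loop would prepare the model and dataloader
    # with ray.train.torch helpers, and report checkpoints for recovery.
    for epoch in range(config["num_epochs"]):
        ray.train.report({"epoch": epoch})

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"num_epochs": 3},
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
    # Restart the run on worker or node failure (e.g. spot preemption),
    # resuming from the latest reported checkpoint when one exists.
    run_config=RunConfig(failure_config=FailureConfig(max_failures=3)),
)
result = trainer.fit()
```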


Serving

- Fast Autoscaling and Model Loading: Anyscale’s fast model loading and startup-time optimizations improve autoscaling and cluster startup. In certain experiments, the end-to-end scaling time for Llama-3-70B is 5.1x faster on Anyscale than on open-source Ray (see the autoscaling sketch after this list).

- High QPS Serving: In the next release, RayTurbo will provide an optimized version of Ray Serve that achieves up to 54% higher QPS and up to 3x more streaming tokens per second for high-traffic serving use cases.

- Replica Compaction: Migrates replicas onto fewer nodes where possible to reduce resource fragmentation and improve hardware utilization.

- Zero-Downtime Incremental Rollouts: Perform incremental rollouts and canary upgrades for robust production service management. Unlike KubeRay and open-source Ray Serve, RayTurbo on Anyscale lets you perform upgrades confidently, with rollback procedures and without requiring 2x the hardware capacity.

- Observability: Custom metrics dashboards, log search, tracing, and alerting for comprehensive observability into your production services. RayTurbo can also export logs, metrics, and traces to the observability tooling of your choice, such as Datadog.

- Multi-AZ Services: Availability-zone-aware scheduling of Ray Serve replicas provides higher redundancy against availability-zone failures.

- Containerized Runtime Environments: Configure different container images for different Ray Serve deployments, so dependencies can be prepared separately per model. This includes the fast container optimizations from fast autoscaling, as well as an improved security posture over open-source Ray Serve, since it does not require installing podman or running with root permissions.
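For context, here is a minimal sketch of an autoscaling deployment using open-source Ray Serve APIs (field names follow recent releases); the replica bounds, target load, and toy model are placeholders. The fast model loading and startup-time optimizations described above are what RayTurbo layers on top of this to make each scale-up event quicker.

```python
from ray import serve
from starlette.requests import Request

@serve.deployment(
    ray_actor_options={"num_gpus": 1},
    autoscaling_config={
        # Placeholder bounds: scale between 1 and 8 GPU replicas based on
        # how many requests each replica is currently handling.
        "min_replicas": 1,
        "max_replicas": 8,
        "target_ongoing_requests": 4,
    },
)
class TextGenerator:
    def __init__(self):
        # Placeholder: model weights would be loaded here; fast model loading
        # shortens exactly this cold-start path when new replicas launch.
        self.generate = lambda prompt: f"echo: {prompt}"

    async def __call__(self, request: Request) -> str:
        payload = await request.json()
        return self.generate(payload["prompt"])

app = TextGenerator.bind()
# serve.run(app)  # deploy locally, or reference `app` from a Serve config file
```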

Benchmarks

[Benchmark charts: end-to-end scaling time, max QPS, and streaming tokens per second]

Max QPS benchmarked on an m5.8xlarge instance with one Serve deployment replica; streaming tokens per second benchmarked with two Serve deployments using model composition on one m5.8xlarge (128 tokens per request at a 10 ms interval).

LLM Suite

The LLM Suite in RayTurbo consists of three major components, integrated seamlessly through our model registry and datasets features to cover the end-to-end LLM development lifecycle:

Fine-tuning (LLMForge)

LLMForge is one of the most comprehensive LLM refinement libraries available, with an extensive breadth of fine-tuning techniques.

- Different Fine-Tuning Techniques: Supports both full-parameter and parameter-efficient fine-tuning to trade off training time against model quality (see the sketch after this list).

- Flexible Task Support: Supports a range of fine-tuning tasks to maximize model accuracy for custom applications: causal language modeling, instruction tuning, classification, preference tuning, continued pre-training, distillation, and speculative model training.

- Flexible Model Support: Supports transformer-based Hugging Face models and data formats, with flexibility in configuring learning hyperparameters, hardware, and performance.

- High Performance: State-of-the-art performance features such as gradient checkpointing, FlashAttention-2, mixed-precision training, and DeepSpeed support. More performance optimizations at both the system and model level are coming soon.
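LLMForge’s own configuration format isn’t shown here, so as a generic illustration of the full-parameter vs. parameter-efficient tradeoff, the sketch below wraps a Hugging Face model with a LoRA adapter using the peft library; the model name, target modules, and rank are placeholder choices.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder base model. Full-parameter fine-tuning would update all of its
# weights; LoRA instead trains small low-rank adapter matrices.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_config = LoraConfig(
    r=16,                                 # adapter rank: quality vs. cost knob
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of parameters
```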

Batch Offline Inference (RayLLM-Batch)

RayLLM-Batch is a library for optimizing and executing batch LLM inference pipelines at scale. Cost improvements can be up to 6x compared to repurposed online inference providers such as AWS Bedrock and OpenAI, without requiring high-end hardware like A100s or H100s.
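RayLLM-Batch’s own API isn’t reproduced here; as a hedged sketch of the underlying pattern, the example below streams prompts from Ray Data through a pool of vLLM engine replicas. The model name, S3 paths, and parallelism settings are placeholders.

```python
import numpy as np
import ray
from vllm import LLM, SamplingParams

class VLLMPredictor:
    def __init__(self):
        # Placeholder model; each actor in the pool pins one GPU.
        self.llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
        self.params = SamplingParams(temperature=0.0, max_tokens=256)

    def __call__(self, batch: dict) -> dict:
        outputs = self.llm.generate(list(batch["prompt"]), self.params)
        batch["response"] = np.array([o.outputs[0].text for o in outputs])
        return batch

ds = ray.data.read_parquet("s3://my-bucket/prompts/")  # hypothetical input
results = ds.map_batches(
    VLLMPredictor,
    concurrency=4,   # four engine replicas
    num_gpus=1,      # one GPU per replica
    batch_size=64,
)
results.write_parquet("s3://my-bucket/responses/")  # hypothetical output
```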

- Fault Tolerance: Workers can resume across spot instance preemptions, taking advantage of spot instance cost savings.

- Batch LLM Inference Optimizer: [In private preview] Automatically tunes and configures the underlying stack to reduce costs based on workload characteristics and available hardware.

Benchmarks

[Benchmark charts: batch LLM inference cost compared to repurposed online inference providers]

Online Inference (RayLLM)

RayLLM offers high-performance, fully configurable online serving for any open-source large language model, as well as multi-modal models.

- Multi-LoRA: Efficient loading of fine-tuned LoRA adapters. Leverages model multiplexing in Ray Serve to minimize LoRA loading overhead, along with vLLM’s multi-LoRA inference support, for highly efficient dynamic LoRA serving (see the sketch after this list).

- JSON Mode: Enables JSON-formatted responses with schema validation, which is useful when integrating the LLM with systems that expect reliably parsable output. It can also be used to implement tool calling.

- OpenAI APIs: Exposes OpenAI-compatible APIs, allowing easy migration from closed-source models and compatibility with other LLM development frameworks. Multiple models can be deployed and scaled individually behind the same service endpoint.

- Performance Tuning: Knobs to configure the price-performance tradeoff of deployed models, with features such as tensor parallelism, speculative decoding, and prefix caching. Auto-performance tuning is coming soon.
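The Multi-LoRA entry above references model multiplexing in open-source Ray Serve; here is a minimal sketch of that mechanism. The adapter-loading body is a placeholder rather than RayLLM’s actual implementation, which would hand the adapter to vLLM’s multi-LoRA support.

```python
from ray import serve
from starlette.requests import Request

@serve.deployment(ray_actor_options={"num_gpus": 1})
class MultiLoRADeployment:
    @serve.multiplexed(max_num_models_per_replica=8)
    async def get_adapter(self, lora_id: str):
        # Placeholder load step: a real implementation would fetch the LoRA
        # weights and register them with the inference engine.
        return lambda prompt: f"[{lora_id}] {prompt}"

    async def __call__(self, request: Request) -> str:
        # Requests carrying the same multiplexed model id are routed to
        # replicas that already have that adapter loaded.
        lora_id = serve.get_multiplexed_model_id()
        adapter = await self.get_adapter(lora_id)
        payload = await request.json()
        return adapter(payload["prompt"])

app = MultiLoRADeployment.bind()
```

In open-source Ray Serve, clients select an adapter by setting the serve_multiplexed_model_id request header, which the proxy uses to route each request to a replica where that adapter is already resident.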

Conclusion

RayTurbo aims to provide the best price-performance and developer capabilities for AI workloads compared with other solutions, including running open-source Ray. RayTurbo is available exclusively on Anyscale.

Get started on the Anyscale platform for free today at https://www.anyscale.com/.

Ready to try Anyscale?

Access Anyscale today to see how companies using Anyscale and Ray benefit from rapid time-to-market and faster iterations across the entire AI lifecycle.