Ray Data

An ML library for distributed, unstructured data processing. Anyscale supports and further optimizes Ray Data for improved performance, reliability, and scale.

  • 17x faster than AWS SageMaker for data preprocessing
  • 2x faster than Apache Spark for unstructured data preprocessing
  • Up to 4.5x faster at running read-intensive data workloads compared to open source Ray
  • Up to 6x cheaper for LLM batch inference compared to AWS Bedrock and OpenAI

What is Ray Data?

Ray Data is a scalable data processing library for ML and AI workloads.

With flexible and performant APIs for distributed data processing, Ray Data enables offline batch inference and data preprocessing and ingest for ML training. Built on top of Ray Core, it scales effectively to large clusters and offers scheduling support for both CPU and GPU resources. Ray Data also uses streaming execution to efficiently process large datasets and maintain high GPU utilization.
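
In code, a minimal sketch of that workflow might look like the following (the S3 paths, column name, and batch sizes are placeholders, not values from the product docs):

```python
import ray

# Read a (placeholder) Parquet dataset from cloud storage into a distributed Dataset.
ds = ray.data.read_parquet("s3://my-bucket/training-data/")

def normalize(batch):
    # CPU preprocessing step; batches stream through the pipeline in parallel.
    batch["features"] = (batch["features"] - batch["features"].mean()) / batch["features"].std()
    return batch

ds = ds.map_batches(normalize, batch_size=4096)

# Stream preprocessed batches into a training loop without materializing
# the full dataset in cluster memory.
for batch in ds.iter_torch_batches(batch_size=256):
    ...  # feed `batch` to a PyTorch training step
```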


Use Cases

Offline Batch Inference

Ray Data offers an efficient and scalable solution for batch inference, consistently outperforming competitors (a short usage sketch follows the list):

  • 17x faster than SageMaker Batch Transform
  • 2x faster than Spark for offline image classification
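
A hedged sketch of what batch inference looks like in code, assuming a Hugging Face text-classification pipeline and placeholder S3 paths (swap in your own model and data):

```python
import ray

ds = ray.data.read_json("s3://my-bucket/documents/")  # placeholder input path

class Classifier:
    def __init__(self):
        # The model is loaded once per worker actor and reused across batches.
        from transformers import pipeline
        self.pipe = pipeline("text-classification", device=0)

    def __call__(self, batch):
        preds = self.pipe(list(batch["text"]), truncation=True)
        batch["label"] = [p["label"] for p in preds]
        return batch

# Run inference on an autoscaling pool of GPU workers (Ray 2.x-style API).
preds = ds.map_batches(Classifier, batch_size=64, num_gpus=1, concurrency=4)
preds.write_parquet("s3://my-bucket/predictions/")  # placeholder output path
```

Because each actor loads the model once, the per-batch cost is dominated by inference rather than model setup.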

Benefits

Maximize GPU and CPU Utilization

Leverage CPUs and GPUs in the same pipeline to increase GPU utilization and decrease costs.
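
For example (an illustrative sketch; the paths, functions, and resource numbers are made up for this page), CPU-bound preprocessing and GPU-bound inference can be separate stages of one streaming pipeline, each with its own resource request:

```python
import ray

# Placeholder path; size= gives the decoded images a uniform shape.
ds = ray.data.read_images("s3://my-bucket/images/", size=(224, 224))

def preprocess(batch):
    # CPU-bound stage: normalize pixel values.
    batch["image"] = batch["image"].astype("float32") / 255.0
    return batch

class Model:
    def __call__(self, batch):
        # Stand-in for a real GPU model; returns one score per image.
        batch["score"] = batch["image"].mean(axis=(1, 2, 3))
        return batch

results = (
    ds.map_batches(preprocess, num_cpus=1, batch_size=256)           # CPU stage
      .map_batches(Model, num_gpus=1, concurrency=2, batch_size=64)  # GPU stage
)
```

Because execution is streaming, the GPU stage starts consuming batches as soon as the CPU stage produces them, rather than waiting for the full dataset.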

Built-in Fault Tolerance

Ray Data automatically recovers from out-of-memory failures and spot instance preemption.

One Unified API

Work with your favorite ML frameworks and libraries, just at scale. Ray Data supports the ML framework of your choice, from PyTorch to HuggingFace to TensorFlow and more.

Any Format, Any Data Type

Ray Data supports a wide variety of formats including Parquet, images, JSON, text, CSV, and more, as well as storage solutions like Databricks and Snowflake.
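
A few of the built-in readers, for illustration (the paths are placeholders):

```python
import ray

# Each call returns a distributed Dataset backed by the named format.
parquet_ds = ray.data.read_parquet("s3://bucket/tables/")
image_ds = ray.data.read_images("s3://bucket/images/")
json_ds = ray.data.read_json("s3://bucket/events/")
text_ds = ray.data.read_text("s3://bucket/corpus/")
csv_ds = ray.data.read_csv("s3://bucket/metrics.csv")
```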

Supercharge Ray Data with Anyscale

Best Price-Performance

Anyscale’s Ray Data consistently outperforms competitors, while also offering reduced costs and maximized compute utilization. Get faster downloading and better fault tolerance—all in one place.


Best-In-Class Data Processing for ML/AI

| Feature | Apache Spark | Amazon SageMaker | Ray | Anyscale |
| --- | --- | --- | --- | --- |
| Text Support | ✓ | ✓ | ✓ | ✓ |
| Image Support | ✓ | - | ✓ | ✓ |
| Audio Support | Manual | - | Manual | ✓ |
| Video Support | Manual | - | Manual | ✓ |
| Video Support | - | - | Binary | Binary |
| Task-Specific CPU & GPU Allocation | - | - | ✓ | ✓ |
| Stateful Tasks | - | - | ✓ | ✓ |
| Native NumPy Support | - | - | ✓ | ✓ |
| Native Pandas Support | ✓ | - | ✓ | ✓ |
| Model Parallelism Support | - | - | ✓ | ✓ |
| Nested Task Parallelism | - | - | ✓ | ✓ |
| Fast Node Launching and Autoscaling | - | - | - | 60 sec |
| Fractional GPU Support | - | Limited | ✓ | ✓ |
| Load Datasets Larger Than Cluster Memory | - | - | ✓ | ✓ |
| Improved Observability | - | - | - | ✓ |
| Autoscale Workers to Zero | - | Limited | ✓ | ✓ |
| Job Queues | - | - | - | ✓ |
| Priority Scheduling | - | - | - | ✓ |
| Accelerated Execution | - | - | - | ✓ |
| Data Loading / Data Ingest / Last Mile Preprocessing | ✓ | ✓ | ✓ | ✓ |

Out-of-the-Box Templates & App Accelerators

Jumpstart your development process with custom-made templates, only available on Anyscale.

Batch Inference with LLMs

Run offline LLM inference on large-scale input data with Ray Data.

Computing Text Embeddings

Compute text embeddings with Ray Data and HuggingFace models.

Pre-Train Stable Diffusion

Pre-train a Stable Diffusion V2 model with Ray Train and Ray Data.


“We have no ceiling on scale, and an incredible opportunity to bring AI features and value to our 170 million users.”

Greg Roodt
ML Lead, Canva


The Best Option for Data Processing At Scale

Get up to 90% cost reduction on unstructured data processing with Anyscale, the smartest place to run Ray.