Unstructured Data Processing at Scale

The best way to do unstructured data processing. Any data format, any scale—only with Ray Data on Anyscale.

Unstructured Data Processing

Any Type of Data. Any Accelerator. Any Use Case.

Support for End-to-End Text & LLM Use Cases

Built for LLM online inference, batch inference, embedding generation, and synthetic data generation.

Reduce Costs When Processing Videos

Maximize compute utilization and leverage GPUs and CPUs to process videos of any size.

Image Processing at Scale

Scale image processing workloads by independently scaling CPU and GPU resources, delivering high throughput, lower costs, and improved utilization.

Enhanced Audio Processing

Process audio data without breaking the bank. Anyscale makes it easy to run a variety of audio use cases, including speech-to-text.

Best-In-Class Data Processing for ML/AI

| Feature | Apache Spark | Amazon SageMaker | Ray | Anyscale |
| --- | --- | --- | --- | --- |
| Text Support | ✓ | ✓ | ✓ | ✓ |
| Image Support | ✓ | - | ✓ | ✓ |
| Audio Support | Manual | - | Manual | ✓ |
| Video Support | Manual | - | Manual | ✓ |
| Video Support | - | - | Binary | Binary |
| Task-Specific CPU & GPU Allocation | - | - | ✓ | ✓ |
| Stateful Tasks | - | - | ✓ | ✓ |
| Native NumPy Support | - | - | ✓ | ✓ |
| Native Pandas Support | ✓ | - | ✓ | ✓ |
| Model Parallelism Support | - | - | ✓ | ✓ |
| Nested Task Parallelism | - | - | ✓ | ✓ |
| Fast Node Launching and Autoscaling | - | - | - | 60 sec |
| Fractional GPU Support | - | Limited | ✓ | ✓ |
| Load Datasets Larger Than Cluster Memory | - | - | ✓ | ✓ |
| Improved Observability | - | - | - | ✓ |
| Autoscale Workers to Zero | - | Limited | ✓ | ✓ |
| Job Queues | - | - | - | ✓ |
| Priority Scheduling | - | - | - | ✓ |
| Accelerated Execution | - | - | - | ✓ |
| Data Loading / Data Ingest / Last Mile Preprocessing | ✓ | ✓ | ✓ | ✓ |

How Amazon Saved $120 Million Per Year by Choosing Ray Over Spark

With Ray, Amazon could compact datasets 12X larger than it could with Apache Spark, improve cost efficiency by 91%, and process 13X more data per hour.

Maximize GPU and CPU Utilization

Leverage and parallelize CPUs and GPUs in the same pipeline to increase utilization and decrease costs.

  • Schedule fine-grained tasks in the same job across heterogeneous hardware, and parallelize each stage independently. 
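As a rough illustration of mixing CPU and GPU stages in one job, the sketch below chains a CPU preprocessing step and a model step with Ray Data, each with its own resource request and parallelism. The function and class names (`normalize`, `BrightnessModel`, `build_pipeline`), the brightness threshold, and the batch and pool sizes are assumptions made for this example, and the "model" is a simple NumPy stand-in rather than a real network.

```python
from typing import Dict

import numpy as np


def normalize(row: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
    """CPU stage: scale 8-bit pixel values into [0, 1]."""
    return {"image": row["data"].astype(np.float32) / 255.0}


class BrightnessModel:
    """Stand-in for a GPU model; a real pipeline would load weights in __init__."""

    def __init__(self) -> None:
        self.threshold = 0.5  # hypothetical decision threshold

    def __call__(self, batch: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
        # Average over height, width, and channels for every image in the batch.
        mean = batch["image"].mean(axis=(1, 2, 3))
        return {"score": mean, "bright": mean > self.threshold}


def build_pipeline(images: np.ndarray, use_gpu: bool = False):
    """Wire both stages into one lazily executed Ray Data pipeline."""
    import ray  # imported lazily so the stages above can be tried without Ray

    ds = ray.data.from_numpy(images)  # from_numpy stores the array in a "data" column
    # Stage 1: stateless CPU tasks, one CPU each.
    ds = ds.map(normalize, num_cpus=1)
    # Stage 2: a pool of two stateful workers, each optionally holding one GPU.
    ds = ds.map_batches(
        BrightnessModel,
        batch_size=32,
        concurrency=2,
        num_gpus=1 if use_gpu else None,
    )
    return ds
```

Calling `build_pipeline(np.zeros((64, 32, 32, 3), dtype=np.uint8)).take(2)` on a machine with Ray installed would run both stages. Because each stage carries its own resource request, the CPU stage and the model stage can be sized separately; `num_gpus` also accepts fractional values such as `0.5` when several workers should share one device.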

Best-Price Performance

Anyscale’s Ray Data consistently outperforms competitors:

  • [17x faster](https://www.anyscale.com/blog/offline-batch-inference-comparing-ray-apache-spark-and-sagemaker) than Amazon SageMaker
  • [2x faster](https://www.anyscale.com/blog/offline-batch-inference-comparing-ray-apache-spark-and-sagemaker) than Apache Spark
  • Up to 90% cost savings on select instance types when using spot instances

Beyond Data Processing

Don’t just process data—use it. Anyscale’s Ray Data slots in seamlessly with other Ray libraries like Ray Train and Ray Serve, so you can effortlessly deliver use cases for batch inference and training.

“We have no ceiling on scale, and an incredible opportunity to bring AI features and value to our 170 million users.”

Greg Roodt
ML Lead, Canva

Out-of-the-Box Templates & App Accelerators

Jumpstart your development process with custom-made templates, only available on Anyscale.

Batch Inference with LLMs

Run offline LLM inference on large-scale input data with Ray Data.

Computing Text Embeddings

Compute text embeddings with Ray Data and HuggingFace models.

Pre-Train Stable Diffusion

Pre-train a Stable Diffusion V2 model with Ray Train and Ray Data.
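To give a feel for the text-embedding use case listed above, here is a minimal sketch, not the Anyscale template itself. It assumes the `sentence-transformers` package and the `sentence-transformers/all-MiniLM-L6-v2` model from the Hugging Face Hub; the helper names (`chunk_text`, `to_chunk_rows`, `Embedder`, `embed_documents`), the chunk size, and the pool size are hypothetical choices for illustration.

```python
from typing import Any, Dict, List

# Assumed model choice; any sentence-transformers model name could be used here.
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"


def chunk_text(text: str, max_words: int = 128) -> List[str]:
    """Split a document into chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def to_chunk_rows(row: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn one document row into one row per chunk (used with flat_map)."""
    return [{"doc_id": row["doc_id"], "chunk": c} for c in chunk_text(row["text"])]


class Embedder:
    """Stateful worker: the model is loaded once per worker, not once per batch."""

    def __init__(self) -> None:
        from sentence_transformers import SentenceTransformer  # heavy import, done per worker

        self.model = SentenceTransformer(MODEL_NAME)

    def __call__(self, batch: Dict[str, Any]) -> Dict[str, Any]:
        # encode() returns one embedding vector per input string.
        batch["embedding"] = self.model.encode(list(batch["chunk"]))
        return batch


def embed_documents(docs: List[Dict[str, Any]]):
    """Build a Ray Data pipeline: documents -> chunks -> embeddings."""
    import ray  # imported lazily so the helpers above can be tried without Ray

    ds = ray.data.from_items(docs)
    ds = ds.flat_map(to_chunk_rows)
    ds = ds.map_batches(Embedder, batch_size=64, concurrency=2)
    return ds
```

With Ray and `sentence-transformers` installed, `embed_documents([{"doc_id": 0, "text": "some long document"}])` returns a dataset whose rows carry a `chunk` and its `embedding`. Passing `num_gpus=1` to `map_batches` would place each `Embedder` worker on a GPU.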

FAQs

Ray Data is an open-source machine learning library built on top of Ray, a best-in-class Pythonic distributed computing platform. Anyscale was founded by the creators of Ray and continues to build proprietary optimizations on top of open-source Ray to meet the challenges of the fast-paced AI world. With Anyscale’s proprietary Ray Data, you get access to additional advanced capabilities, including:

  • Faster job startup through incremental metadata fetching
  • Faster autoscaling
  • Improved observability and checkpointing
  • Fault tolerance support
  • Head node recovery
  • Spot instance support
  • Out-of-the-box data connectors with Snowflake and Databricks
  • Resumable jobs
  • And much more

Book a Demo

The Best Option for Data Processing At Scale

Get up to 90% cost reduction on unstructured data processing with Anyscale, the smartest place to run Ray.