What is Ray Data?


Ray Data Explained

Ray Data is an open source, scalable data processing library for ML workloads. Built on top of Ray Core (also open source), Ray Data provides flexible and performant APIs for scaling data processing to large clusters. It leverages Ray Core’s task, actor, and object APIs to enable large-scale machine learning ingest, training, and inference, all within a single Python application.

Ray Data is built on Ray, so it easily scales on heterogeneous clusters, that is, clusters that mix different machine types such as CPU and GPU nodes. It also supports operations that require stateful setup and GPU acceleration. Because Ray Data uses streaming execution, it can process datasets of any size, making it easy to scale offline batch inference and to handle data preprocessing and ingest for ML training.

You can also use Ray Data to (a short code sketch follows this list): 

  • Load distributed data into Ray from any popular storage backend and file format.

  • Apply basic parallel data transformations such as map, batched map, and filter.

  • Run global operations such as sort, shuffle, groupby, and statistical aggregations.

  • Integrate seamlessly with other Ray-compatible data processing libraries (e.g., Spark, Pandas, NumPy, Dask, Mars, and others) and ML frameworks (e.g., TensorFlow, PyTorch, Horovod, and more).

  • Stream from datasets stored on local disk or the cloud.

  • Scale data processing with distributed in-memory and on-disk caching.

  • Scale out with CPU-only nodes alongside your GPU nodes.

  • Automatically recover from out-of-memory failures in your data preprocessing pipeline.
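
As a rough illustration of these APIs, here is a minimal sketch assuming a recent Ray 2.x release; the S3 path and the "score" and "category" column names are hypothetical placeholders:

import ray

# Load data from a (hypothetical) Parquet dataset in cloud storage.
ds = ray.data.read_parquet("s3://my-bucket/events/")

# Batched map: each batch arrives as a dict of NumPy arrays by default.
def add_double_score(batch):
    batch["double_score"] = batch["score"] * 2
    return batch

ds = ds.map_batches(add_double_score)

# Row-wise filter.
ds = ds.filter(lambda row: row["score"] > 0)

# Global operations: a sort and a groupby aggregation.
sorted_ds = ds.sort("score", descending=True)
counts = ds.groupby("category").count()
print(counts.take(5))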

LinkHow to Use Ray Data with Your Training Pipeline

Ray Data is not intended as a replacement for generic data processing systems like Spark. Rather, Ray Data is most effective when it serves as the last-mile bridge between ETL pipelines and distributed applications running on Ray or Anyscale.

This bridge becomes even more powerful if you use Ray-integrated DataFrame libraries during your data processing stage. By integrating these libraries, you can run a full data-to-ML pipeline on top of Ray, eliminating the need to materialize data to external storage as an intermediate step. Essentially, Ray Data forms the distributed data bridge between pipeline stages.
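
For instance, the output of a DataFrame-based processing stage can be handed to Ray Data in memory and streamed into downstream training or inference code. A minimal sketch, where the pandas DataFrame stands in for the output of an upstream ETL stage:

import pandas as pd
import ray

# Stand-in for the result of an upstream ETL / DataFrame processing stage.
etl_output = pd.DataFrame({"feature": [0.1, 0.5, 0.9], "label": [0, 1, 1]})

# Hand off to Ray Data without writing an intermediate copy to external storage.
ds = ray.data.from_pandas(etl_output)

# Stream batches straight into downstream training or inference code on Ray.
for batch in ds.iter_batches(batch_size=2):
    print(batch)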

4 Benefits of Ray Data

1. Faster and cheaper data processing for modern deep learning applications

Ray Data is specifically designed for deep learning applications that involve both CPU preprocessing and GPU inference. To maximize utilization and reduce costs, Ray Data streams working data from CPU preprocessing tasks directly to GPU inference or training tasks, allowing you to use both sets of resources concurrently. With Ray Data, your GPUs are no longer idle during CPU computation, reducing the overall cost of a batch inference job.
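
For example, a batch inference pipeline can keep CPUs busy preprocessing data while a pool of GPU actors runs the model. The sketch below is illustrative only: the image paths are placeholders, the "model" is a dummy function, and arguments such as concurrency and num_gpus may vary with your Ray version:

import numpy as np
import ray

ds = ray.data.read_images("s3://my-bucket/images/")  # placeholder path

# CPU preprocessing runs as stateless tasks.
def preprocess(batch):
    batch["image"] = batch["image"].astype(np.float32) / 255.0
    return batch

# GPU inference runs as stateful actors so the model is loaded once per worker.
class Predictor:
    def __init__(self):
        # Load your real model onto the GPU here; this dummy stands in for it.
        self.model = lambda images: images.mean(axis=(1, 2, 3))

    def __call__(self, batch):
        return {"prediction": self.model(batch["image"])}

predictions = (
    ds.map_batches(preprocess)  # CPU stage
      .map_batches(Predictor, concurrency=2, num_gpus=1, batch_size=64)  # GPU stage
)
predictions.write_parquet("s3://my-bucket/predictions/")  # placeholder output path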

2. Cloud, framework, and data format agnostic

Ray Data has no restrictions on cloud provider, ML framework, or data format.
You can start a Ray cluster on AWS, GCP, or Azure.
You can use any ML framework of your choice, including PyTorch, Hugging Face, or TensorFlow.
Ray Data also does not require a particular file format, and it supports a wide variety of formats including Parquet, images, JSON, text, and CSV.
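
As an illustration, different formats are read through the same Dataset abstraction (the bucket paths below are placeholders):

import ray

# Different read functions, one common Dataset abstraction downstream.
parquet_ds = ray.data.read_parquet("s3://my-bucket/tables/")
image_ds = ray.data.read_images("s3://my-bucket/images/")
json_ds = ray.data.read_json("s3://my-bucket/records/")
text_ds = ray.data.read_text("s3://my-bucket/logs.txt")
csv_ds = ray.data.read_csv("s3://my-bucket/metrics.csv")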

3. Out-of-the-box scaling on heterogeneous clusters

With Ray Data, code that works on one machine also runs on a large cluster without any changes, allowing you to easily scale to hundreds of nodes to process hundreds of TB of data.
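
Concretely, the same script runs unchanged locally or against a cluster; only how Ray is started differs. A minimal sketch (the job-submission command in the comment is one common way to run the script on an existing cluster):

import ray

# ray.init() with no arguments starts Ray locally. Submitted to an existing
# cluster (for example via `ray job submit -- python my_script.py`), the same
# code spreads its work across all of the cluster's nodes.
ray.init()

ds = ray.data.range(100_000_000)  # 100M rows, split into blocks
print(ds.map_batches(lambda batch: batch).count())  # executes in parallel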

4. Unified API and backend for batch inference and ML training

With Ray Data, you can express both batch inference and ML training jobs with the same Ray Dataset API.
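
In practice, the same Dataset can feed both workloads. A minimal sketch, assuming PyTorch is installed and with the path, column name, and "model" as placeholders:

import ray

ds = ray.data.read_parquet("s3://my-bucket/features/")

# Batch inference: apply a model across the dataset.
def predict(batch):
    batch["prediction"] = batch["feature"] * 0.5  # stand-in for a real model
    return batch

ds.map_batches(predict).write_parquet("s3://my-bucket/predictions/")

# ML training ingest: stream the same dataset into a training loop.
for batch in ds.iter_torch_batches(batch_size=128):
    pass  # training step would go here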

How To Get Started with the Ray Data Library

To get started with the Ray Data library, first make sure you’ve installed Ray Core. For more information on how to install Ray, refer to our Installing Ray documentation. 

Then, to install Ray Data, run:

pip install -U "ray[data]"
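
Once installed, a quick local sanity check could look like this:

import ray

# Build a small in-memory dataset and inspect a few rows to confirm the
# installation works end to end.
ds = ray.data.range(1000)
print(ds.take(5))
print(ds.count())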

For more resources, check out our Ray Data documentation or the Ray GitHub project.

Managed Ray with Anyscale

Ray Data is an open source library for scalable, distributed data processing in machine learning workloads. With Managed Ray on Anyscale, you get access to additional, advanced data processing benefits such as: 

  • Best price-performance 

  • Autoscaling

  • Resumable jobs

  • Streaming aggregation

  • Incremental metadata fetching
