Ray is used for some of the world’s largest data processing workloads. Whether it's AWS processing exabyte-scale data, Pinterest doing last-mile preprocessing at scale, or the processing of billions of images for Stable Diffusion pretraining, Ray and its data processing library Ray Data are paving the way for large unstructured data processing workloads.
We're working on improving performance and reliability for Ray as part of Anyscale.
In workloads that involve numerous small files, such as batch image inference, a significant portion of time typically goes to metadata retrieval and data access before the actual inference even begins.
Our new accelerated metadata fetching in Anyscale can reduce this start-up time by up to 4.5x compared to open-source Ray, measured on the 1 TiB test dataset of 128 MiB files. This leads to faster development cycles and more efficient use of compute resources.
We ran a benchmark using Ray Data from Anyscale RayTurbo and compared it to a Ray cluster setup using open-source Ray Data. Anyscale RayTurbo is an optimized version of Ray, only available on the Anyscale platform.
In this particular setup, we tested a workload that loaded a varying number of files onto the Ray cluster and measured the time it took for the runtime to start the actual workload.
In the results below, the Anyscale implementation sped up start-up time by up to 4.5x compared to the same code running on open-source Ray, while also supporting larger-scale datasets. End to end, the dataset read was up to 1.37x faster in the 1 TiB scenario.
In the experiment below, we use 128 MiB Parquet files in datasets of varying sizes. We run a simple workload where we read the data and iterate over it, similar to the training ingest phase of a distributed training workload. For each dataset size, we measure the time to first output as well as the total time the workload takes to run.
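For readers who want to take similar measurements on their own data, here is a minimal sketch of how the two numbers (time to first output and total time) could be captured. It is not the benchmark harness used for these results. The names `measure_ingest`, `IngestTiming`, and `ray_parquet_batches` are hypothetical helpers introduced for illustration; the only Ray calls assumed are `ray.data.read_parquet` and `Dataset.iter_batches`. The clock starts before the dataset is created, so metadata fetching counts toward time to first output.

```python
import time
from typing import Any, Callable, Iterable, NamedTuple


class IngestTiming(NamedTuple):
    time_to_first_output_s: float
    total_time_s: float
    num_batches: int


def measure_ingest(make_batches: Callable[[], Iterable[Any]]) -> IngestTiming:
    """Time a read-and-iterate workload.

    `make_batches` is called after the clock starts, so any setup work it
    does (for example, listing files and fetching metadata) is included in
    the time to first output.
    """
    start = time.perf_counter()
    first_output = None
    count = 0
    for _batch in make_batches():
        if first_output is None:
            first_output = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    if first_output is None:
        # Empty dataset: no batch was ever produced.
        first_output = total
    return IngestTiming(first_output, total, count)


def ray_parquet_batches(path: str) -> Iterable[Any]:
    """Hypothetical helper: read Parquet files with Ray Data and iterate."""
    import ray  # imported lazily so the timing helper works without Ray

    return ray.data.read_parquet(path).iter_batches()
```

A run against a dataset would then look like `measure_ingest(lambda: ray_parquet_batches("s3://my-bucket/my-dataset/"))`, where the path is a placeholder.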
Compared to open-source Ray, the Anyscale RayTurbo implementation returns the first output up to 4.5 times faster (in the 1 TiB scenario). End to end, Anyscale RayTurbo is up to 4.7 times faster in the 1 GiB scenario and up to 1.37 times faster in the 1 TiB scenario.
In the experiment below, we use 1 MiB Parquet files in datasets of varying sizes. We employ the same workload as above and take the same measurements: time to first output and time to finish the workload.
Compared to open-source Ray, the Anyscale RayTurbo implementation (labeled “Anyscale Ray”) returns the first output up to 8.5 times faster (in the 100 GiB scenario). Furthermore, in the 1 TiB dataset case, OSS Ray Data actually times out and fails after 5 minutes.
End to end, Anyscale RayTurbo is up to 6.3 times faster in the 1 GiB scenario and up to 7.7 times faster in the 100 GiB scenario when compared with open-source Ray.
As a disclaimer, the impact of this performance improvement will depend on the size of your dataset partitions and on the total number of files you have. In practice, the end-to-end improvement will also depend on the duration of your workload.
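To make the file-count effect concrete, the sketch below estimates how many files each benchmark scenario implies. It assumes every file is exactly its nominal size (128 MiB or 1 MiB), which real Parquet datasets only approximate, and the helper name `file_count` is hypothetical.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
TIB = 1024 ** 4


def file_count(dataset_bytes: int, file_bytes: int) -> int:
    """Number of files needed to hold the dataset (ceiling division)."""
    return -(-dataset_bytes // file_bytes)


# Dataset sizes mentioned in the results above.
for label, size in [("1 GiB", GIB), ("100 GiB", 100 * GIB), ("1 TiB", TIB)]:
    print(
        f"{label}: {file_count(size, 128 * MIB):,} files at 128 MiB, "
        f"{file_count(size, MIB):,} files at 1 MiB"
    )
```

Under that assumption, the 1 TiB dataset is about 8,192 files at 128 MiB but over a million files at 1 MiB, which is where per-file metadata work dominates start-up time.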
On Anyscale, these improvements are made available to all users via our platform with no additional configuration required. Users simply need to ensure they are using an updated version of Ray (2.37) to benefit from faster data processing.
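If you want to verify the Ray version programmatically, a small check along these lines could work. This is a sketch that assumes "2.37" means 2.37 or later; the helper names `major_minor` and `ray_is_recent_enough` are hypothetical. It compares numeric components rather than raw strings, since a string comparison would wrongly rank "2.9" above "2.37".

```python
import re
from importlib import metadata
from typing import Optional, Tuple

MIN_RAY: Tuple[int, int] = (2, 37)  # assumed minimum, taken from the text above


def major_minor(version: str) -> Tuple[int, int]:
    """Extract the (major, minor) components from a version string."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def ray_is_recent_enough(version: Optional[str] = None) -> bool:
    """Return True if the given (or installed) Ray version is >= MIN_RAY."""
    if version is None:
        # Raises importlib.metadata.PackageNotFoundError if Ray is not installed.
        version = metadata.version("ray")
    return major_minor(version) >= MIN_RAY
```

Calling `ray_is_recent_enough()` with no argument checks the installed package; passing a string checks that version instead.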