Ray Spotlight: How we delivered Ray weekly releases

By Sam Chan   

LinkIntroduction

Innovations in AI with new models, accelerators, and techniques are moving at a breathtaking pace. To keep up, the Ray team at Anyscale recently moved to weekly releases to make sure fixes and enhancements are made available to our growing Ray Community even faster. 

This blog post explores the challenges and successes of this transition.

LinkThe Need for Change

Prior to implementing weekly releases, Ray followed a six-week release cycle. The three primary points of concern: 

  1. Flaky tests: Ray’s CI (read: pre-merge and post-merge tests) and release tests often failed in hard-to-reproduce manners which introduced noise into determining root stability and reliability issues that these very tests were trying to catch.

  2. Heavy Lift for Releases: The infrequent releases resulted in a significant rush to merge pull requests, causing stress and potential oversights.

  3. Customer Frustration: Users had to wait for the next release cycle to access new features or bug fixes, or rely on nightly builds.

To address these issues the team set out to increase release frequency.   

LinkOvercoming Instability

To do so, the team first had to tackle instability within Ray’s codebase. This involved:

  • Addressing Test Failures: Last year, the Ray team faced around 50-70 failing tests per week. Many of these were non-deterministic and flaky, requiring a comprehensive cleanup.

  • Systematic Updates: The team implemented fundamental updates to reduce the introduction of new issues, alongside a rigorous one-time cleanup of existing test failures.

  • Cultural Shift: Transitioning to a weekly release schedule necessitated a shift in company culture, encouraging more regular, manageable updates rather than periodic, high-stress release periods.

Driving and maintaining weekly release blockers to 0
Driving and maintaining weekly release blockers to 0

LinkResults and Impact

Over the past two months we’ve successfully delivered weekly releases to the Ray Community.  

Moreover, moving to weekly releases has: 

  • Enhanced Stability: The Ray team fixed approximately 350+ issues and flaky tests, resulting in a dramatically more reliable test pass rate on each weekly release.

    Some key areas of improvement:

    • Fixing tests for integration points with major Cloud and Accelerator providers (eg: making sure Ray launches reliably on AWS/GCP and works for alternative accelerator vendors like AMD).

    • Making sure batch inference at scale is tested and reliable, especially when accelerated by GPUs.

    • Fixes to our testing infra as well as the code itself for fundamental Ray Core components such as object reference counting, memory management, autoscaler, and scheduler.

    • Fixes to tests that stress the reliability for shuffling large scale data with Ray Data.

    • Clean up and fixes to flaky and failing tests on the Windows platform

Full analysis and issues closed can be found here: ray-test-bot-stability.csv

  • Streamlined Releases: Weekly releases have become routine, with minimal disruptions. Thanks to improved testing infrastructure, the team is able to catch performance regressions early and minimize delays.  

  • Improved Productivity: The reduced stress and regular release cadence have fostered a healthier work environment and allowed engineers to focus on continuous improvements.

Ray version
Release cadence in days (we’re now tracking ~7 for weekly releases)

LinkChallenges and Future Plans

Despite the benefits moving to weekly releases does present some challenges:

  • Frequent Upgrades: Users need to upgrade more often to stay current which can pose operational burdens.  For users of Anyscale’s managed Ray platform, this is simplified by ensuring new versions are tested and integrated seamlessly.

  • API Stability: As Ray matures, the team is focused on enhancing API stability and documentation to make adapting to changes easier for users.

To address these challenges, the Ray team is exploring the idea of a Long-Term Support (LTS) version to provide a stable foundation for users who prefer less frequent updates, alongside a managed upgrade service for smooth transitions.

The transition to weekly releases marks a major milestone for Ray and Anyscale, underscoring our dedication to stability and reliability at scale while still pushing enhancements and innovation at a pace keeping up with the cutting edge of distributed machine learning in the industry. 

As Ray evolves, our emphasis on robust testing, streamlined processes, and user-centric features will maintain its leadership in distributed computing technology.

Ready to try Anyscale?

Access Anyscale today to see how companies using Anyscale and Ray benefit from rapid time-to-market and faster iterations across the entire AI lifecycle.