Case Study
With Anyscale as their AI platform, Runway built and launched Gen-3 Alpha, their most advanced model to date, in the summer of 2024. Learn how.
Faster model loading
Data pipeline development and deployment time cut from a week to just a day
Runway engineers use Anyscale
Runway’s mission is as simple as it is ambitious: revolutionize media creation by giving creatives tools to generate and edit video content like never before. With their groundbreaking Gen-3 Alpha video model, they’re enabling the next era of art, entertainment, and human creativity powered by AI.
Runway’s Gen-3 Alpha is the first in a new generation of foundation models for media generation. Regularly praised for its combination of speed and quality, it can generate videos in under a minute from a variety of inputs, including text and images.
But to create a model as good as Gen-3 Alpha, the Runway engineering team needed to curate and process massive quantities of video data. Video and other multimodal data can be orders of magnitude larger than traditional structured data, which introduces unique challenges for data processing pipelines. To create and train Gen-3 Alpha, the Runway team needed internal infrastructure designed specifically for large-scale multimodal training. Spoiler alert: Anyscale is at the heart of that infrastructure.
We sat down with Cindy Wang, Staff ML Engineer, and Matt Kafonek, Senior Software Engineer, from Runway’s engineering team to learn more about their journey building Runway’s AI infrastructure.
Having already launched Gen-1 (2022) and Gen-2 (2023), Runway's team knew Gen-3 Alpha would require unprecedented quantities of data and data processing capabilities.
Historically, when Runway researchers needed to run a distributed computing job, they would run data preprocessing as a one-off script, manually shard their datasets, and then run the job using a mixture of SLURM and home-grown tools. But scaling laws meant their next models, Gen-3 Alpha in particular, would need orders of magnitude more data. As the team turned to training their biggest, most complex model yet, they knew they’d need to revamp their infrastructure to enable that scale.
The team needed infrastructure that could:
Scale compute effortlessly – especially on spot instances
Coordinate heterogeneous compute across mixed CPU and GPU resources
Handle and recover from failures without significant overhead
Provide a familiar, Python-native interface for developers and researchers
Provide the flexibility to quickly adopt the latest open-source models and research techniques
Remain cloud-agnostic
The Runway team immediately turned to Ray, which checked all of their boxes.
“Using Ray was a really straightforward decision. It’s hard to even compare Ray to anything – it’s a bit of a default at this point for this type of thing.” – Cindy Wang, Staff ML Engineer @ Runway
Their first step was to test Ray at a small scale. Thanks to Ray’s flexibility to run in a variety of environments, from Kubernetes to cloud VMs to a laptop, the team was able to set up a demo in one day – which solidified their decision to adopt Ray at a larger scale.
“We chose Ray because it worked – it’s Python native, open source, and has a familiar syntax that our engineering base could prototype with at a small scale within a day. Seeing results so quickly in the prototype stage made it easy to decide to scale to our full dataset.” – Matt Kafonek, Senior Software Engineer @ Runway
Runway joined the many teams running Ray on Kubernetes. To scale Ray past the testing phase, Cindy turned to KubeRay, the open source Kubernetes operator for managing Ray – a seemingly perfect fit for what the Runway engineering team needed. With KubeRay, researchers could all submit jobs to a dedicated Ray cluster and start running large-scale distributed computing jobs. But as Runway rolled KubeRay out to more researchers, they quickly hit its limitations.
“KubeRay worked well until there were four or five people trying to run jobs at the same time – then it quickly became unmanageable. It got to a point where we considered hiring someone whose full time job would be to manage the KubeRay infrastructure.” – Cindy Wang, Staff ML Engineer @ Runway
The issue was observability – it was nearly impossible for researchers to understand whether their jobs were scaling as expected. And, when jobs weren’t scaling as expected, it was difficult to figure out why. After all, jobs can fail for a number of reasons, whether it’s:
An infrastructure problem at the hardware or cloud provider level, including common issues like GPU failures
An issue or bug in the application logic
A performance issue related to the application architecture
A bug in a third-party library or other dependency
Resource contention due to multitenancy
“We ended up having one cluster that four or five different people were submitting jobs to – which obviously doesn't work well at scale. Our researchers would accidentally set up their resources incorrectly and accidentally mess up someone else’s job – so fundamentally, KubeRay wasn’t working well for a multitenant situation and we needed far better observability tooling.” – Cindy Wang, Staff ML Engineer @ Runway
To keep using KubeRay, Cindy knew the Runway team would need to dedicate at least one full-time engineer to managing the system. Simply put, that wasn’t something the small team could afford.
Given the challenges with KubeRay, the Runway team turned to Anyscale. Built by the creators of Ray, Anyscale is the Unified AI Platform, offering an advanced version of Ray (RayTurbo) as well as enterprise-level observability and governance capabilities along with an easy-to-use developer environment.
“The fact that we don’t have to dedicate a person to make all of the plumbing and infrastructure work has been really valuable.” – Cindy Wang, Staff ML Engineer @ Runway
And unlike open source Ray, Anyscale isn’t just an AI compute engine – it’s a fully formed AI platform built for developers by developers, with features like:
Workspaces and templates to help developers get started out of the box
Critical integrations with key ML tools
Log observability integrations with Grafana
Seamless cluster management with ephemeral clusters
Now, Runway’s researchers can write scripts, run jobs, and debug code – all in one place.
“Switching to Anyscale made Ray far easier to use. Now, researchers can just submit jobs and look at the dashboard in Anyscale without having to think about cluster management. It works out of the box.” – Cindy Wang, Staff ML Engineer @ Runway
It’s not only better for the researchers, but for the platform team as well. Anyscale offers a variety of out of the box performance and cost optimizations, including autoscaling and fault tolerance. Its best-in-class multimodal data processing flexibly accommodates highly complex and varied workloads, especially those that require mixed CPU and GPU compute.
“It’s very easy for me to take action with a glance at the Anyscale dashboard – whereas before I’d have to look at all of the pods that were running and try to mentally calculate whether or not that mapped to what should be happening.” – Matt Kafonek, Senior Software Engineer @ Runway
Observability like this isn’t just a nice-to-have – it drives meaningful cost savings and offers critical reliability. With Anyscale, Runway achieved:
Rapid autoscaling and instance startup (clusters launch and scale in under a minute)
Reliable spot instance support
Rapid data ingest
Automatic job retries and improved job monitoring
All of this reduces infrastructure risk as Runway scales and gives the Runway team more time to create amazing models and tools rather than develop and maintain infrastructure.
The result: with Anyscale as their AI platform, Runway built and launched Gen-3 Alpha, their most advanced model to date, in the summer of 2024.
In addition to using the Anyscale platform for its infrastructure, governance, and observability support, Runway also deploys some of Anyscale’s optimized Ray libraries, including Ray Serve, which Runway uses to serve preprocessing models during training. Preprocessing data before it is fed into training is critical for rapid iteration, but because the models are so large, colocating preprocessing on the same GPUs as training can consume too much GPU memory. With Ray Serve, Runway offloads last-mile preprocessing to a separate, dedicated pool of compute resources during training.
No tool exists in a vacuum – and AI developers know that better than anyone. Anyscale and Ray plug in to Runway’s existing infrastructure to support and scale their compute workloads.
In particular, it was critical to be able to integrate Anyscale with Kubernetes, since Runway has such a robust Kubernetes cluster management system. Anyscale partnered with Runway to launch the Anyscale Operator for Kubernetes and Anyscale on GKE, making it easier than ever to run Anyscale wherever you compute.
“The new offering for Anyscale on GKE is a gamechanger for us. It’s particularly valuable because, in a Kubernetes environment, we can still provision access and lock it down and maintain security.” – Matt Kafonek, Senior Software Engineer @ Runway
Gen-3 Alpha may be the world’s most advanced video generation model yet – but Runway isn’t done changing the game. They recently announced a first-of-its-kind partnership with Lionsgate to continue shaping the future of filmmaking and content creation.
Additionally, their research progress has accelerated – Cindy shares that the next step is improving data quality: running more preprocessing jobs to curate higher-quality datasets and produce even better models.
“Anyscale enables us to push the boundaries of what’s possible in generative AI by giving us the flexibility to scale workloads seamlessly. This removes the risk around our infrastructure and allows our team to focus on innovation rather than infrastructure bottlenecks.”
Anastasis Germanidis
Co-Founder & CTO @ Runway