
Serve Models At Scale

There are four common patterns of machine learning in production: pipeline, ensemble, business logic, and online learning. Implementing these patterns typically involves a tradeoff between ease of development and production readiness. Web frameworks are simple and work out of the box, but they serve one prediction at a time and cannot deliver performance at scale. Custom tooling glues systems together but is hard to develop, deploy, and manage. Specialized systems are great at serving ML models, but they are less flexible, harder to use, and can be costly.

Anyscale helps you go beyond existing model serving limitations with Ray and Ray Serve, which offers scalable, efficient, composable, and flexible serving. Ray Serve provides:

  • A better developer experience and abstraction
  • The ability to flexibly compose multiple models and scale them independently
  • Built-in request batching to help you meet your performance objectives
  • Resource management (CPUs, GPUs) that lets you specify fractional resource requirements, as sketched below
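As a minimal sketch of what this looks like in code (using Ray Serve's documented `@serve.deployment` and `@serve.batch` APIs; the `SentimentModel` class and its toy logic are hypothetical):

```python
# Minimal sketch: a Ray Serve deployment with built-in request batching
# and a fractional CPU reservation. The model logic is a placeholder.
from ray import serve


@serve.deployment(ray_actor_options={"num_cpus": 0.5})  # fractional resources
class SentimentModel:
    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.1)
    async def predict_batch(self, texts: list[str]) -> list[str]:
        # Serve gathers concurrent requests into one list so the model
        # can run a single vectorized forward pass.
        return ["positive" if "good" in t else "negative" for t in texts]

    async def __call__(self, request) -> str:
        text = (await request.json())["text"]
        # Callers pass a single item; @serve.batch handles the grouping.
        return await self.predict_batch(text)


serve.run(SentimentModel.bind())
```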

Sign up for Anyscale access


Iterate & Move to Production Fast With Ray Serve & Anyscale

Develop on your laptop, then scale the same Python code elastically across hundreds of nodes or GPUs on any cloud, with no changes. Go beyond ML model serving limitations with Ray and Ray Serve.

Ray Serve Is the Best, Most Scalable Way to Do Model Serving

Train, test, deploy, serve, and monitor machine learning models quickly and efficiently with Ray and Anyscale.


Flexible Environment

Effective machine learning serving frameworks need to be open to meet different demands. Ray Serve lets you bring your own Docker image, is multi-framework (e.g., TensorFlow, PyTorch, scikit-learn, XGBoost), supports a different runtime environment per task and actor, and allows different framework versions to run on each task and actor.
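For example, a sketch using Ray's documented `runtime_env` option (the pinned package versions here are illustrative):

```python
from ray import serve


# Each deployment declares its own runtime environment, so two models can
# pin different framework versions side by side in the same cluster.
@serve.deployment(ray_actor_options={"runtime_env": {"pip": ["torch==2.1.0"]}})
class TorchModel:
    async def __call__(self, request) -> dict:
        import torch  # resolved inside this replica's own environment
        return {"torch_version": torch.__version__}


@serve.deployment(ray_actor_options={"runtime_env": {"pip": ["xgboost==1.7.6"]}})
class XGBoostModel:
    async def __call__(self, request) -> dict:
        import xgboost  # a different dependency set, same cluster
        return {"xgboost_version": xgboost.__version__}
```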


Optimize Developer Productivity & Resource Management

By building on top of Ray, Ray Serve is horizontally scalable, lightning fast, and efficient, allowing fractional and fine-grained resource allocation.
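A sketch of that fine-grained allocation (using the documented `ray_actor_options` and `autoscaling_config` fields; the `Embedder` deployment is a placeholder):

```python
from ray import serve


# Four replicas of this deployment can share one GPU, and Serve scales
# the replica count up and down with traffic.
@serve.deployment(
    ray_actor_options={"num_gpus": 0.25},  # fractional GPU per replica
    autoscaling_config={"min_replicas": 1, "max_replicas": 8},
)
class Embedder:
    async def __call__(self, request) -> dict:
        text = (await request.json())["text"]
        return {"input_chars": len(text)}  # placeholder for real inference
```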


Author Complex Inference Pipelines

Chaining, parallelization, ensembling, and dynamic dispatch patterns can all be expressed in plain Python code. Test locally and deploy to production with no code changes, with a different runtime environment per task and actor. Clearly define the separation and boundaries between code and deployments.
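For instance, a two-stage chain can be written with deployment handles (a sketch of Ray Serve's model-composition pattern; the preprocessing and classification steps are placeholders):

```python
from ray import serve


@serve.deployment
class Preprocessor:
    def transform(self, text: str) -> str:
        return text.strip().lower()


@serve.deployment
class Classifier:
    def predict(self, text: str) -> str:
        return "positive" if "good" in text else "negative"


@serve.deployment
class Pipeline:
    def __init__(self, preprocessor, classifier):
        # Plain Python composition: handles to other deployments,
        # each of which scales independently.
        self.preprocessor = preprocessor
        self.classifier = classifier

    async def __call__(self, request) -> str:
        text = (await request.json())["text"]
        cleaned = await self.preprocessor.transform.remote(text)
        return await self.classifier.predict.remote(cleaned)


serve.run(Pipeline.bind(Preprocessor.bind(), Classifier.bind()))
```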


A Web Server & an ML Serving Compute Library

With native support for FastAPI, Ray Serve bridges the gap between web servers and specialized model serving frameworks. Leverage automatic documentation, typed Python (Pydantic), validation, security and authentication, performance, asynchronicity, and routing.
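A sketch of the FastAPI integration (the route and request model here are illustrative):

```python
from fastapi import FastAPI
from pydantic import BaseModel
from ray import serve

app = FastAPI()


class PredictRequest(BaseModel):
    text: str  # Pydantic validates and types the payload


@serve.deployment
@serve.ingress(app)  # FastAPI handles routing, docs, and validation
class APIServer:
    @app.post("/predict")
    def predict(self, body: PredictRequest) -> dict:
        return {"label": "positive" if "good" in body.text else "negative"}


serve.run(APIServer.bind())
```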

What Users are Saying About Ray and Anyscale

Explore how thousands of engineers from companies of all sizes and across all verticals are tackling real-world workloads with Ray and Anyscale.


At OpenAI, we are tackling some of the world’s most complex and demanding computational problems. Ray powers our solutions to the thorniest of these problems and allows us to iterate at scale much faster than we could before. As an example, we use Ray to train our largest models, including ChatGPT.

Greg Brockman | Co-founder, Chairman, and President

Anyscale is a fully managed, scalable Ray compute platform that provides the easiest way to develop, deploy, and manage Ray applications.

Customer logos: OpenAI, Amazon, Dow, Dendra Systems, Visa, Ricardo, Uber