Ray LLM is an ML library for LLM inference, only available on Anyscale.
Combining the performance gains from Anyscale’s optimized vLLM and the production-readiness and scalability capabilities of Anyscale’s Ray Serve, RayLLM is the best way to deploy and manage open source LLMs.
We’ve optimized vLLM capabilities for throughput, latency, and model parallelism. Tune your engine performance to your specific needs and reduce batch and online inference costs.
RayLLM supports any model. It’s compatible with OpenAI’s API, and offers JSON mode and function calling support. Plus, easily deploy multi-LoRA and vision language models.
Fine-tune and deploy any open-source model that’s supported on HuggingFace, including popular open-source models like LLaMA, Mistral, and more. Run any inference engine, including vLLM, TRT-LLM, TGI, and more.
RayLLM is built on top of Ray Serve, Anyscale’s highly scalable and efficient ML serving system. Deploy LLMs without worrying, with support for head node recovery, Multi-AZ support, and zero downtime upgrades.