Building Scalable RAG Pipelines with Ray and Anyscale

By Kunling Geng   |   June 4, 2025

Welcome to our enhanced guide on building Retrieval-Augmented Generation (RAG) applications. This blog builds upon our previous guide to RAG-based applications. In this follow-up, we take a deeper look at real-world challenges and show how Anyscale and Ray can help you build more scalable, production-ready RAG systems.

All of the notebooks covered in this guide are available in Anyscale Workspaces. You can launch them instantly, explore the full RAG development process, and try it yourself with your own data. Get started with Anyscale, or book a demo today.

Whether you are new to RAG, Ray, or Anyscale, this blog will walk you through the basics. If you are already an expert in any of these, feel free to jump straight to the tutorial walkthrough.

Why RAG?

Retrieval-Augmented Generation (RAG) helps enterprises unlock value from unstructured documents – like PDFs, slides, emails, forms – by grounding LLM responses in your proprietary data. With RAG, you can reduce hallucinations, provide transparent citations, gracefully handle edge cases, and instantly incorporate new information – all without model retraining.

Four key benefits of RAG:

  1. Reduced hallucinations: RAG helps reduce hallucinations by grounding answers in verifiable, up-to-date data – including proprietary internal docs that can’t be shared for model training due to privacy or regulatory constraints.

  2. Transparent sourcing: Well-designed RAG systems provide citations, allowing users to verify exactly where information came from.

  3. Graceful fallbacks: A properly implemented RAG system will inform users when it can't find relevant sources, effectively flagging potential hallucinations.

  4. No retraining needed: Simply add the new documents to your RAG system through data ingestion, and your system can immediately answer questions about this new information.

When building RAG-based applications, it's important to consider three key components of the architecture:

  • Data Pipeline: Handles document ingestion, text chunking, and embedding generation. This stage transforms raw data into vector representations ready for retrieval.

  • Vector Store: Manages the storage and indexing of embeddings, enabling efficient similarity search and retrieval of relevant context at query time.

  • Serving Pipeline: Embeds incoming user queries, performs vector similarity search against the vector store, and passes the retrieved context to a language model for final answer generation.

Why Ray for RAG?

  • Distributed Python: Ray is a general-purpose distributed framework built for Python, allowing you to scale end-to-end RAG pipelines using native Python libraries. You can orchestrate multiple components, like LangChain for chunking and Hugging Face models for embedding, in a single, unified runtime.

  • Heterogeneous (CPU+GPU) clusters: Ray makes it easy to run GPU-bound tasks like embedding generation alongside CPU-bound I/O tasks such as chunking, preprocessing, or vector DB writes. This resource-aware scheduling across a shared cluster improves hardware utilization and eliminates the need for separate infrastructure management.

  • Persistent object store: Ray’s in-memory object store lets intermediate data persist across pipeline stages without reads and writes to external systems after each step. This reduces latency, simplifies orchestration, and enables efficient chaining of multi-step RAG workflows.
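To make that concrete, here is a minimal sketch, not taken from the tutorials, of a CPU-bound chunking task feeding a GPU-bound embedding task. The chunking logic and model name are placeholder choices:

```python
import ray

ray.init()

@ray.remote(num_cpus=1)
def chunk_document(text: str, chunk_size: int = 512) -> list[str]:
    # CPU-bound stage: naive fixed-size chunking as a stand-in for a
    # real splitter such as LangChain's RecursiveCharacterTextSplitter.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

@ray.remote(num_gpus=1)
def embed_chunks(chunks: list[str]):
    # GPU-bound stage: Ray schedules this task onto a GPU node.
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model
    return model.encode(chunks)

# Passing the ObjectRef directly chains the two stages through Ray's
# in-memory object store; no external storage round-trip between steps.
chunks_ref = chunk_document.remote("...your document text...")
embeddings = ray.get(embed_chunks.remote(chunks_ref))
```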

Why Use Ray on Anyscale?

Built on Ray by its creators, Anyscale extends Ray’s distributed compute power with:

  • Observability tooling: Ray is easy to try locally, but in a remote environment, debugging and tuning can get complex without unified logs and purpose-built monitoring. With persistent logs, managed dashboards, and workload-specific observability like DAG-based views for data pipelines, Anyscale makes it easy to trace issues and optimize bottlenecks in your distributed RAG workflows.

  • Managed, reliable clusters: Anyscale minimizes Ray cluster operational overhead with elastic, resilient clusters in your cloud of choice. It offers programmatic provisioning, fast autoscaling, and auto-shutdown for completed batch jobs – so you can focus on your application, not distributed cluster management.

  • Exclusive performance optimizations: For specific workloads, Anyscale can deliver higher performance per compute unit than open-source Ray as your pipelines scale with data or user load. It also supports smart use of spot instances, giving you better control over cost and performance.

Ray and Anyscale: Powering Enterprise-Scale RAG

A typical RAG pipeline begins with massive amounts of unstructured content that must be parsed, chunked, embedded, and stored as vectors for retrieval. This process can be painfully slow if done sequentially – even ingesting a 10–20 page document can take 5–10 minutes.

Ray enables distributed data processing across CPUs and GPUs, dramatically accelerating this pipeline. With Ray Data, you can extract text from documents using a combination of direct parsing and OCR for scanned or image-based files, chunk and embed the text content, and store it in your vector database – all in parallel and at scale.
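In sketch form, such a pipeline might look like the following. The bucket path, helper logic, and model are placeholders rather than the tutorial's code:

```python
import ray

def parse_bytes(row):
    # Stand-in parser: the tutorials use direct parsing plus OCR via
    # Unstructured IO; here we simply decode the bytes as UTF-8.
    return {"text": row["bytes"].decode("utf-8", errors="ignore")}

def chunk_text(row, size=1000):
    # flat_map: one input document fans out into many chunk rows.
    text = row["text"]
    return [{"chunk": text[i:i + size]} for i in range(0, len(text), size)]

class Embedder:
    def __init__(self):
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model

    def __call__(self, batch):
        batch["embedding"] = self.model.encode(list(batch["chunk"]))
        return batch

ds = (
    ray.data.read_binary_files("s3://my-bucket/docs/")  # placeholder bucket
    .map(parse_bytes)        # CPU-bound parsing
    .flat_map(chunk_text)    # CPU-bound chunking
    .map_batches(Embedder, batch_size=64, num_gpus=1, concurrency=2)  # GPU-bound
)
```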

On top of that, Anyscale offers the infrastructure to run Ray seamlessly. With Anyscale Workspaces, you get:

  • One-click tutorial launches

  • Autoscaling clusters with pre-configured environments

  • Effortless distributed workload management

For enterprises with 90% of their data locked in unstructured documents, Ray and Anyscale together deliver exceptional end-to-end document processing capabilities that make enterprise-scale RAG not just possible, but practical.

RAG Tutorials on Anyscale: A Step-by-Step Journey

We've created a comprehensive series of notebooks that guide you through building production-ready RAG applications. These tutorials are immediately available on Anyscale Workspaces, allowing you to learn and build simultaneously. 

Each notebook in this guide is designed not just to teach you how RAG works – but to help you build your own real-world solution.

Try on Anyscale

Notebook 1: A Simple (but Flawed) RAG Pipeline

Notebook 1 introduces the fundamental components of a RAG pipeline using a traditional, non-distributed setup.

What you'll do:

  • Load multiple document formats (PDF, DOCX, PPTX, HTML, TXT)

  • Chunk and embed using LangChain and SentenceTransformer

  • Store embeddings in Chroma DB

  • Perform semantic search

What you'll learn:

  • How to process various document formats using Unstructured IO

  • Techniques for text chunking with different strategies (fixed vs. recursive)

  • Basic vector similarity search implementation

  • Embedding generation with SentenceTransformer models

  • Vector storage setup with Chroma DB
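For orientation, here is a compressed sketch of this baseline pipeline. The chunk sizes, model, and collection name are illustrative, not the notebook's exact settings:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer
import chromadb

raw_text = "..."  # stands in for text parsed out of PDFs, DOCX, PPTX, etc.

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_text(raw_text)

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(chunks)

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("docs")
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embeddings.tolist(),
)

# Semantic search: embed the query, then retrieve the nearest chunks.
hits = collection.query(
    query_embeddings=model.encode(["What does the report conclude?"]).tolist(),
    n_results=5,
)
```

Every step here runs sequentially on one machine, which is exactly the flaw the next notebook addresses.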

Why it matters: This notebook establishes the canonical RAG pipeline that most teams start with. By working through this notebook first, you'll understand exactly why distributed processing becomes necessary as your document collection grows. 

That’s where Notebook 2 comes in.

Notebook 2: Scaling Data Ingestion with Ray

In this notebook, you’ll see how Ray transforms the document processing pipeline through distributed computing, showcasing a highly efficient approach to handling large document collections.

What you'll do:

  • Load 100+ docs from S3 with ray.data.read_binary_files

  • Parallelize chunking, embedding, and storage

  • Optimize CPU/GPU resource allocation – using CPUs for parsing and GPUs for embedding

What you'll learn:

  • How to implement parallel document processing with configurable concurrency

  • Tips for building a complete data pipeline with Ray Data operations

  • Guidelines for designing scalable components for chunking, embedding, and vector storage

Why it matters: The scalable architecture you'll build can process hundreds of documents containing thousands of pages in minutes instead of hours. You'll work with a realistic dataset of 100 documents (over 6,000 pages) and process them in approximately 10 minutes, demonstrating the power of distributed computing for document processing. 

By the end of this notebook, you'll have a production-ready document ingestion pipeline capable of handling enterprise-scale document collections.
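One pragmatic way to finish such a pipeline, assuming ds is an embedded Ray Dataset like the one sketched earlier and Chroma is the vector store (the notebook's storage code may differ):

```python
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("docs")

# Chroma's client is not distributed, so stream embedded batches back
# to the driver and upsert them into the collection.
for i, batch in enumerate(ds.iter_batches(batch_size=256, batch_format="numpy")):
    collection.add(
        ids=[f"chunk-{i}-{j}" for j in range(len(batch["chunk"]))],
        documents=list(batch["chunk"]),
        embeddings=batch["embedding"].tolist(),
    )
```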

Notebook 3: Deploying LLMs with Ray Serve

Once your data is ready, you need to serve your model. This notebook shows you how to serve open-source LLMs with production-grade reliability. 

What you'll do:

  • Spin up a Ray Serve deployment with a large open-source model (e.g., Qwen2.5-32B-Instruct).

  • Use tensor parallelism across 4 GPUs to accelerate inference.

  • Use build_openai_app to create an OpenAI API-compatible endpoint.

  • Stream responses to end users in real time.

What you'll learn:

  • How to configure model deployments with Ray Serve's LLM API and deploy to production with Anyscale Services

  • Tips for managing GPU resources efficiently for inference

  • How to build a custom LLMClient wrapper for both streaming and full-text responses

Why it matters: What’s powerful here is the simplicity. With a few lines of config, you’ll deploy a massive model without worrying about GPU provisioning, batching, or load balancing. Anyscale automates the hard parts.
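As a taste of that simplicity, a deployment along these lines is possible with Ray Serve's LLM API. The config values here are illustrative, and the exact API surface depends on your Ray version (it requires the ray[serve,llm] extras):

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-2.5-32b",                   # name clients will request
        model_source="Qwen/Qwen2.5-32B-Instruct",  # Hugging Face model source
    ),
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=2),
    ),
    engine_kwargs=dict(tensor_parallel_size=4),    # shard across 4 GPUs
)

# Exposes an OpenAI API-compatible endpoint (e.g., /v1/chat/completions).
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)
```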

For security-conscious industries or organizations looking to reduce API costs, self-hosted models are increasingly important. This notebook includes complete code for both local development and production deployment, allowing you to build a secure, cost-effective foundation for your enterprise RAG system.

Notebook 4: Building the Query Pipeline

This notebook ties everything together into a complete end-to-end RAG system that can answer real user questions.

What you'll do:

  • Embed a user query

  • Search for relevant chunks

  • Retrieve and rank relevant document chunks

  • Generate and stream a response from your model

What you'll learn:

  • How to create the complete query processing workflow from user question to final answer

  • How to implement vector search with relevance filtering

  • Tips for designing context integration with prompt templates

  • How to manage LLM interactions with streaming responses

  • Guidelines for building a practical RAG MVP with minimal code

Why it matters: This is your first full-stack RAG MVP – connecting retrieval, ranking, and generation for accurate, context-aware answers.
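Condensed into a sketch, the loop looks roughly like this. The endpoint URL, model id, and collection name are placeholders tied to the earlier notebooks:

```python
from openai import OpenAI
from sentence_transformers import SentenceTransformer
import chromadb

embed_model = SentenceTransformer("all-MiniLM-L6-v2")
collection = chromadb.PersistentClient(path="./chroma_db").get_collection("docs")
# Any OpenAI-compatible client works against the Ray Serve endpoint.
llm = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def answer(question: str, n_results: int = 5) -> None:
    # Embed the query, retrieve the nearest chunks, and build the prompt.
    hits = collection.query(
        query_embeddings=embed_model.encode([question]).tolist(),
        n_results=n_results,
    )
    context = "\n\n".join(hits["documents"][0])
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"

    # Stream the generation back to the user token by token.
    stream = llm.chat.completions.create(
        model="qwen-2.5-32b",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
```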

Notebook 5: Advanced Prompt Engineering

Now it's time to transform your basic RAG system into a professional, production-quality application through sophisticated prompt engineering. Through examples, you’ll see how naive prompts can fail – returning irrelevant, generic, or simplistic answers – and how to fix them with better prompt design.

What you'll do:

  • Handle ambiguous or malicious queries

  • Insert chat history for context

  • Format answers with citations in Markdown format

  • Implement proper request filtering

What you'll learn:

  • How to identify and fix common RAG prompt issues

  • Tips for maintaining context across multiple conversation turns with proper windowing to manage context length

  • Guidelines for transforming ambiguous follow-up questions into context-aware queries

  • How to add safety filters to block irrelevant or malicious queries before they reach your model

  • How to design structured response formatting with Markdown

  • How to add citation systems with source tracking
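As a flavor of the techniques involved, one illustrative way to lay out such a prompt is to number the retrieved chunks so the model can cite them, and to window the chat history to bound context length. The notebook's actual templates and filtering logic will differ:

```python
# Illustrative prompt layout with numbered sources and history windowing.
SYSTEM = (
    "Answer only from the numbered sources below. "
    "Cite sources inline as [1], [2], ... using Markdown. "
    "If no source is relevant, say you cannot answer."
)

def render_prompt(question, chunks, history, max_turns=3):
    sources = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    recent = history[-max_turns:]  # windowing keeps context length bounded
    turns = "\n".join(f"{role}: {text}" for role, text in recent)
    return (
        f"{SYSTEM}\n\nSources:\n{sources}\n\n"
        f"Conversation so far:\n{turns}\n\nUser: {question}"
    )
```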

Why it matters: These techniques turn a working prototype into a polished, production-ready application – one that handles ambiguity, stays on topic, and cites its sources.

Notebook 6: Evaluating RAG with Online Inference (and Why It’s a Problem)

Most RAG projects fail at the eval stage.

As your RAG system matures, evaluation becomes critical – but most approaches have serious limitations. This notebook shows common evaluation methods and why they break down at scale.

What you'll do:

  • Load eval prompts from a CSV file.

  • Embed the queries and run the full RAG loop.

  • Collect outputs in a structured format.

  • Save results to CSV for manual review.
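In outline, the evaluation loop has this sequential shape. The file and column names are illustrative, and run_rag is a stub standing in for the full query pipeline:

```python
import csv

def run_rag(query: str) -> str:
    # Stub for the full embed -> retrieve -> generate loop; every call
    # pays a complete round trip to the serving endpoint, one at a time.
    return "..."

with open("eval_prompts.csv") as f:
    results = [
        {"query": row["query"], "answer": run_rag(row["query"])}
        for row in csv.DictReader(f)
    ]

with open("eval_results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["query", "answer"])
    writer.writeheader()
    writer.writerows(results)
```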

What you'll learn:

  • Tips for creating structured test datasets with categorized queries

  • How to build evaluation pipelines for RAG systems

  • The challenges of relying on online inference for evaluation

  • Guidelines for recognizing scalability bottlenecks in traditional approaches

  • Tips for quantifying overhead and cost implications

Why it matters: This notebook demonstrates how conventional approaches introduce production stability risks, system management overhead, and cost inefficiencies, including:

  • Scalability Challenges: Even evaluating just 64 test questions becomes time-consuming and inefficient with online inference

  • Production Stability Risks: Heavy evaluation workloads can potentially disrupt production LLM services due to resource competition

  • System Overhead: Maintaining separate evaluation infrastructure requires significant engineering resources and operational complexity

  • Cost Inefficiencies: Dedicated evaluation services that remain running incur unnecessary expenses, while using production systems for evaluation creates resource contention

Notebook 7: Scalable Evaluation with Ray Data Batch Inference

The final notebook presents a superior approach to RAG evaluation using Ray Data's batch processing capabilities – making large-scale assessment feasible and efficient. Instead of online inference, we recommend batch evaluation with Ray Data.

What you'll do:

  • Generate embeddings and prompts in batches

  • Run inference on Ray using GPU workers

  • Collect and store evaluation results

What you'll learn:

  • How to build evaluation pipelines with Ray Dataset

  • Tips for building a complete evaluation workflow with distinct processing stages: (1) Embedding generation for queries, (2) Vector store retrieval and context collection, (3) Prompt rendering with retrieved context, (4) Batch inference using the LLM.

  • Guidelines for configuring GPU utilization with parameters like max_num_batched_tokens (controlling memory usage for token processing), max_num_seqs (setting concurrent sequence processing limits), and tensor_parallel_size (distributing model weights across GPUs).

  • How to configure LLM batch inference for optimal throughput

  • Tips for collecting and analyzing comprehensive evaluation metrics

Why it matters: This notebook provides a production-ready evaluation framework. You'll work with Ray's LLM processor to build a complete evaluation pipeline that handles embedding generation, context retrieval, prompt rendering, and inference – all in a distributed fashion.
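A sketch of such a processor using Ray Data's LLM API follows. The field names track recent Ray releases and may differ by version, and the dataset here is a stand-in for the rendered-prompt output of the earlier stages:

```python
import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

# Stand-in dataset: in the notebook, rows carry prompts rendered from
# retrieved context by the preceding pipeline stages.
eval_ds = ray.data.from_items([{"prompt": "Example rendered prompt ..."}])

config = vLLMEngineProcessorConfig(
    model_source="Qwen/Qwen2.5-32B-Instruct",
    engine_kwargs=dict(
        max_num_batched_tokens=4096,  # caps memory used per token batch
        max_num_seqs=64,              # limit on concurrently processed sequences
        tensor_parallel_size=4,       # shard model weights across 4 GPUs
    ),
    concurrency=1,   # number of vLLM engine replicas
    batch_size=32,
)

processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["prompt"]}],
        sampling_params=dict(max_tokens=512),
    ),
    postprocess=lambda row: dict(answer=row["generated_text"], **row),
)

eval_ds = processor(eval_ds)  # lazily applied; results materialize on iteration
```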

What’s Next

We're continuously expanding our tutorial library with reference architectures for additional enterprise-focused use cases.

Getting Started

Ready to build your own enterprise-grade RAG application? All the notebooks mentioned in this blog are available directly in Anyscale Workspaces. 

Even if you don't have extensive data science resources, these tutorials will guide you step by step. You can:

  1. Follow the structured learning path from basic RAG to advanced techniques

  2. Inject your own data to test against real-world scenarios

  3. Deploy production-ready systems using the same code

  4. Scale from prototype to enterprise-grade application

Start with Notebook 1 to understand the foundations, then progress through the series to build increasingly sophisticated RAG capabilities. By the end, you'll have a complete, production-ready RAG system tailored to your enterprise needs.

Try on Anyscale

Looking for a guided demonstration? Book a demo with our team to see these notebooks in action with your specific use case.

Ready to try Anyscale?

Access Anyscale today to see how companies using Anyscale and Ray benefit from rapid time-to-market and faster iterations across the entire AI lifecycle.