Welcome to our enhanced guide on building Retrieval-Augmented Generation (RAG) applications. This blog builds upon our previous guide to RAG-based applications. In this follow-up, we take a deeper look at real-world challenges and show how Anyscale and Ray can help you build more scalable, production-ready RAG systems.
All of the notebooks covered in this guide are available in Anyscale Workspaces. You can launch them instantly, explore the full RAG development process, and try it yourself with your own data. Get started with Anyscale, or book a demo today.
Whether you are new to RAG, Ray or Anyscale this blog will walk through the basics. But if you are an expert across any of these feel free to jump straight to the tutorial walk through.
Retrieval-Augmented Generation (RAG) helps enterprises unlock value from unstructured documents – like PDFs, slides, emails, forms – by grounding LLM responses in your proprietary data. With RAG, you can reduce hallucinations, provide transparent citations, gracefully handle edge cases, and instantly incorporate new information – all without model retraining.
Reduced hallucinations: RAG helps reduce hallucinations by grounding answers in verifiable, up-to-date data – including proprietary internal docs that can’t be shared for model training due to privacy or regulatory constraints.
Transparent sourcing: Well-designed RAG systems provide citations, allowing users to verify exactly where information came from.
Graceful fallbacks: A properly implemented RAG system will inform users when it can't find relevant sources, effectively flagging potential hallucinations.
No retraining needed: Simply add the new documents to your RAG system through data ingestion, and your system can immediately answer questions about this new information.
When building RAG-based applications, it's important to consider three key components of the architecture:
Data Pipeline: Handles document ingestion, text chunking, and embedding generation. This stage transforms raw data into vector representations ready for retrieval.
Vector Store: Manages the storage and indexing of embeddings, enabling efficient similarity search and retrieval of relevant context at query time.
Serving Pipeline: Embeds incoming user queries, performs vector similarity search against the vector store, and passes the retrieved context to a language model for final answer generation.
Distributed Python: Ray is a general-purpose distributed framework built for Python, allowing you to scale end-to-end RAG pipelines using native Python libraries. You can orchestrate multiple components like LangChain for chunking and Hugging Face model for embedding in a single, unified runtime.
Heterogeneous (CPU+GPU) clusters: Ray makes it easy to run GPU-bound tasks like embedding generation alongside CPU-bound I/O tasks such as chunking, preprocessing, or vector DB writes. This resource-aware scheduling across a shared cluster improves hardware utilization and eliminates the need for separate infrastructure management.
Persistent object store: Ray’s in-memory object store lets intermediate data persist across pipeline stages without intermediate reads/writes to external systems after each step. This reduces latency, simplifies orchestration, and enables efficient chaining of multi-step RAG workflows.
Built on Ray by its creators, Anyscale extends Ray’s distributed compute power with:
Observability tooling: Ray is easy to try locally, but when moving to a remote environment debugging and tuning can get complex without unified logs and purpose-built monitoring. With persistent logs, managed dashboards, and workload-specific observability like DAG-based views for data pipelines, Anyscale makes it easy to trace issues and optimize bottlenecks in your distributed RAG workflows.
Managed, reliable clusters: Anyscale minimizes Ray cluster operational overhead with elastic, resilient clusters in your cloud of choice. It offers things like programmatic provisioning, fast autoscaling and auto-shutdown for completed batch jobs—so you can focus on your application, not distributed cluster management.
Exclusive performance optimizations: As pipelines scale with data or user load, for specific workloads, Anyscale can deliver higher performance per compute unit than open-source Ray. It also supports smart use of spot instances, giving you better control over cost and performance.
A typical RAG pipeline begins with massive amounts of unstructured content that must be parsed, chunked, embedded, and stored as vectors for retrieval. This process can be painfully slow if done sequentially – even ingesting a 10–20 page document can take 5–10 minutes.
Ray enables distributed data processing across CPUs and GPUs, dramatically accelerating this pipeline. With Ray Data, you can extract text from documents using a combination of direct parsing and OCR for scanned or image-based files, chunk and embed the text content, and store it in your vector database – all in parallel and at scale.
On top of that, Anyscale offers the infrastructure to run Ray seamlessly. With Anyscale Workspaces, you get:
One-click tutorial launches
Autoscaling clusters with pre-configured environments
Effortless distributed workload management
For enterprises with 90% of their data locked in unstructured documents, Ray and Anyscale together deliver exceptional end-to-end document processing capabilities that make enterprise-scale RAG not just possible, but practical.
We've created a comprehensive series of notebooks that guide you through building production-ready RAG applications. These tutorials are immediately available on Anyscale Workspaces, allowing you to learn and build simultaneously.
Each notebook in this guide is designed not just to teach you how RAG works – but to help you build your own real-world solution.
Notebook 1 introduces the fundamental components of a RAG pipeline using a traditional, non-distributed setup.
What you'll do:
Load multiple document formats (PDF, DOCX, PPTX, HTML, TXT)
Chunk and embed using LangChain and SentenceTransformer
Store embeddings in Chroma DB
Perform semantic search
What you'll learn:
How to process various document formats using Unstructured IO
Techniques for text chunking with different strategies (fixed vs. recursive)
Basic vector similarity search implementation
Embedding generation with SentenceTransformer models
Vector storage setup with Chroma DB
Why it matters: This notebook establishes the canonical RAG pipeline that most teams start with. By working through this notebook first, you'll understand exactly why distributed processing becomes necessary as your document collection grows.
That’s where Notebook 2 comes in.
In this notebook, you’ll see how Ray transforms the document processing pipeline through distributed computing, showcasing a highly efficient approach to handling large document collections.
What you'll do:
Load 100+ docs from S3 with ray.data.read_binary_files
Parallelize chunking, embedding, and storage
Optimize CPU/GPU resource allocation – using CPUs for parsing and GPUs for embedding
What you'll learn:
How to implement parallel document processing with configurable concurrency
Tips for building a complete data pipeline with Ray Data operations
Guidelines for designing scalable components for chunking, embedding, and vector storage
Why it matters: The scalable architecture you'll build can process hundreds of documents containing thousands of pages in minutes instead of hours. You'll work with a realistic dataset of 100 documents (over 6,000 pages) and process them in approximately 10 minutes, demonstrating the power of distributed computing for document processing.
By the end of this notebook, you'll have a production-ready document ingestion pipeline capable of handling enterprise-scale document collections.
Once your data is ready, you need to serve your model. This notebook shows you how to serve open-source LLMs with production-grade reliability.
What you'll do:
Spin up a Ray Serve deployment with a large open-source model (e.g., Qwen2.5-32B-Instruct).
Use tensor parallelism across 4 GPUs to accelerate inference.
Use build_openai_app to create an OpenAI API-compatible endpoint
Stream responses to end users in real-time.
What you'll learn:
How to configure model deployments with Ray Serve's LLM API and deploy to production with Anyscale Services
Tips for managing GPU resources efficiently for inference
How to build a custom LLMClient wrapper for both streaming and full-text responses
Why it matters: What’s powerful here is the simplicity. With a few lines of config, you’ll deploy a massive model without worrying about GPU provisioning, batching, or load balancing. Anyscale automates the hard parts.
For security-conscious industries or organizations looking to reduce API costs, self-hosted models are increasingly important. This notebook includes complete code for both local development and production deployment, allowing you to build a secure, cost-effective foundation for your enterprise RAG system.
This notebook ties everything together into a complete end-to-end RAG system that can answer real user questions.
What you'll do:
Embed a user query
Search for relevant chunks
Retrieve and rank relevant document chunks
Generate and stream a response from your model
What you'll learn:
How to create the complete query processing workflow from user question to final answer
How to implement vector search with relevance filtering
Tips for designing context integration with prompt templates
How to manage LLM interactions with streaming responses
Guidelines for building a practical RAG MVP with minimal code
Why it matters: This is your first full-stack RAG MVP – connecting retrieval, ranking, and generation for accurate, context-aware answers.
Now it's time to transform your basic RAG system into a professional, production-quality application through sophisticated prompt engineering. Through examples, you’ll see how naive prompts can fail – returning irrelevant, generic, or simplistic answers – and how to fix them with better prompt design.
What you'll do:
Handle ambiguous or malicious queries
Insert chat history for context
Format answers with citations in Markdown format
Implement proper request filtering
What you'll learn:
How to identify and fix common RAG prompt issues
Tips for maintaining context across multiple conversation turns with proper windowing to manage context length
Guidelines for transforming ambiguous follow-up questions into context-aware queries
How to add safety filters to block irrelevant or malicious queries before they reach your model
How to design structured response formatting with Markdown
How to add citation systems with source tracking
Why it matters: Through practical examples, you'll transform your basic RAG system into a polished, production-ready application through sophisticated prompt engineering.
Most RAG projects fail at the eval stage.
As your RAG system matures, evaluation becomes critical – but most approaches have serious limitations. This notebook shows common evaluation methods and why they break down at scale.
What you'll do:
Load eval prompts from a CSV file.
Embed the queries and run the full RAG loop.
Collect outputs in a structured format.
Save results to CSV for manual review.
What you'll learn:
Tips for creating structured test datasets with categorized queries
How to build evaluation pipelines for RAG systems
How to understand the challenges of online inference for evaluation
Guidelines for recognizing scalability bottlenecks in traditional approaches
Tips for quantifying overhead and cost implications
Why it matters: This notebook demonstrates how conventional approaches introduce production stability risks, system management overhead, and cost inefficiencies, including:
Scalability Challenges: This notebook demonstrates how even evaluating just 64 test questions becomes time-consuming and inefficient with online inference
Production Stability Risks: Heavy evaluation workloads can potentially disrupt production LLM services due to resource competition
System Overhead: Maintaining separate evaluation infrastructure requires significant engineering resources and operational complexity
Cost Inefficiencies: Dedicated evaluation services that remain running incur unnecessary expenses, while using production systems for evaluation creates resource contention
The final notebook presents a superior approach to RAG evaluation using Ray Data's batch processing capabilities – making large-scale assessment feasible and efficient. Instead of online inference, we recommend batch evaluation with Ray Data.
What you'll do:
Generate embeddings and prompts in batches
Run inference on Ray using GPU workers
Collect and store evaluation results
What you'll learn:
How to build evaluation pipelines with Ray Dataset
Tips for building a complete evaluation workflow with distinct processing stages: (1) Embedding generation for queries, (2) Vector store retrieval and context collection, (3) Prompt rendering with retrieved context, (4) Batch inference using the LLM.
Guidelines for configuring GPU utilization with parameters like max_num_batched_tokens (controlling memory usage for token processing), max_num_seqs (setting concurrent sequence processing limits), and tensor_parallel_size (distributing model weights across GPUs).
How to configure LLM batch inference for optimal throughput
Tips for collecting and analyzing comprehensive evaluation metrics
Why it matters: This notebook provides a production-ready evaluation framework. You'll work with Ray's LLM processor to build a complete evaluation pipeline that handles embedding generation, context retrieval, prompt rendering, and inference – all in a distributed fashion.
We're continuously expanding our tutorial library with reference architectures for additional enterprise-focused use cases.
Ready to build your own enterprise-grade RAG application? All the notebooks mentioned in this blog are available directly in Anyscale Workspaces.
Even if you don't have extensive data science resources, these tutorials will guide you step by step. You can:
Follow the structured learning path from basic RAG to advanced techniques
Inject your own data to test against real-world scenarios
Deploy production-ready systems using the same code
Scale from prototype to enterprise-grade application
Start with Notebook 1 to understand the foundations, then progress through the series to build increasingly sophisticated RAG capabilities. By the end, you'll have a complete, production-ready RAG system tailored to your enterprise needs.
Looking for a guided demonstration? Book a demo with our team to see these notebooks in action with your specific use case.