Building a Retrieval-Augmented Generation (RAG) system is the "Hello World" of AI engineering in 2026. But most tutorials stop at the "Naive RAG" stage, and in production, that approach fails.
1. Naive vs. Production RAG
Naive RAG simply retrieves the top K chunks based on vector similarity.
Production RAG adds layers of intelligence:
- Hybrid Search: Combining Vector Search (Semantic) with BM25 (Keyword).
- Re-ranking: Using a cross-encoder to score the retrieved documents for true relevance, not just vector proximity.
- Query Expansion: Generating multiple variations of the user's question.
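To make query expansion concrete, here is a minimal sketch of the deterministic parts: building the expansion prompt and parsing the model's newline-separated output into variants. The actual LLM call is left out; `EXPANSION_PROMPT`, `build_expansion_prompt`, and `parse_variants` are illustrative names, not a library API.

```python
# Sketch of query expansion: build a prompt for an LLM, then parse its
# newline-separated output into query variants. The LLM call itself is
# omitted; plug in whatever chat-completion client you use.

EXPANSION_PROMPT = (
    "Rewrite the user question in {n} different ways, one per line, "
    "preserving its meaning.\n\nQuestion: {question}"
)

def build_expansion_prompt(question: str, n: int = 3) -> str:
    return EXPANSION_PROMPT.format(n=n, question=question)

def parse_variants(llm_output: str, original: str) -> list[str]:
    # Keep the original query plus every non-empty variant line.
    variants = [line.strip("- ").strip() for line in llm_output.splitlines()]
    return [original] + [v for v in variants if v]

prompt = build_expansion_prompt("How do I fix a leak?")
variants = parse_variants(
    "What causes a leak?\nHow can leaks be repaired?",
    "How do I fix a leak?",
)
```

Each variant is then sent through retrieval, and the result sets are merged, which is what makes expansion robust to phrasing mismatches.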
2. Architecture Overview
The Production Stack
- Orchestration: LangChain (Python)
- Vector Store: ChromaDB (Local/Server mode)
- Embedding Model: OpenAI `text-embedding-3-small` (Cost-efficient)
- Re-ranker: Cohere Rerank or BAAI/bge-reranker
- LLM: GPT-4o-mini (for speed)
3. Advanced Ingestion Strategy
Don't just blind-chunk. Use ParentDocumentRetriever. This technique chunks documents into small pieces for embedding (better semantic match) but retrieves the entire parent section for the LLM (better context).
```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

embedding = OpenAIEmbeddings(model="text-embedding-3-small")

# The trick: small chunks for search, big chunks for context
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

store = InMemoryStore()
vectorstore = Chroma(collection_name="split_parents", embedding_function=embedding)

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(docs)  # splits children, embeds them, stores parents
```
Chunking Strategies Compared
Not all chunking is created equal. The strategy you choose directly impacts retrieval quality. Here's a breakdown of the most common approaches and when to use each.
| Strategy | Chunk Size | Best For | Drawback |
|---|---|---|---|
| Fixed-Size (Naive) | 512 tokens | Quick prototypes | Breaks mid-sentence, loses context |
| Recursive Character | 400–1000 tokens | General documents | Still ignores semantic boundaries |
| Parent-Child (Recommended) | 400 child / 2000 parent | Production RAG | Requires dual storage setup |
| Semantic Chunking | Variable | Research papers, legal docs | Slow, needs embedding model at ingest |
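To see why recursive splitting beats fixed-size chunking, here is a minimal standalone sketch (not LangChain's implementation) of the core idea: prefer paragraph, then line, then word boundaries before ever cutting mid-string.

```python
# Illustrative recursive splitter: try coarse separators first, only
# hard-cut when no boundary fits inside chunk_size.

def recursive_split(text: str, chunk_size: int, seps=("\n\n", "\n", " ")) -> list[str]:
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    for sep in seps:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, current = [], ""
            for part in parts:
                candidate = current + sep + part if current else part
                if len(candidate) <= chunk_size:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    current = part
            if current:
                chunks.append(current)
            # Recurse into any piece that is still too large.
            return [c for chunk in chunks for c in recursive_split(chunk, chunk_size, seps)]
    # No separator worked: hard cut as a last resort.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

chunks = recursive_split("First paragraph.\n\nSecond paragraph with more words.", 25)
```

Every chunk lands on a paragraph or word boundary; a fixed-size splitter with the same budget would cut words in half.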
4. Implementing Hybrid Search
Vector search is great for concepts ("How do I fix a leak?"), but terrible for exact matches ("Error code 503"). We need EnsembleRetriever.
```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# 1. Keyword retriever (BM25)
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 5

# 2. Semantic retriever (Chroma)
chroma_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# 3. Combine them
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, chroma_retriever],
    weights=[0.4, 0.6],  # 40% keyword, 60% semantic
)
```
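Under the hood, EnsembleRetriever merges the ranked lists with weighted Reciprocal Rank Fusion: each document scores `weight / (rank + c)` per retriever, summed across retrievers. A standalone sketch of that fusion (the document IDs are made up; `c = 60` is the commonly used constant):

```python
# Weighted Reciprocal Rank Fusion over ranked result lists.

def weighted_rrf(result_lists: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for results, weight in zip(result_lists, weights):
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc_503_errors", "doc_http_codes", "doc_plumbing"]
vector_hits = ["doc_http_codes", "doc_networking", "doc_503_errors"]
fused = weighted_rrf([bm25_hits, vector_hits], weights=[0.4, 0.6])
```

Documents that appear high in both lists rise to the top, which is exactly the behavior you want from hybrid search.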
5. Re-ranking: The Secret Sauce
Retrieving 10 documents is easy; knowing which one actually contains the answer is hard. Embedding models compress meaning into dense vectors, losing nuance. A cross-encoder re-ranker instead reads the query and each document together as a pair and outputs a direct relevance score.
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-english-v3.0", top_n=3)  # keep top 3 AFTER re-ranking
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=ensemble_retriever,
)
# The pipeline is now: Hybrid Search (top 10) -> Re-rank -> Top 3 -> LLM
```
6. Evaluation with RAGAS
You can't improve what you don't measure. I use the RAGAS framework to effectively "unit test" the pipeline.
- Faithfulness: Does the answer hallucinate?
- Answer Relevance: Is the answer useful?
- Context Recall: Did we retrieve the ground truth?
- Context Precision: Are the top-ranked docs actually relevant?
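Context Precision is the metric the re-ranker moves most, so it is worth seeing the computation. Below is a simplified version: average precision@k over the positions of the relevant contexts (the real RAGAS metric uses an LLM judge to produce the relevance labels; here they are given as booleans).

```python
# Simplified RAGAS-style Context Precision: mean of precision@k taken at
# each position k where the retrieved context is relevant.

def context_precision(relevance: list[bool]) -> float:
    precisions, hits = [], 0
    for k, relevant in enumerate(relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Re-ranking moves relevant docs toward the top, raising the score:
before = context_precision([False, True, False, True])  # relevant docs buried
after = context_precision([True, True, False, False])   # relevant docs on top
```

Same retrieved set, same number of relevant documents, but re-ordering them doubles the score, which is why re-ranking shows up so strongly in this metric.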
In my benchmarks, adding the Re-ranker increased Context Precision by 18%, drastically reducing hallucinations.
CI/CD Integration for RAG Testing
The biggest mistake teams make is treating RAG evaluation as a one-time exercise. In production, your knowledge base changes daily — new documents are indexed, old ones are deprecated. You need continuous evaluation baked into your CI/CD pipeline.
```yaml
# .github/workflows/rag-eval.yml
name: RAG Evaluation

on:
  push:
    paths: ['knowledge_base/**']  # Trigger on KB changes

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run RAGAS Eval Suite
        run: |
          pip install ragas langchain-openai
          python eval/run_ragas.py \
            --dataset eval/golden_qa.json \
            --threshold-faithfulness 0.85 \
            --threshold-relevance 0.80
      - name: Fail if Regression
        run: python eval/check_regression.py
```
7. Tools of the Trade
Building production RAG in 2026 requires a modern stack. Here are the essential categories and my recommendations.
Vector Stores
- ChromaDB: Best for local dev and small-scale production.
- Pinecone: Fully managed, best for enterprise scale.
- Weaviate: Open-source with built-in hybrid search.
Orchestration
- LangChain: The Swiss Army knife — flexible but complex.
- LlamaIndex: Data-first approach, best for document Q&A.
- Haystack: Production-ready pipelines with minimal boilerplate.
Key Takeaways
- Naive RAG fails in production. You need Hybrid Search (BM25 + Semantic) to cover both keyword and conceptual queries.
- Re-ranking is the secret sauce. A Cross-Encoder re-ranker improved our Context Precision by 18% — the single highest-impact change.
- Parent-Child chunking > Fixed-size. Small chunks for search, large chunks for context gives you the best of both worlds.
- Evaluation is not optional. Use RAGAS in CI/CD to catch regressions before they reach users.
- Cost matters. With `text-embedding-3-small` + GPT-4o-mini, we hit $0.02/1k queries — 10x cheaper than GPT-4 alone.
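For readers who want to reproduce a cost figure like the one above, here is a back-of-the-envelope calculator. The token counts and per-million-token prices in the example call are placeholders, not quoted rates; substitute your provider's current pricing and your own measured per-query usage.

```python
# Query cost = embedding tokens + LLM input tokens + LLM output tokens,
# each billed at its own per-1M-token rate.

def cost_per_1k_queries(
    embed_tokens: int, in_tokens: int, out_tokens: int,
    embed_price: float, in_price: float, out_price: float,  # $ per 1M tokens
) -> float:
    per_query = (
        embed_tokens * embed_price
        + in_tokens * in_price
        + out_tokens * out_price
    ) / 1_000_000
    return per_query * 1_000

cost = cost_per_1k_queries(
    embed_tokens=20, in_tokens=1_000, out_tokens=200,   # placeholder usage
    embed_price=0.02, in_price=0.15, out_price=0.60,    # placeholder prices
)
```

The LLM input tokens dominate, so trimming retrieved context (e.g. re-ranking down to 3 documents instead of stuffing 10) is usually the biggest cost lever.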