
Getting Started with RAG: Retrieval-Augmented Generation Explained

24 February 2025 · 8 min read · Harshit Gupta
TL;DR

RAG (Retrieval-Augmented Generation) gives LLMs access to your private, up-to-date data at inference time — without retraining. It works in two phases: offline indexing (chunk → embed → store) and online retrieval (embed query → fetch top-k chunks → inject into prompt). Use RAG when your knowledge changes frequently. Use fine-tuning when you need to change behavior permanently.

The Problem: LLMs Are Frozen in Time

Imagine you've built a customer support bot on GPT-4. Your users love it — until they ask about the new feature you shipped last week. The model confidently gives them outdated information. Or worse, it hallucinates an answer that sounds plausible but is completely wrong.

This isn't a GPT-4 problem. It's a fundamental limitation of how LLMs work: their knowledge is frozen at the training cutoff. Ask any model about your company's internal docs, last quarter's data, or anything that happened after its training — and it will either say "I don't know" or make something up.

Retrieval-Augmented Generation (RAG) solves this elegantly. Instead of baking knowledge into the model's weights, you give the model a retrieval mechanism — a way to look up relevant information from your data store at inference time, right before generating a response.

Why this matters in 2025

As LLM context windows grow (128K, even 1M tokens), some people ask: why not just dump your entire knowledge base into every prompt? Two reasons: cost and latency. Stuffing a 10,000-document knowledge base into every query would cost thousands of dollars per day and add seconds of latency. RAG retrieves only the 3–5 most relevant chunks: surgical precision at a fraction of the cost.

How RAG Works: The Two-Phase Pipeline

Phase 1: Indexing (runs offline, once)

This is the prep work. You take your documents — PDFs, Notion pages, database records, whatever — and transform them into searchable vector representations:

  • Chunk your documents into smaller pieces (typically 256–512 tokens)
  • Embed each chunk using an embedding model (e.g., text-embedding-3-small)
  • Store the vectors in a vector database (pgvector, Chroma, Pinecone, etc.)
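The indexing steps above can be sketched in a few lines of Python. This is a toy, in-memory version: `chunk_text` splits on words as a rough stand-in for token-aware chunking, and `VectorIndex` accepts any embedding function — in production you'd plug in a real embedding model and a real vector database. Both names are illustrative, not from any library:

```python
from typing import Callable, List

def chunk_text(text: str, chunk_size: int = 256, overlap: int = 32) -> List[str]:
    """Split text into word-based chunks of roughly chunk_size tokens,
    with a small overlap so sentences cut at a boundary survive in both
    neighboring chunks."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

class VectorIndex:
    """Minimal in-memory index: each chunk is stored next to its embedding."""
    def __init__(self, embed_fn: Callable[[str], List[float]]):
        self.embed_fn = embed_fn
        self.chunks: List[str] = []
        self.embeddings: List[List[float]] = []

    def add_document(self, text: str) -> None:
        # chunk -> embed -> store, one chunk at a time
        for chunk in chunk_text(text):
            self.chunks.append(chunk)
            self.embeddings.append(self.embed_fn(chunk))
```

The overlap matters more than it looks: without it, a sentence split across a chunk boundary is retrievable from neither half.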

Phase 2: Retrieval + Generation (runs on every query)

  • Embed the user's question using the same embedding model
  • Search for the top-k most semantically similar chunks via cosine similarity
  • Inject retrieved chunks into the LLM prompt as context
  • Generate a grounded, accurate response

Building a Basic RAG Pipeline in Python

Here's a minimal but complete implementation that demonstrates the core mechanics:

import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text):
    """Return the embedding vector for a piece of text."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def retrieve(query, chunks, embeddings, top_k=3):
    """Return the top_k chunks most semantically similar to the query."""
    query_emb = embed(query)
    scores = [cosine_similarity(query_emb, e) for e in embeddings]
    top_indices = np.argsort(scores)[-top_k:][::-1]  # highest scores first
    return [chunks[i] for i in top_indices]

def generate(query, context_chunks):
    """Answer the query using only the retrieved context."""
    context = "\n\n---\n\n".join(context_chunks)
    prompt = f"""You are a helpful assistant. Use only the context below to answer.
If the answer isn't in the context, say "I don't have that information."

Context:
{context}

Question: {query}
Answer:"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
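One line in `retrieve` deserves a closer look: `np.argsort(scores)[-top_k:][::-1]` sorts the indices by score ascending, keeps the last k (the highest scores), then reverses so the best-matching chunk comes first. A standalone check of the pattern:

```python
import numpy as np

scores = [0.12, 0.87, 0.55, 0.91, 0.40]
top_indices = np.argsort(scores)[-3:][::-1]
print(top_indices.tolist())  # → [3, 1, 2]: indices of the three highest scores, best first
```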

Common mistake

Notice the explicit instruction "Use only the context below." Without this, the model will blend retrieved context with its training knowledge — defeating the purpose of RAG. Always constrain the model to the provided context and have it admit when information isn't available.

RAG vs Fine-tuning: Which Should You Use?

This is the most common question I get. The answer depends on what you're actually trying to solve:

Use RAG when: your knowledge base changes frequently, you need the model to cite sources, you need to handle large volumes of documents, or you're building an internal knowledge base or document Q&A system.

Use fine-tuning when: you need to permanently change the model's tone/style/behavior, you have a very specific task with consistent input-output patterns, or you need sub-10ms latency with no retrieval step.

For most enterprise applications — internal wikis, customer support, document analysis — RAG is the right starting point. It's cheaper, faster to iterate on, doesn't require retraining, and lets you update your knowledge base without touching the model.

Advanced Patterns Worth Knowing

Hybrid search combines semantic vector search with traditional keyword search (BM25). This is critical for queries with specific technical terms, product names, or codes that embedding models might not capture well semantically. The combination almost always outperforms either approach alone.
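A hedged sketch of the fusion step, using Reciprocal Rank Fusion (RRF) to merge the two ranked lists. The keyword scorer here is a toy term-overlap count standing in for BM25; a real system would use a proper BM25 implementation and a vector index, and all function names here are illustrative:

```python
from collections import Counter

def keyword_scores(query, docs):
    """Toy keyword scorer (stand-in for BM25): count query-term occurrences."""
    terms = query.lower().split()
    return [sum(Counter(doc.lower().split())[t] for t in terms) for doc in docs]

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc ids. Each appearance at rank r
    contributes 1 / (k + r + 1), so a doc ranked well in either list wins."""
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            fused[doc_id] += 1.0 / (k + rank + 1)
    return [doc_id for doc_id, _ in fused.most_common()]

def hybrid_search(query, docs, vector_scores, top_k=3):
    """Fuse a keyword ranking with a (precomputed) vector-similarity ranking."""
    kw = keyword_scores(query, docs)
    kw_rank = sorted(range(len(docs)), key=lambda i: -kw[i])
    vec_rank = sorted(range(len(docs)), key=lambda i: -vector_scores[i])
    return reciprocal_rank_fusion([kw_rank, vec_rank])[:top_k]
```

The keyword leg is what rescues queries like "error E42": an embedding model may not place that exact code near anything, but exact term matching does.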

Re-ranking adds a second pass: after retrieving the top-20 chunks by similarity, a cross-encoder model re-ranks them by actual relevance to the query. This dramatically improves precision when your retrieval step casts a wide net.
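The two-stage pattern is simple to express. This sketch keeps the pair scorer pluggable: in practice `score_pair` would be a cross-encoder (for example, sentence-transformers' CrossEncoder, which scores query–document pairs jointly), not the toy word-overlap function shown here for illustration:

```python
def rerank(query, candidates, score_pair, top_k=3):
    """Second-stage re-ranking: score each (query, candidate) pair jointly,
    then keep the top_k highest-scoring candidates. The first stage casts a
    wide net (e.g. top-20 by cosine similarity); this stage restores precision."""
    return sorted(candidates, key=lambda doc: score_pair(query, doc), reverse=True)[:top_k]

# Toy pair scorer for illustration only; a production system would call a
# cross-encoder model here instead of counting shared words.
def overlap_score(query, doc):
    return len(set(query.lower().split()) & set(doc.lower().split()))
```

The key design point: a cross-encoder reads query and document together, so it can judge relevance far better than comparing two independently computed embeddings — at the cost of being too slow to run over the whole corpus, which is why it only sees the shortlist.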

Agentic RAG lets the LLM decide when to retrieve, what to query for, and how many retrieval steps to take. For multi-step reasoning tasks, this approach significantly outperforms a single-shot retrieve-and-generate pipeline. This is the direction the field is moving in 2025.
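In sketch form, an agentic loop hands control of retrieval to the model: at each step a decision function (in practice, the LLM prompted to emit either a search action or a final answer) chooses what to do next. Everything here — `decide_fn`, `search_fn`, the action tuples — is a hypothetical interface for illustration:

```python
def agentic_answer(question, search_fn, decide_fn, max_steps=3):
    """Agentic RAG loop: decide_fn either issues another search query or
    commits to a final answer, with at most max_steps retrieval rounds."""
    context = []
    for _ in range(max_steps):
        action, payload = decide_fn(question, context)
        if action == "answer":
            return payload
        context.extend(search_fn(payload))  # action == "search"
    # Retrieval budget exhausted: force a final answer from what we have.
    return decide_fn(question, context)[1]
```

The `max_steps` cap is the part single-shot pipelines don't need: once the model controls retrieval, you must bound how long it can keep searching.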

Oracle AI Vector Search

I recently earned the Oracle AI Vector Search Certified Professional certification, which covers production-grade vector embedding pipelines, similarity search optimization, and building RAG applications on Oracle Cloud. If you're deploying RAG in enterprise environments with existing Oracle infrastructure, OCI's native vector search capabilities are worth a serious look.

Key Takeaways

  • RAG = retrieval at inference time; your data stays fresh without retraining
  • Two phases: offline indexing (chunk → embed → store) and online retrieval+generation
  • Always instruct the model to stay within the provided context to prevent hallucination
  • Hybrid search (vector + BM25) beats pure vector search for most real-world queries
  • Choose RAG for dynamic knowledge; fine-tuning for permanent behavior changes
  • Agentic RAG is the frontier for complex, multi-step reasoning tasks


© 2026 Harshit Gupta · New Delhi, India