RAG From Zero: Build a Bot That Answers Your Docs

Retrieval-Augmented Generation (RAG) is the technique behind most "chat with your documents" features. The idea is simple: instead of fine-tuning a model on your data, you store your data in a searchable format, retrieve the relevant pieces at query time, and pass them to the model as context. The model answers the question using the retrieved text, grounding its response in your actual content. Here is how to build it from scratch in Python with a handful of libraries.

Load and Chunk Your Document

The first step is getting your document into text form and splitting it into chunks. Chunks should be small enough to be semantically focused (one topic per chunk) but large enough to contain full sentences and useful context. A chunk size of 300 to 500 words with a 50-word overlap between adjacent chunks is a reliable starting point.

import pypdf

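def load_pdf(file_path: str) -> str:
    """Extract the text of every page and join the pages with newlines."""
    reader = pypdf.PdfReader(file_path)
    # Fall back to an empty string for pages with no extractable text
    pages = [page.extract_text() or "" for page in reader.pages]
    return "\n".join(pages)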

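def chunk_text(text: str, chunk_size: int = 400, overlap: int = 50) -> list:
    """Split text into chunks of chunk_size words that share `overlap` words with their neighbor."""
    if not 0 <= overlap < chunk_size:
        # An overlap as large as the chunk would stop the window from advancing
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break  # Last chunk reached; skip a tail made only of overlap words
        start += chunk_size - overlap
    return chunks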

# Usage
raw_text = load_pdf("product_manual.pdf")
chunks = chunk_text(raw_text)
print("Created {} chunks from {} words".format(len(chunks), len(raw_text.split())))
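To see where the chunk boundaries fall, here is a minimal sketch that reports word offsets instead of text. The helper name `chunk_spans` is hypothetical and not part of the code above; it assumes the same sliding window, advancing by `chunk_size - overlap` words each step and stopping once a window reaches the end of the text.

```python
def chunk_spans(n_words: int, chunk_size: int = 400, overlap: int = 50) -> list:
    """Return (start, end) word offsets for each chunk of a sliding window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    spans = []
    start = 0
    while start < n_words:
        end = min(start + chunk_size, n_words)
        spans.append((start, end))
        if end == n_words:
            break  # The window has reached the end of the text
        start += chunk_size - overlap  # Stride of 350 words with the defaults
    return spans

# A 1000-word document with the default settings
spans = chunk_spans(1000)
for start, end in spans:
    print(start, end)
```

With the defaults, each chunk starts 350 words after the previous one, so adjacent chunks share 50 words. A sentence that straddles a boundary therefore appears whole in at least one chunk as long as it is shorter than the overlap.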

Embed and Store

Each chunk needs to be converted into a numerical vector (an embedding) that captures its semantic meaning. You then store these vectors so you can search them quickly. For a production system, you would use a vector database like Pinecone or Chroma. For a beginner build, a numpy array in memory is sufficient and requires no infrastructure setup.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(chunks: list) -> tuple:
    """Embed all chunks and return (embeddings array, original chunks)."""
    print("Embedding {} chunks...".format(len(chunks)))
    embeddings = model.encode(chunks, show_progress_bar=True, normalize_embeddings=True)
    return np.array(embeddings), chunks

def retrieve(query: str, embeddings: np.ndarray, chunks: list, top_k: int = 3) -> list:
    """Return the top_k most relevant chunks for a query."""
    query_embedding = model.encode([query], normalize_embeddings=True)
    # Dot product of normalized vectors equals cosine similarity
    scores = np.dot(embeddings, query_embedding.T).flatten()
    top_indices = np.argsort(scores)[::-1][:top_k]
    return [chunks[i] for i in top_indices]

# Build the index once, then reuse it for all queries
embeddings, chunks = build_index(chunks)
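The retrieval step is easier to reason about on vectors small enough to check by hand. The sketch below uses made-up 2-D vectors in place of real embeddings, and `top_k_by_cosine` is a hypothetical helper, not part of the code above. It normalizes the vectors itself, then ranks by dot product, which is the same idea `retrieve` relies on.

```python
import numpy as np

def top_k_by_cosine(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 2) -> list:
    """Return (index, score) pairs for the k rows of doc_vecs closest to query_vec."""
    # Scale everything to unit length so the dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    # argsort is ascending, so reverse it and keep the first k indices
    order = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in order]

# Toy "embeddings" with made-up values, purely for illustration
doc_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query_vec = np.array([2.0, 0.1])

results = top_k_by_cosine(query_vec, doc_vecs, k=2)
for index, score in results:
    print(index, round(score, 3))
```

Returning the scores alongside the indices is a design choice worth considering for the real index too: a low top score can hint that the document does not cover the question at all.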

Retrieve and Answer

At query time, embed the user's question, retrieve the most relevant chunks, and pass them to Claude as context. The prompt instructs the model to answer from the retrieved content and to say so when the answer is not there, which keeps the response grounded in your actual documents.

import anthropic

client = anthropic.Anthropic()

def answer_question(question: str, embeddings: np.ndarray, chunks: list) -> str:
    # Step 1: Retrieve relevant context
    relevant_chunks = retrieve(question, embeddings, chunks, top_k=3)
    context = "\n\n---\n\n".join(relevant_chunks)

    # Step 2: Generate answer grounded in context
    prompt = """Use the following context to answer the question.
If the answer is not in the context, say so clearly.

Context:
{}

Question: {}""".format(context, question)

    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

# Ask anything about your document
answer = answer_question(
    "What is the warranty period for the product?",
    embeddings,
    chunks
)
print(answer)

Prompt

"You are a helpful assistant that answers questions strictly based on the provided context. If the context does not contain enough information to fully answer the question, state what is known from the context and what information is missing. Never make up facts not present in the context."
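The prompt above reads like a system prompt. One way to use it, sketched here under that assumption, is to pass it through the `system` parameter of the Anthropic Messages API and keep the retrieved context and question in the user message. `build_request` is a hypothetical helper that only assembles the keyword arguments, so you can inspect it without making an API call. The sample context string is made up for illustration.

```python
SYSTEM_PROMPT = (
    "You are a helpful assistant that answers questions strictly based on "
    "the provided context. If the context does not contain enough information "
    "to fully answer the question, state what is known from the context and "
    "what information is missing. Never make up facts not present in the context."
)

def build_request(question: str, context: str, model: str = "claude-haiku-4-5") -> dict:
    """Assemble keyword arguments for client.messages.create (no network call here)."""
    user_message = "Context:\n{}\n\nQuestion: {}".format(context, question)
    return {
        "model": model,
        "max_tokens": 512,
        "system": SYSTEM_PROMPT,  # Standing instructions go here, not in the user turn
        "messages": [{"role": "user", "content": user_message}],
    }

# Made-up context, standing in for the joined retrieved chunks
request = build_request(
    "What is the warranty period for the product?",
    "The warranty lasts 12 months.",
)
print(request["messages"][0]["content"])
# With an anthropic.Anthropic() client: response = client.messages.create(**request)
```

Keeping the instructions in `system` means the user message carries only the context and the question, which makes it simpler to change either one independently.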

This is a complete, working RAG system in under 80 lines. The dependencies are pypdf, sentence-transformers, numpy, and anthropic. No hosted vector database, no LangChain, no complex orchestration. Start here, get it working end to end, and then add complexity only where your use case demands it.

Aki Wijesundara
