AI Engineering Bootcamp

RAG + Vector Databases

Build Smarter AI Apps with LangChain

Maven Session

Aki Wijesundara

Aki Wijesundara, PhD

Instructor

Today's Agenda

What we'll cover in this session

1

Why RAG Exists

The problem with LLMs and how RAG solves it

2

Embeddings Deep Dive

How text becomes vectors and why it matters

3

Vector Databases

Storing and searching semantic data at scale

4

Chunking & Retrieval

The make-or-break step most people get wrong

5

Live Build

Build a working Document Q&A system

What We're Building Today

By the end of this session, you will:

🧠

Understand RAG

How RAG works end-to-end

🔢

Master Embeddings

How vectors power semantic search

🛠️

Build a Q&A System

Working document Q&A with LangChain

Format: Theory → Architecture → Live Build → Evaluation

Why RAG Exists

LLMs are powerful but have critical limitations

📅

Knowledge Cutoff

They don't know what happened after training

🎭

Hallucinations

They confidently make things up

🔒

No Access to YOUR Data

Can't read your docs, PDFs, Slack, databases

The question: How do we ground LLM responses in real, up-to-date, trusted information?

What is RAG?

Retrieval-Augmented Generation

1. RETRIEVE: Get relevant info from your data
2. AUGMENT: Add context to the prompt
3. GENERATE: LLM produces grounded answer

Think of it like an open-book exam — the LLM doesn't need to memorize everything; it just needs to look up the right pages.
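A minimal sketch of those three steps in code, assuming a vectorstore and an llm object already exist (both are placeholders here; the real setup is built later in this session):

# 1. RETRIEVE, 2. AUGMENT, 3. GENERATE in a few lines
def answer_with_rag(question: str) -> str:
    # 1. Retrieve: find the chunks most similar to the question
    docs = vectorstore.similarity_search(question, k=3)

    # 2. Augment: put the retrieved text into the prompt as context
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"

    # 3. Generate: the LLM answers, grounded in the retrieved context
    return llm.invoke(prompt).content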

RAG vs Fine-Tuning vs Prompt Engineering

Choose the right approach for your use case

Approach           | When to Use                    | Cost   | Freshness
Prompt Engineering | Small context, simple tasks    | Low    | Manual updates
RAG                | Large/changing knowledge bases | Medium | Real-time
Fine-Tuning        | Changing model behavior/tone   | High   | Requires retraining

Key insight: RAG handles knowledge, fine-tuning handles behavior — they're not either/or.

RAG Pipeline — Indexing

The offline step: prepare your knowledge base for search

RAG Indexing Pipeline

Indexing Steps

1. Load documents (PDF, web, etc.)

2. Split into chunks

3. Embed each chunk

4. Store in Vector DB

One-time setup — run once per document update

Source: LangChain Documentation

RAG Pipeline — Retrieval & Generation

The runtime step: answer questions using your knowledge

RAG Retrieval and Generation Pipeline

Query Steps

1. User asks question

2. Embed the query

3. Search vector DB

4. Retrieve top K chunks

5. Augment prompt

6. LLM generates answer

Every query — runs in real-time

Source: LangChain Documentation

RAG Architecture — Complete Pipeline

Combining indexing and retrieval into a unified system

Complete RAG Architecture

Indexing (Offline)

Load → Chunk → Embed → Store in Vector DB

Retrieval (Runtime)

Query → Embed → Search → Retrieve Top K

Generation

Augment Prompt → LLM → Grounded Answer

Key insight: Same embedding model for both indexing and querying!

Let's Talk About Embeddings

A numerical representation of text that captures semantic meaning

# Similar meanings → Similar vectors
"The cat sat on the mat"     → [0.12, -0.34, 0.56, ...]
"A kitten rested on the rug" → [0.11, -0.33, 0.55, ...] ← Very similar!

# Different meanings → Different vectors
"Stock prices fell sharply"  → [0.87, 0.22, -0.91, ...] ← Very different

Why it matters: Similar meanings → similar vectors → we can search by meaning, not just keywords.
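A quick way to see this for yourself: embed the sentences above and compare them with cosine similarity (a minimal sketch; assumes an OpenAI API key is set):

import numpy as np
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cat = embeddings.embed_query("The cat sat on the mat")
kitten = embeddings.embed_query("A kitten rested on the rug")
stocks = embeddings.embed_query("Stock prices fell sharply")

print(cosine(cat, kitten))   # high score: similar meaning
print(cosine(cat, stocks))   # much lower score: different meaning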

Embeddings — Visual Intuition

Imagine a space where every sentence has a position

[Diagram: embedding space with a "Cooking" cluster and a "Finance" cluster; "Budget for groceries" sits between the clusters]

Semantic search: Your query gets placed in this space and we find the nearest neighbors.

Popular Embedding Models

Choose based on your use case

Model                  | Dimensions | Provider    | Best For
text-embedding-3-small | 1536       | OpenAI      | Cost-effective, general purpose
text-embedding-3-large | 3072       | OpenAI      | Higher accuracy tasks
voyage-3               | 1024       | Voyage AI   | Code & technical docs
all-MiniLM-L6-v2       | 384        | HuggingFace | Local/free, lightweight

Tradeoff: More dimensions = more nuance but higher storage & latency.

Chunking — The Hidden Make-or-Break Step

Before embedding, you need to split documents into chunks

Chunking Explained
📏

Token Limits

Models have limits (512–8192 tokens)

🎯

Precision

Smaller chunks = precise retrieval

⚖️

Balance

Too large = diluted; too small = lost context

Chunking Strategies

Different approaches for different use cases

Strategy       | How It Works                            | Best For
Fixed-size     | Split every N characters/tokens         | Simple, predictable
Recursive      | Split by paragraphs → sentences → words | General-purpose (LangChain default)
Semantic       | Split when meaning shifts               | High-quality but slower
Document-aware | Split by headers, sections, pages       | Structured docs (Markdown, HTML)

Pro tip: Always use overlap (e.g., 200 chars) between chunks so you don't lose context at boundaries.
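To see overlap in action, here is a tiny sketch (deliberately small sizes for the demo; a real pipeline uses something like chunk_size=1000, chunk_overlap=200):

from langchain_text_splitters import RecursiveCharacterTextSplitter

text = (
    "RAG grounds answers in retrieved context. "
    "Chunking decides what can be retrieved. "
    "Overlap keeps sentences near a boundary intact in at least one chunk."
)

splitter = RecursiveCharacterTextSplitter(chunk_size=80, chunk_overlap=20)
for chunk in splitter.split_text(text):
    print(repr(chunk))
# Adjacent chunks share roughly 20 characters, so text at a cut point
# still appears complete in at least one chunk.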

What is a Vector Database?

Specialized storage optimized for high-dimensional vectors

📊 Traditional DB

SELECT * FROM docs
WHERE title = 'RAG'

Exact match only

🔮 Vector DB

# Find 5 most similar
db.similarity_search(
  query_vector, k=5
)

Semantic match

Uses algorithms like HNSW or IVF to search millions of vectors in milliseconds.
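Here is a minimal, self-contained sketch of semantic search using the raw chromadb client with its built-in default embedding function (the sample texts are made up for illustration):

import chromadb

client = chromadb.Client()                      # in-memory instance
collection = client.create_collection("docs")

collection.add(
    ids=["1", "2", "3"],
    documents=[
        "RAG grounds LLM answers in retrieved context.",
        "Cats are popular household pets.",
        "Vector databases index high-dimensional embeddings.",
    ],
)

# Semantic match: no keyword overlap with the stored texts is required
results = collection.query(query_texts=["How do I store embeddings?"], n_results=2)
print(results["documents"])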

Vector DB Landscape

Pick the right tool for your needs

ChromaDB

Open source, embedded. Great for prototyping.

Pinecone

Managed cloud. Serverless, easy to scale.

Weaviate

Hybrid search (vector + keyword).

Qdrant

Filtering + payload support.

pgvector

Postgres extension. Use existing DB.

FAISS

Meta library. Fast local search.

For today's build: We'll use ChromaDB (simple, local, zero config)

Similarity Search — How It Works

Cosine Similarity — The most common metric

Similarity Search Explained

Measures the angle between vectors

-1.0 = Opposite   |   0.0 = Unrelated   |   1.0 = Identical

Euclidean (L2)

Straight-line distance between points.

Dot Product

Similar to cosine but scale-sensitive.
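A toy comparison of the three metrics on 2-D vectors (illustrative only; real embeddings have hundreds or thousands of dimensions):

import numpy as np

a = np.array([1.0, 1.0])
b = np.array([2.0, 2.0])   # same direction as a, larger magnitude

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)
dot = a @ b

print(cosine)     # 1.0   -> identical direction; magnitude ignored
print(euclidean)  # ~1.41 -> nonzero, because the points sit apart
print(dot)        # 4.0   -> grows with magnitude (scale-sensitive)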

The Retrieval Step — Beyond Naive Search

Basic: embed query → cosine search → return top K. Here's how to do better:

🏷️

Metadata Filtering

Filter by date, source, category before vector search (see the sketch after this list)

🔀

Hybrid Search

Combine vector similarity with keyword matching (BM25)

🔄

Multi-query

Rephrase question multiple ways, search each, merge results

⬆️

Reranking

Use a cross-encoder to re-score the top results
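As a concrete example of the first improvement, metadata filtering with Chroma looks roughly like this (a sketch; vectorstore is created in the live build below, and the "source" key is whatever metadata your loader attaches):

results = vectorstore.similarity_search(
    "What are the key findings?",
    k=3,
    filter={"source": "your_document.pdf"},  # only consider chunks from this file
)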

Prompt Augmentation — The Glue

Format retrieved chunks into the LLM prompt

# The RAG prompt template
prompt = """
You are a helpful assistant. Answer the question based ONLY on
the following context. If the context doesn't contain the answer,
say "I don't have enough information to answer that."

Context:
{retrieved_chunks}

Question: {user_question}
"""

Be Explicit

"Use ONLY this context"

Source Attribution

Ask for citations

Handle Unknown

"I don't know" reduces hallucination

Let's Build — Architecture Overview

What we're building: A Document Q&A system

PDF/Text Files → [Load] LangChain Document Loaders → [Chunk] RecursiveCharacterTextSplitter → [Embed] OpenAI Embeddings → [Store] ChromaDB → [Query] RAG Agent → Answer

Setup & Document Loading

Install dependencies and load your documents

# Install dependencies
pip install langchain langchain-openai langchain-community chromadb
# Load documents
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("your_document.pdf")
documents = loader.load()

LangChain supports 80+ loaders: PDF, CSV, Notion, Slack, URLs, Google Drive, etc.

Chunking

Split documents into manageable pieces

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", " ", ""]
)

chunks = text_splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks")

Why RecursiveCharacterTextSplitter? It tries to split at natural boundaries (paragraphs → sentences → words).

Embed & Store

Convert chunks to vectors and store in ChromaDB

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

1. Chunks → Embedding API

2. Vectors → ChromaDB

3. Persisted to disk

Query & Retrieve

Test your retrieval before wiring up the LLM

# Simple similarity search
results = vectorstore.similarity_search(
    "What are the key findings?",
    k=3
)

for doc in results:
    print(doc.page_content[:200])
    print(doc.metadata)
    print("---")

⚠️ Test this first! If retrieval is bad, generation will be bad — garbage in, garbage out.
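If you also want to see how confident the matches are, Chroma can return a distance-style score alongside each document (lower generally means more similar):

# Sanity-check retrieval quality with scores
scored = vectorstore.similarity_search_with_score("What are the key findings?", k=3)
for doc, score in scored:
    print(round(score, 3), doc.page_content[:80])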

The RAG Agent

Connect retrieval to the LLM using modern agent pattern

from langchain_openai import ChatOpenAI
from langchain.agents import create_agent
from langchain.tools import tool

model = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Create retrieval tool
@tool(response_format="content_and_artifact")
def retrieve_context(query: str):
    """Retrieve information from documents."""
    docs = vectorstore.similarity_search(query, k=3)
    serialized = "\n\n".join(
        f"Source: {doc.metadata}\nContent: {doc.page_content}"
        for doc in docs
    )
    return serialized, docs

# Create agent
tools = [retrieve_context]
agent = create_agent(model, tools, system_prompt="Use the tool to answer queries.")

# Query
for event in agent.stream(
    {"messages": [{"role": "user", "content": "What is RAG?"}]},
    stream_mode="values"
):
    print(event["messages"][-1].content)

Modern pattern: Agent decides when to retrieve, supports multiple searches per query

Common RAG Failure Modes

Know the symptoms and fixes

Problem            | Symptom                           | Fix
Bad chunking       | Half-relevant text                | Adjust chunk size, try semantic splitting
Wrong K value      | Too much noise or missing context | Experiment with K=3 to K=10
Embedding mismatch | Irrelevant results                | Try a different embedding model
Prompt leakage     | LLM ignores context               | Strengthen grounding instructions
Boundary issues    | Answer split across chunks        | Increase chunk overlap

Evaluating Your RAG System

You can't improve what you can't measure

🎯

Retrieval Relevance

Are the right chunks being returned?

📎

Answer Faithfulness

Is the answer grounded in the context?

Answer Correctness

Is the answer factually right?

Quick evaluation approach (a minimal code sketch follows):

  1. Create 10-20 test questions with known answers
  2. Run them through your pipeline
  3. Check: Right chunks retrieved? Correct answer?

Tools: RAGAS, LangSmith, manual review
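A minimal sketch of that quick check (the questions and expected keywords are hypothetical; replace them with ones from your own documents):

test_cases = [
    {"question": "What are the key findings?", "expected": "revenue"},
    {"question": "Who authored the report?",   "expected": "Smith"},
]

for case in test_cases:
    docs = vectorstore.similarity_search(case["question"], k=3)
    retrieved_text = " ".join(doc.page_content for doc in docs)
    hit = case["expected"].lower() in retrieved_text.lower()
    print(f"{case['question']!r}: retrieval {'OK' if hit else 'MISS'}")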

Level Up — What's Next

Once your basic RAG works, explore:

🔀 Hybrid Search

Combine BM25 keyword search with vector search (see the sketch after this list)

⬆️ Reranking

Use a cross-encoder (e.g., Cohere Rerank) to re-score results

🖼️ Multi-modal RAG

Embed images, tables, charts alongside text

🤖 Agentic RAG

Let the LLM decide when and what to retrieve
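A hedged sketch of the hybrid-search idea using LangChain's BM25Retriever and EnsembleRetriever (needs pip install rank_bm25; import paths can differ slightly between LangChain versions; chunks and vectorstore come from the build above):

from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

keyword_retriever = BM25Retriever.from_documents(chunks)             # classic keyword scoring
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 3})  # semantic search

hybrid = EnsembleRetriever(
    retrievers=[keyword_retriever, vector_retriever],
    weights=[0.4, 0.6],   # tune the keyword/vector balance for your data
)
docs = hybrid.invoke("What is RAG?")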

Key Takeaways

The essential points to remember

1

RAG = Retrieval + Augmentation + Generation — it's an open-book exam for LLMs

2

Embeddings capture meaning — enabling search by concept, not just keywords

3

Chunking is critical — bad chunks = bad retrieval = bad answers

4

Always test retrieval first — before wiring up generation

5

Start simple, iterate — basic RAG with good chunking beats complex RAG with bad data

Homework — Build a RAG System on YOUR Data

The task: Build a working document Q&A pipeline using real data you care about — not a tutorial dataset.

6 Required Steps:

  1. Load your documents (PDF, CSV, web pages — your choice)
  2. Chunk with RecursiveCharacterTextSplitter (try 2 different chunk sizes)
  3. Embed & store in ChromaDB
  4. Test retrieval FIRST — check if the right chunks come back before touching the LLM
  5. Wire up the full RAG agent using create_agent with @tool
  6. Evaluate — 5 questions you know the answer to, score retrieval + faithfulness + correctness

Stretch Goals (pick 1+):

  • Chunk size comparison
  • Hybrid search
  • Metadata filtering
  • Streamlit UI
  • Multi-document ingestion

What to submit: Working notebook + short README with your observations — including what broke.

💬

Questions?

Let's discuss what we've learned today

Resources

  • LangChain Docs: python.langchain.com
  • ChromaDB Docs: docs.trychroma.com
  • Embedding Leaderboard: huggingface.co/spaces/mteb/leaderboard
  • RAGAS (Evaluation): docs.ragas.io

Your challenge: Build a RAG system on YOUR data this week!