Build a Semantic Search Engine in 30 Minutes
A step-by-step guide to building a working semantic search engine over your own document corpus using sentence-transformers and FAISS, with a clear path to production.
The Architecture in Plain English
A semantic search engine has two phases: ingestion and retrieval. During ingestion, you take your document corpus, convert each document or chunk into an embedding vector, and store those vectors in an index that supports fast similarity search. During retrieval, you convert the user's query into an embedding using the same model, then find the stored vectors closest to it.
The key insight is that both the documents and the query pass through the same embedding model. The model maps everything into the same high-dimensional space, so documents and queries that are semantically similar end up near each other regardless of surface-level word overlap. A query about "reducing infrastructure spend" can surface documents about "cutting cloud costs" even though the two phrases share no content words.
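The two phases can be sketched in a few lines of plain Python. This is only a toy: the embed function below is a hypothetical stand-in that builds a normalized bag-of-words vector, so it matches on shared words rather than meaning, and TinyIndex is an illustrative name, not part of any library. What it does show is the shape of the system: one embed function used for both documents and queries, an ingestion step that stores vectors, and a retrieval step that ranks by similarity.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in embedding: a sparse, L2-normalized bag-of-words vector.

    NOT semantic. A real system would call an embedding model here.
    """
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {word: c / norm for word, c in counts.items()}


def dot(a, b):
    """Inner product of two sparse vectors stored as dicts."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


class TinyIndex:
    """Brute-force index: keeps every vector and scans them all per query."""

    def __init__(self):
        self.vectors = []
        self.docs = []

    def add(self, docs):
        # Ingestion: embed each document and store the vector next to it.
        for doc in docs:
            self.vectors.append(embed(doc))
            self.docs.append(doc)

    def search(self, query, top_k=3):
        # Retrieval: embed the query with the SAME function, then rank.
        q = embed(query)
        scored = [(dot(q, v), d) for v, d in zip(self.vectors, self.docs)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


index = TinyIndex()
index.add(["cutting cloud costs", "hiring data engineers", "cloud migration checklist"])
for score, doc in index.search("cloud costs", top_k=2):
    print("{:.3f}  {}".format(score, doc))
```

Swapping embed for a real model and TinyIndex for FAISS gives the pipeline built in the rest of this walkthrough.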
In this walkthrough you will build a working semantic search engine using sentence-transformers for embeddings and FAISS for vector indexing. FAISS is a library from Meta that provides fast similarity search on dense vectors. It runs entirely in memory and handles datasets up to a few million vectors efficiently. For production at larger scale, you would replace FAISS with a persistent vector database like Qdrant, but FAISS is ideal for learning the core mechanics without infrastructure overhead.
Embedding Your Documents with sentence-transformers
The sentence-transformers library provides access to a wide range of pre-trained embedding models. The model all-MiniLM-L6-v2 is a good default: it produces 384-dimensional embeddings, runs fast on CPU, and has strong semantic quality for English text. For multilingual use cases, paraphrase-multilingual-MiniLM-L12-v2 covers 50 languages with similar speed.
For the ingestion pipeline, load your documents, encode them all at once (batch encoding is significantly faster than encoding one at a time), add the resulting NumPy array to the index, and save your document metadata alongside it. We use IndexFlatIP (inner product) rather than IndexFlatL2 (L2 distance): once the vectors are L2-normalized, the inner product of two vectors equals their cosine similarity, so the scores FAISS returns can be read directly as cosine similarities.
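import json

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Load the embedding model (downloaded on first use).
model = SentenceTransformer("all-MiniLM-L6-v2")

# documents.json: a list of objects with "content" and "title" fields.
with open("documents.json", "r") as f:
    documents = json.load(f)

# Encode every document in one batched call.
texts = [doc["content"] for doc in documents]
embeddings = model.encode(texts, show_progress_bar=True)
embeddings = np.array(embeddings).astype("float32")  # FAISS expects float32

# Normalize in place so that inner product equals cosine similarity.
faiss.normalize_L2(embeddings)

# Build an exact (brute-force) inner-product index and add the vectors.
dimension = embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)
index.add(embeddings)

# Persist the index and the document store; row i of the index is documents[i].
faiss.write_index(index, "docs.index")
with open("doc_store.json", "w") as f:
    json.dump(documents, f)

print("Indexed {} documents, dimension {}".format(len(documents), dimension))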
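The claim that inner product on L2-normalized vectors equals cosine similarity is easy to check by hand. The following is a small dependency-free sketch; the helper names and the two example vectors are made up for illustration, and the helpers assume non-zero vectors.

```python
import math


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    # Scale v to unit length (assumes v is not the zero vector).
    norm = math.sqrt(inner_product(v, v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    # Textbook definition: dot product divided by the product of the norms.
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

cos = cosine_similarity(a, b)
ip = inner_product(l2_normalize(a), l2_normalize(b))
print(cos, ip)  # the two values agree up to floating-point rounding
```

Normalizing once at ingestion moves the division by the norms out of the search loop, which is why the index only needs to compute plain inner products at query time.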
The documents.json file should be a list of objects, each with at minimum a "content" field containing the text to embed and a "title" field for display in search results. Chunk long documents into passages of 200 to 400 words before embedding: embedding entire long documents degrades semantic quality because the model must compress too much content into a single fixed-size vector.
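One simple way to do the chunking is to split on whitespace and emit fixed-size word windows. The sketch below is an assumption about how you might do it, not a prescribed method: the function names, the default of 300 words, the small overlap between neighboring chunks, and the extra "chunk" field are all illustrative choices.

```python
def chunk_words(text, max_words=300, overlap=30):
    """Split text into passages of at most max_words words.

    Each chunk repeats the last `overlap` words of the previous one, so a
    sentence that straddles a boundary appears whole in at least one chunk.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this chunk already reached the end of the text
    return chunks


def chunk_document(doc, **kwargs):
    """Expand one {"title", "content"} document into one record per chunk."""
    return [
        {"title": doc["title"], "content": chunk, "chunk": i}
        for i, chunk in enumerate(chunk_words(doc["content"], **kwargs))
    ]
```

Running chunk_document over every entry and flattening the results gives a list with the same "content" and "title" fields the ingestion script expects. Splitting on sentence or paragraph boundaries usually produces cleaner passages, at the cost of more code.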
Querying with Cosine Similarity
The search function mirrors the ingestion pipeline exactly. Encode the query using the same model, normalize the query vector, and call index.search. FAISS returns the top-k most similar document indices and their similarity scores. The scores from IndexFlatIP after L2 normalization are cosine similarities, so they range from -1 to 1, where 1 means the two vectors point in the same direction. As a rough starting point, treat scores above 0.8 as strong matches and scores below 0.5 as weak ones, but expect these cutoffs to shift with the embedding model and the corpus.
import json

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Use the same model that produced the indexed embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")
index = faiss.read_index("docs.index")
with open("doc_store.json", "r") as f:
    documents = json.load(f)

def search(query, top_k=5, min_score=0.5):
    # Encode and normalize the query exactly as the documents were.
    query_embedding = model.encode([query])
    query_embedding = np.array(query_embedding).astype("float32")
    faiss.normalize_L2(query_embedding)

    # scores and indices both have shape (1, top_k), best match first.
    scores, indices = index.search(query_embedding, top_k)
    results = []
    for score, idx in zip(scores[0], indices[0]):
        # FAISS pads with index -1 when the index holds fewer than top_k vectors.
        if idx != -1 and float(score) >= min_score:
            results.append({
                "title": documents[idx]["title"],
                "score": float(score),
                "excerpt": documents[idx]["content"][:200],
            })
    return results

results = search("what is retrieval augmented generation?")
for r in results:
    print("[{:.4f}] {}".format(r["score"], r["title"]))
The min_score threshold is a practical addition: without it, FAISS always returns exactly top_k results even when none of them are meaningfully similar to the query. Setting a minimum score of 0.5 prevents the search from returning irrelevant results for queries that have no good match in your index. Calibrate this threshold on a sample of representative queries and their known-good results.
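Calibration can be as simple as collecting (score, relevant?) pairs from a handful of representative queries and picking the cutoff that scores best on them. The sketch below uses F1 as the selection criterion, which is an assumption on our part; you may prefer to favor precision or recall depending on your application. The function name and the sample data are hypothetical.

```python
def calibrate_min_score(labeled_scores):
    """Pick the score cutoff that maximizes F1 on labeled search results.

    labeled_scores: list of (score, is_relevant) pairs gathered by running
    representative queries and judging each returned document by hand.
    Returns (best_threshold, best_f1), or (None, -1.0) if nothing is relevant.
    """
    total_relevant = sum(1 for _, relevant in labeled_scores if relevant)
    best_threshold, best_f1 = None, -1.0
    # Every observed score is a candidate cutoff (results with score >= t are kept).
    for t in sorted({score for score, _ in labeled_scores}):
        kept = [relevant for score, relevant in labeled_scores if score >= t]
        true_positives = sum(kept)
        if true_positives == 0:
            continue
        precision = true_positives / len(kept)
        recall = true_positives / total_relevant
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_threshold, best_f1 = t, f1
    return best_threshold, best_f1


# Made-up judgments for illustration only.
sample = [
    (0.82, True), (0.74, True), (0.61, True), (0.55, False),
    (0.48, True), (0.40, False), (0.31, False),
]
threshold, f1 = calibrate_min_score(sample)
print(threshold, round(f1, 3))
```

With only a few labeled pairs the chosen cutoff is noisy, so treat the result as a starting value for min_score and revisit it as you collect more judgments.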
Extending to a Production Vector Database
FAISS keeps its index in memory. If your process restarts, you lose the index unless you have saved it to disk and reload it. It also has no built-in metadata filtering or multi-tenancy, and while you can append vectors with index.add, updating or deleting entries and keeping the document store in sync is left to you. For production workloads, you will want to migrate to a persistent vector database.
The migration from FAISS to Qdrant is straightforward. Create a Qdrant collection with the same vector dimension and a matching distance metric (cosine, to match the normalized inner-product setup used here). Upsert your documents as PointStruct objects, each carrying both the embedding and your metadata payload, and then replace your FAISS index.search call with a Qdrant client.search call (recent qdrant-client releases also offer query_points for the same purpose, so check the version you have installed). The embedding model and the overall search logic stay the same. The vector database handles persistence, payload filtering, and horizontal scaling for you.
Once on a production vector database, you can add metadata filtering (search only within a specific category or date range), multi-tenant isolation (each user's documents filtered by user_id payload), and real-time updates (upsert new documents without rebuilding the entire index). These are the capabilities that justify the infrastructure cost, and they are exactly where FAISS falls short at scale.
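As a rough sketch of what the migration and a filtered search might look like with the qdrant-client Python package: the function names, the collection name you pass in, and the "category" payload key are assumptions for illustration, and the Qdrant imports are done lazily inside the functions so the pure-Python helper can be used without the package installed. Treat this as a starting point and check the calls against the client version you run.

```python
def build_points(documents, embeddings):
    """Pair each document with its vector as plain dicts (id, vector, payload)."""
    if len(documents) != len(embeddings):
        raise ValueError("documents and embeddings must have the same length")
    return [
        {"id": i, "vector": [float(x) for x in vec], "payload": dict(doc)}
        for i, (doc, vec) in enumerate(zip(documents, embeddings))
    ]


def migrate_to_qdrant(client, collection, documents, embeddings):
    """Create a cosine-distance collection and upsert every document into it."""
    # Lazy import: requires the qdrant-client package.
    from qdrant_client.models import Distance, PointStruct, VectorParams

    points = build_points(documents, embeddings)
    client.create_collection(
        collection_name=collection,
        vectors_config=VectorParams(
            size=len(points[0]["vector"]), distance=Distance.COSINE
        ),
    )
    client.upsert(
        collection_name=collection,
        points=[PointStruct(**p) for p in points],
    )


def search_by_category(client, collection, query_vector, category, top_k=5):
    """Vector search restricted to points whose payload 'category' matches."""
    from qdrant_client.models import FieldCondition, Filter, MatchValue

    category_filter = Filter(
        must=[FieldCondition(key="category", match=MatchValue(value=category))]
    )
    return client.search(
        collection_name=collection,
        query_vector=query_vector,
        query_filter=category_filter,
        limit=top_k,
    )
```

Here client would be a QdrantClient instance, documents the list loaded from doc_store.json, and embeddings the same normalized array that was added to the FAISS index. The query vector comes from the same model.encode call used in the FAISS search function.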
Prompt
"I have a working FAISS semantic search prototype. Walk me through migrating it to Qdrant. I need to preserve all existing metadata fields (title, author, date, category) and add metadata filtering by category and date range. Show me the updated ingestion and search code side-by-side with my existing FAISS code."
Aki Wijesundara