Index Video and Audio for AI Retrieval

The Problem with Video and Audio Content

Video and audio are the dark matter of enterprise knowledge. A company might have hundreds of recorded team meetings, customer calls, training sessions, and product demos, with no way to find what was said in any of them. Standard search does not work on media files. You need to extract the text, index it, and retrieve it semantically.

The good news: the full pipeline covering transcription, chunking, embedding, and retrieval can be built in Python in an afternoon.

Transcribing with Whisper and Chunking by Time

OpenAI's Whisper model runs locally and returns a start and end timestamp for every transcribed segment (and for every word, when word_timestamps=True is set). Those timestamps are what let you retrieve a specific clip rather than just a document.

import whisper
import json

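def transcribe_with_timestamps(audio_path, chunk_duration=30):
    # "base" is a small, fast model; larger Whisper models are slower but more accurate.
    model = whisper.load_model("base")
    # word_timestamps=True adds per-word timings; only segment-level times are used below.
    result = model.transcribe(audio_path, word_timestamps=True)

    chunks = []
    for segment in result["segments"]:
        # Assign each segment to the fixed window that contains its start time.
        chunk_idx = int(segment["start"] // chunk_duration)
        # Create any windows not seen yet, including silent ones (filtered out below).
        while len(chunks) <= chunk_idx:
            start = len(chunks) * chunk_duration
            chunks.append({
                "chunk_id": len(chunks),
                "source_file": audio_path,  # lets a search hit name its recording
                "start_sec": start,
                "end_sec": start + chunk_duration,
                "text": ""
            })
        chunks[chunk_idx]["text"] += " " + segment["text"].strip()

    # Drop empty windows and trim the leading space left by concatenation.
    return [{**c, "text": c["text"].strip()} for c in chunks if c["text"].strip()]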

chunks = transcribe_with_timestamps("meeting_recording.mp4")
with open("chunks.json", "w") as f:
    json.dump(chunks, f, indent=2)
print(f"Produced {len(chunks)} chunks with timestamps")

Each chunk records its start and end time so search results link back to the exact moment in the recording, not just the file.
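To make the bucketing rule concrete, here is a minimal sketch that applies the same start-time arithmetic to a few made-up segments, with no Whisper call. The segment values and the helper name `bucket_segments` are illustrative assumptions, not output from a real recording.

```python
# Hypothetical segments in the shape Whisper returns (start, end, text).
# The values are invented for illustration.
segments = [
    {"start": 2.0, "end": 9.5, "text": "Welcome everyone."},
    {"start": 28.0, "end": 34.0, "text": "First, the roadmap."},
    {"start": 65.0, "end": 71.0, "text": "Now pricing."},
]


def bucket_segments(segments, chunk_duration=30):
    """Group segments into fixed windows keyed by their start time."""
    buckets = {}
    for seg in segments:
        # Integer division maps a start time to its window index.
        idx = int(seg["start"] // chunk_duration)
        buckets.setdefault(idx, []).append(seg["text"].strip())
    return [
        {
            "chunk_id": idx,
            "start_sec": idx * chunk_duration,
            "end_sec": (idx + 1) * chunk_duration,
            "text": " ".join(texts),
        }
        for idx, texts in sorted(buckets.items())
    ]


for chunk in bucket_segments(segments):
    print(chunk["start_sec"], chunk["end_sec"], chunk["text"])
```

Two things are worth noticing. The segment that starts at 28 seconds runs past the 30-second boundary but stays in the first window, because assignment depends only on the start time. The 30 to 60 second window contains no speech, so it never appears in the output.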

Embedding and Building the Index

Once you have chunks, embed each one and store it in a vector index. For a library of up to a few thousand videos, a FAISS flat index, which compares the query against every stored vector, is simple and fast enough to get started.

import numpy as np
import faiss
import json, pickle
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

with open("chunks.json") as f:
    chunks = json.load(f)

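texts = [c["text"] for c in chunks]
embeddings = model.encode(texts, convert_to_numpy=True)
# L2-normalize so that inner product equals cosine similarity.
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
# FAISS expects contiguous float32 arrays.
embeddings = np.ascontiguousarray(embeddings, dtype=np.float32)

# IndexFlatIP performs exact (brute-force) inner-product search.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)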

faiss.write_index(index, "video_index.faiss")
with open("chunks_meta.pkl", "wb") as f:
    pickle.dump(chunks, f)
print(f"Indexed {len(chunks)} chunks")
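The normalization step is what makes an inner-product index rank results by cosine similarity. A small sketch in plain Python shows the relationship, using invented two-dimensional vectors in place of real embeddings (the helper names are illustrative):

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed directly from the definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [6.0, 8.0]   # same direction as a, twice the length
c = [4.0, -3.0]  # perpendicular to a

# The raw inner product grows with vector length, not just with direction.
print("raw a.b:", dot(a, b))
# After normalization, the inner product matches cosine similarity.
print("normalized a.b:", dot(l2_normalize(a), l2_normalize(b)))
print("cosine(a, b):", cosine(a, b))
print("normalized a.c:", dot(l2_normalize(a), l2_normalize(c)))
```

Without normalization, a long vector could outrank a better-aligned short one, which is why the query vector needs the same treatment at search time.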

Querying for Timestamped Clips

With the index built, a semantic query returns the matching moments with timestamps you can use to deep-link into a video player or extract a clip with ffmpeg.
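As a rough sketch of the ffmpeg route, a chunk's start and end times can be turned into a command that cuts the matching clip. The function names and the output file name here are assumptions made for illustration; the ffmpeg options used (`-ss`, `-i`, `-t`, `-c copy`) are standard ones.

```python
import subprocess


def build_clip_command(source_path, start_sec, end_sec, output_path):
    """Return an ffmpeg argument list that cuts [start_sec, end_sec) from source_path."""
    duration = end_sec - start_sec
    if duration <= 0:
        raise ValueError("end_sec must be greater than start_sec")
    return [
        "ffmpeg",
        "-ss", str(start_sec),  # seek to the clip start (placed before -i for fast seeking)
        "-i", source_path,
        "-t", str(duration),    # clip length in seconds
        "-c", "copy",           # copy streams without re-encoding
        output_path,
    ]


def cut_clip(source_path, start_sec, end_sec, output_path):
    """Run the command; requires ffmpeg to be installed and on PATH."""
    subprocess.run(
        build_clip_command(source_path, start_sec, end_sec, output_path),
        check=True,
    )


cmd = build_clip_command("meeting_recording.mp4", 240, 270, "clip_240.mp4")
print(" ".join(cmd))
```

Stream copy is fast because nothing is re-encoded, but cut points snap to nearby keyframes, so the clip may start slightly off the requested time. Dropping `-c copy` re-encodes the clip and gives more precise cuts at the cost of speed.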

Prompt

"Build a query function that takes a question string, embeds it, searches the FAISS index for the top 5 matches, and returns each result with its source file name, start time in MM:SS format, and a direct timestamp URL like: https://example.com/player?t=240"
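One possible shape for such a function is sketched below. It is not the only reasonable answer to the prompt. The names `search_clips` and `format_mmss`, the `source_file` key on each chunk, and the `embed` callable are assumptions introduced here; the base URL is the placeholder from the prompt. The index and the embedder are passed in as arguments so the logic can be read and tested without loading FAISS or a model.

```python
def format_mmss(seconds):
    """Format a number of seconds as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def search_clips(question, index, chunks, embed, top_k=5,
                 base_url="https://example.com/player"):
    """Return the top_k chunks for a question, each with a timestamp link.

    index  -- any object with a FAISS-style search(vectors, k) method
    chunks -- the metadata list saved alongside the index
    embed  -- callable mapping a question to a (1, dim) normalized float32 array
    """
    query_vec = embed(question)
    scores, ids = index.search(query_vec, top_k)

    results = []
    for score, idx in zip(scores[0], ids[0]):
        if idx < 0:
            # FAISS pads with -1 when the index holds fewer than top_k vectors.
            continue
        chunk = chunks[idx]
        start = int(chunk["start_sec"])
        results.append({
            # "source_file" is an assumed metadata key; fall back if it is absent.
            "source_file": chunk.get("source_file", "unknown"),
            "start": format_mmss(start),
            "url": f"{base_url}?t={start}",
            "score": float(score),
            "text": chunk["text"],
        })
    return results


# Wiring it to the index built earlier might look like this (not run here):
#   index = faiss.read_index("video_index.faiss")
#   with open("chunks_meta.pkl", "rb") as f:
#       chunks = pickle.load(f)
#   embed = lambda q: model.encode([q], convert_to_numpy=True,
#                                  normalize_embeddings=True)
#   for hit in search_clips("What did we decide about pricing?",
#                           index, chunks, embed):
#       print(hit["start"], hit["url"], hit["text"][:80])
```

The query is normalized the same way as the indexed chunks, so the scores returned by an inner-product index can be read as cosine similarities.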

The full pipeline turns any collection of recorded meetings or training sessions into a searchable knowledge base. Teams stop re-explaining what was covered in a call three months ago, because anyone can now find it in seconds.


Aki Wijesundara
