Multimodal RAG for Product Catalogs

The Product Catalog Retrieval Problem

A product catalog is one of the most valuable datasets an e-commerce or retail company owns. It contains product images, names, descriptions, specifications, prices, and categories. Users want to search it in natural language: "show me a red dress under $100" or "find a wireless keyboard similar to this photo." A text-only search system handles the first query but fails on the second. A pure image search handles the second but struggles with attribute-heavy queries like the first.

Multimodal RAG for product catalogs means building a search system that understands both text and images simultaneously. You want a user's text query to surface products based on their visual appearance, and a user's image query to surface products based on their attributes. The key ingredient is a shared embedding space where product images and product text land close together.

Encoding Images and Text Together with CLIP

CLIP (Contrastive Language-Image Pre-Training) is a model family trained to embed images and text into a shared vector space. Given a product image and its description, CLIP places their embeddings close together. This means you can query with a text description and retrieve images, or query with an image and retrieve matching text descriptions and product names.

For product catalogs, a common approach creates one embedding per product that combines the image and the textual metadata. You can concatenate the image embedding and the text embedding, average them, or keep them separate and use a late-interaction model that scores them together at query time. Averaging the two L2-normalized embeddings and then re-normalizing the result is the simplest starting point: the combined vector keeps the same dimension as a single CLIP embedding, so both text queries and image queries can search it directly.
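To make the trade-off between averaging and concatenation concrete, here is a minimal sketch that uses random stand-in vectors in place of real CLIP outputs. The helper names (`l2_normalize`, `fuse_average`, `fuse_concat`) are illustrative assumptions, not part of any library.

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector (or each row of a matrix) to unit length."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def fuse_average(img_vec, txt_vec):
    """Average the two unit vectors, then re-normalize. Dimension is unchanged."""
    return l2_normalize((l2_normalize(img_vec) + l2_normalize(txt_vec)) / 2)

def fuse_concat(img_vec, txt_vec):
    """Concatenate the two unit vectors, then re-normalize. Dimension doubles."""
    joined = np.concatenate([l2_normalize(img_vec), l2_normalize(txt_vec)], axis=-1)
    return l2_normalize(joined)

# Stand-in vectors (assumption): real ones would come from the CLIP encoders
rng = np.random.default_rng(0)
img_vec = rng.normal(size=512).astype("float32")
txt_vec = rng.normal(size=512).astype("float32")

avg = fuse_average(img_vec, txt_vec)
cat = fuse_concat(img_vec, txt_vec)
print(avg.shape, cat.shape)
```

The averaged vector stays at 512 dimensions, so a plain CLIP text or image embedding can query it as-is. The concatenated vector has 1024 dimensions, so every query would also need to be built as a 1024-dimensional vector before it could be compared against the index.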

Building the Combined Index

The code below shows how to embed product images and text together using OpenCLIP, then store the combined embeddings in a FAISS index. In production, replace the in-memory FAISS index with Qdrant, Weaviate, or another vector database that supports metadata filtering and persistent storage.

pip install open-clip-torch pillow numpy faiss-cpu

import open_clip
import torch
import numpy as np
import faiss
from PIL import Image

# Load CLIP model
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

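def embed_product(image_path, product_text):
    """Return one L2-normalized float32 vector that fuses a product's image and text."""
    img = preprocess(Image.open(image_path)).unsqueeze(0)  # add a batch dimension
    tokens = tokenizer([product_text])  # text beyond CLIP's 77-token context is truncated

    with torch.no_grad():
        img_features = model.encode_image(img)
        txt_features = model.encode_text(tokens)

    # Normalize each modality first so neither one dominates the average
    img_features /= img_features.norm(dim=-1, keepdim=True)
    txt_features /= txt_features.norm(dim=-1, keepdim=True)

    # Average image and text embeddings for a combined representation
    combined = (img_features + txt_features) / 2
    # Re-normalize so inner-product search behaves like cosine similarity
    combined /= combined.norm(dim=-1, keepdim=True)
    # FAISS expects float32 arrays
    return combined.squeeze(0).numpy().astype("float32")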

# Sample catalog entries
products = [
    {"id": "p001", "image": "shirt_blue.jpg", "text": "Blue cotton slim-fit button-up shirt, size M"},
    {"id": "p002", "image": "dress_red.jpg", "text": "Red floral midi dress, A-line cut, size S"},
    {"id": "p003", "image": "keyboard_wireless.jpg", "text": "Wireless mechanical keyboard, compact layout, USB-C"},
]

dim = 512  # ViT-B-32 output dimension
index = faiss.IndexFlatIP(dim)
product_ids = []

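# Embed every product first, then add them to the index in a single call
embeddings = []
for product in products:
    embeddings.append(embed_product(product["image"], product["text"]))
    # Row i of the index corresponds to product_ids[i]
    product_ids.append(product["id"])

# FAISS expects a float32 matrix of shape (num_products, dim)
index.add(np.stack(embeddings).astype("float32"))

print(f"Index built with {len(product_ids)} products")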

Querying Your Catalog End to End

At query time, embed the user's text query with the CLIP text encoder, L2-normalize it the same way the product embeddings were normalized, and search the index. Because images and text share the same embedding space, a text query can retrieve products whose images match the description, even when the product's own text uses different terminology than the query. This cross-modal retrieval is the core value of the shared embedding space.
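The lookup step can be sketched as follows. With the FAISS index built earlier, a search would look like `scores, indices = index.search(query, k)`, where `query` is a float32 array of shape `(1, 512)` holding the normalized text embedding. FAISS returns row positions rather than product IDs, and it pads the result with `-1` when the index holds fewer than `k` vectors. The arrays below are stand-ins for a real search result, and `to_hits` is a hypothetical helper name.

```python
import numpy as np

def to_hits(scores, indices, product_ids):
    """Map one query's FAISS result rows back to (product_id, score) pairs."""
    hits = []
    for score, idx in zip(scores[0], indices[0]):
        if idx < 0:
            # -1 marks an empty slot: the index had fewer than k vectors
            continue
        hits.append((product_ids[int(idx)], float(score)))
    return hits

# Same order in which products were added to the index
product_ids = ["p001", "p002", "p003"]

# Stand-in values (assumption) shaped like the output of index.search(query, 5)
scores = np.array([[0.31, 0.27, 0.12, -1.0, -1.0]], dtype="float32")
indices = np.array([[1, 0, 2, -1, -1]], dtype="int64")

hits = to_hits(scores, indices, product_ids)
for product_id, score in hits:
    print(product_id, round(score, 2))
```

Keeping `product_ids` in insertion order is what makes this mapping work. A vector database that stores IDs and metadata alongside each vector removes the need for the parallel list.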

For image-to-product search, embed the query image using the CLIP image encoder instead. The same index returns the closest products by visual similarity. You can also do hybrid queries: embed both a reference image and a text modifier like "but in blue," then average the two embeddings to retrieve products that match the visual style with the specified attribute change. This attribute-conditioned image search is a powerful pattern for style-based product discovery.
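The hybrid query described above can be sketched with plain NumPy. The vectors are random stand-ins for CLIP outputs, and `hybrid_query` and its `text_weight` parameter are assumptions introduced for illustration. A weight of 0.5 reproduces the simple average, and other values shift the balance between the reference image and the text modifier.

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector (or each row of a matrix) to unit length."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def hybrid_query(image_vec, text_vec, text_weight=0.5):
    """Blend a reference-image embedding with a text-modifier embedding."""
    mixed = (1.0 - text_weight) * l2_normalize(image_vec) + text_weight * l2_normalize(text_vec)
    # Shape (1, dim) and float32 match what a FAISS search call expects
    return l2_normalize(mixed).astype("float32").reshape(1, -1)

# Stand-ins (assumption): real vectors would come from encode_image / encode_text
rng = np.random.default_rng(0)
ref_image_vec = rng.normal(size=512)
modifier_vec = rng.normal(size=512)  # e.g. the embedding of "but in blue"
catalog = l2_normalize(rng.normal(size=(3, 512))).astype("float32")

query = hybrid_query(ref_image_vec, modifier_vec, text_weight=0.5)

# Brute-force inner product: the same scores an inner-product index computes
scores = catalog @ query[0]
ranking = np.argsort(-scores)  # catalog row positions, best match first
print(ranking)
```

In practice the weight is worth tuning on real queries, since a short modifier may need more or less influence than the reference image depending on the catalog.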

Prompt

"I am building a product search feature for a furniture e-commerce store. Users want to upload a photo of a room and find furniture that matches the style. We have 50,000 products, each with one to five images and a text description. Walk me through how to build a multimodal retrieval system that handles both text queries and image queries from the same index."


Aki Wijesundara

Ready to Launch Your AI Career?

Table of Contents

Share Article

Get Weekly AI Career Tips