RAG Over PDFs: Handle Charts and Screenshots

Why PDFs Break Standard RAG

PDFs are among the most common document formats in enterprise environments, and they are also among the most hostile to standard RAG pipelines. PDFs are designed for visual rendering, not text extraction. A two-column academic paper can extract as a jumble of lines interleaved from both columns. A chart is often stored as a vector graphic or an embedded image, so its data never appears in the extracted text. A table might be a grid of positioned text boxes with no semantic relationship between cells.

If you run a standard PDF parser on a complex document and embed the extracted text, you are embedding noise. The chunks your vector store returns will be garbled fragments that confuse the model and produce wrong or incomplete answers. You need a pipeline that handles each content type correctly and deliberately.

Classifying and Extracting by Content Type

Start by identifying what is on each page. A page can contain flowing text, structured tables, images, charts, screenshots, or a mix. Different content types need different extraction strategies.

For flowing text pages, a library like pymupdf4llm gives you clean markdown-formatted text that preserves paragraph structure and heading hierarchy. For tables, pdfplumber's table extraction identifies cell boundaries and outputs structured rows and columns you can convert to markdown. For images and charts, you need to extract the image pixels and run a vision model over them to convert the visual content into descriptive text you can embed and retrieve.
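As a rough illustration of the table path, the sketch below converts the rows that pdfplumber's `extract_tables()` returns into markdown. It assumes pdfplumber is installed (`pip install pdfplumber`) and that the first row of each table is the header, which will not hold for every document. The helper names `rows_to_markdown` and `extract_tables_as_markdown` are made up for this example.

```python
def rows_to_markdown(rows):
    """Convert one table (a list of rows, each a list of cells) to markdown.

    pdfplumber reports empty cells as None, so those become empty strings.
    Treating the first row as the header is an assumption of this sketch.
    """
    if not rows:
        return ""
    cleaned = [
        [(cell or "").replace("\n", " ").replace("|", "\\|").strip() for cell in row]
        for row in rows
    ]
    # Pad short rows so every row has the same number of columns.
    width = max(len(row) for row in cleaned)
    cleaned = [row + [""] * (width - len(row)) for row in cleaned]
    header, body = cleaned[0], cleaned[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def extract_tables_as_markdown(pdf_path):
    """Yield (page_number, markdown_table) for every table pdfplumber finds."""
    import pdfplumber  # imported lazily so rows_to_markdown works without it

    with pdfplumber.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            for rows in page.extract_tables():
                yield page_number, rows_to_markdown(rows)
```

Each markdown table can then be stored as its own chunk, tagged with the page number it came from.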

Using Vision Models to Handle Images and Charts

When a PDF page contains a chart, screenshot, or diagram, a practical approach is to render the entire page as an image and pass it to a vision-capable LLM with a prompt asking it to extract or describe the content. The model sees the actual rendered output, including axes, labels, colors, and visual relationships, and converts that into text you can embed.
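Render resolution affects both legibility and cost, because many vision APIs downscale images above a certain size. The sketch below picks a DPI so that the longer page edge stays under a pixel cap. PDF page sizes are measured in points (1/72 inch). The default cap of 1568 pixels and the helper names are assumptions for illustration, so take the real limit from your vision provider's documentation.

```python
import math

POINTS_PER_INCH = 72  # PDF user space: 1 point = 1/72 inch


def pick_render_dpi(width_pt, height_pt, max_long_edge_px=1568, preferred_dpi=150):
    """Return the largest DPI <= preferred_dpi whose render fits the pixel cap.

    max_long_edge_px is an assumed default, not a documented constant here.
    """
    long_edge_pt = max(width_pt, height_pt)
    dpi_limit = math.floor(max_long_edge_px * POINTS_PER_INCH / long_edge_pt)
    return max(1, min(preferred_dpi, dpi_limit))


def rendered_size_px(width_pt, height_pt, dpi):
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    return (
        round(width_pt * dpi / POINTS_PER_INCH),
        round(height_pt * dpi / POINTS_PER_INCH),
    )


# A US Letter page is 612 x 792 points.
letter_dpi = pick_render_dpi(612, 792)
letter_size = rendered_size_px(612, 792, letter_dpi)
```

With PyMuPDF, the page dimensions are available as `page.rect.width` and `page.rect.height`, and the chosen value can be passed to `page.get_pixmap(dpi=...)`.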

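You can reduce the cost of this approach by first checking whether a page contains images before routing it to the vision model. Most PDF libraries expose an image list per page. That list covers embedded raster images only, so a chart drawn as vector paths will not appear in it and needs a separate check, such as counting the page's vector drawing operations. Only pages with images, complex layouts, or no extractable text need the vision model pass. Pure text pages can be processed cheaply with a standard text extractor.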
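One way to express this routing decision is a small function over three per-page signals: extractable text length, embedded image count, and vector drawing count. This is a minimal sketch. The thresholds `min_text_chars` and `max_drawings` are illustrative assumptions rather than tuned values, and the function names are hypothetical.

```python
def choose_route(text_chars, image_count, drawing_count,
                 min_text_chars=200, max_drawings=50):
    """Return "vision" or "text" for one page.

    The thresholds are illustrative assumptions, not tuned values.
    """
    if image_count > 0:
        return "vision"  # embedded raster image: chart, photo, or screenshot
    if text_chars < min_text_chars:
        return "vision"  # little or no extractable text, likely a scan
    if drawing_count > max_drawings:
        return "vision"  # many vector paths, possibly a vector chart
    return "text"


def route_pages(pdf_path):
    """Yield (page_number, route) for each page of a PDF using PyMuPDF."""
    import fitz  # PyMuPDF, imported lazily so choose_route has no dependency

    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            yield page_number, choose_route(
                text_chars=len(page.get_text().strip()),
                image_count=len(page.get_images(full=True)),
                drawing_count=len(page.get_drawings()),
            )
```

Note that a blank page also falls under the low-text rule and would be sent to the vision model, so you may want to skip pages with no text, no images, and no drawings.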

The Full PDF Ingestion Pipeline

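The code below sketches an ingestion pipeline that extracts markdown text with pymupdf4llm and additionally sends every page containing embedded images through a vision model. Each chunk is stored with a type tag so your retrieval layer knows what kind of content it is returning. The Anthropic client reads your API key from the ANTHROPIC_API_KEY environment variable.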

pip install pymupdf4llm anthropic PyMuPDF

import pymupdf4llm
import anthropic
import base64
import fitz  # PyMuPDF

def describe_page_image(page_bytes, client):
    img_b64 = base64.standard_b64encode(page_bytes).decode("utf-8")
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": img_b64,
                    },
                },
                {
                    "type": "text",
                    "text": (
                        "Describe this page in detail. "
                        "If it contains a chart or graph, extract all visible data values, "
                        "axis labels, and the chart title. "
                        "If it contains a table, output the table as markdown."
                    ),
                },
            ],
        }],
    )
    return response.content[0].text

def ingest_pdf(pdf_path):
    client = anthropic.Anthropic()
    chunks = []

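    # Extract text as markdown, one chunk per page
    # (page_chunks=True returns a list with one dict per page)
    page_chunks = pymupdf4llm.to_markdown(pdf_path, page_chunks=True)
    for page_num, page_chunk in enumerate(page_chunks, start=1):
        if page_chunk["text"].strip():
            chunks.append({
                "type": "text",
                "content": page_chunk["text"],
                "source": "{} page {}".format(pdf_path, page_num),
            })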

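    # Send pages with embedded raster images to the vision model.
    # Note: get_images() does not report charts drawn as vector paths;
    # page.get_drawings() can help detect those.
    doc = fitz.open(pdf_path)
    for page_num, page in enumerate(doc):
        if len(page.get_images(full=True)) > 0: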
            pix = page.get_pixmap(dpi=150)
            img_bytes = pix.tobytes("png")
            description = describe_page_image(img_bytes, client)
            chunks.append({
                "type": "visual",
                "content": description,
                "source": "{} page {}".format(pdf_path, page_num + 1),
            })

    doc.close()  # release the file handle once all pages are processed
    print("Extracted {} chunks from {}".format(len(chunks), pdf_path))
    return chunks

all_chunks = ingest_pdf("technical_report.pdf")

After ingestion, embed all chunks using your standard text embedding model and insert them into your vector store with the type metadata. At query time, your retrieval will surface both text chunks and vision-generated descriptions. The LLM generating the final answer reads the same structured content regardless of whether it originated from text extraction or a vision model pass.
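To make the storage step concrete, here is a minimal in-memory sketch that keeps the type and source metadata next to each vector and allows filtering by type at query time. `toy_embed` is a bag-of-words stand-in for a real embedding model, `InMemoryStore` is a hypothetical substitute for your vector store, and the two sample chunks are invented for the demonstration.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: a bag-of-words count vector. Replace with a real model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryStore:
    """Hypothetical minimal vector store that keeps metadata with each vector."""

    def __init__(self, embed=toy_embed):
        self.embed = embed
        self.records = []

    def add(self, chunks):
        for chunk in chunks:
            self.records.append({
                "vector": self.embed(chunk["content"]),
                "content": chunk["content"],
                "metadata": {"type": chunk["type"], "source": chunk["source"]},
            })

    def search(self, query, k=3, chunk_type=None):
        """Return the top-k records, optionally restricted to one chunk type."""
        query_vector = self.embed(query)
        candidates = [
            r for r in self.records
            if chunk_type is None or r["metadata"]["type"] == chunk_type
        ]
        ranked = sorted(
            candidates,
            key=lambda r: cosine(query_vector, r["vector"]),
            reverse=True,
        )
        return ranked[:k]


# Made-up sample chunks in the same shape that ingest_pdf produces.
store = InMemoryStore()
store.add([
    {"type": "text", "content": "Revenue grew in the enterprise segment.",
     "source": "report.pdf page 1"},
    {"type": "visual", "content": "Bar chart of revenue by quarter with axis labels.",
     "source": "report.pdf page 2"},
])
top = store.search("revenue chart by quarter", k=1)
```

Keeping the type in metadata lets the retrieval layer restrict a query to vision-generated descriptions, or cite the source page when the answer comes from a chart.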

Prompt

"I have annual reports in PDF format that are full of financial charts. My text-only RAG system cannot answer questions about those charts. Write me a step-by-step plan to upgrade my ingestion pipeline to handle chart pages using a vision model, and tell me what metadata I should store alongside each extracted chart description."

Aki Wijesundara
