Document Processing for Agents

6 minute read

“Garbage In, Garbage Out. The Art of Reading Messy Data.”

TL;DR

80% of enterprise knowledge lives in unstructured documents — PDFs, slides, spreadsheets, screenshots. Text-based extraction (pypdf) is fast but breaks on scans, tables, and complex layouts. OCR handles scans but is slow and error-prone. Vision-Language Models (GPT-4o Vision, LlamaParse) are the new gold standard: they see layout, reconstruct tables into Markdown, and describe charts. For chunking, recursive character splitting is the default; semantic chunking splits by meaning shifts. Document processing determines the ceiling of your RAG system’s intelligence.

High-speed document scanner feeder with papers being fed through an illuminated scanning slot

1. Introduction: The PDF Trap

We like to think of enterprise data as clean rows in a SQL database. In reality, 80% of enterprise knowledge is locked in “Unstructured Documents”: PDF contracts, PowerPoint slides, Excel financial models, and PNG screenshots of dashboards.

For a human, a PDF is easy to read. For an LLM, a PDF is a nightmare.

No Semantic Structure: A PDF doesn’t know what a “Paragraph” or a “Header” is. It only knows “Place character ‘A’ at coordinates x=10, y=20.”
Tables: A table in a PDF is just a grid of lines and floating text. Reconstructing the row/column relationship is an NP-hard problem for traditional parsers.
Layout: Multi-column layouts confuse standard text extractors (reading across columns instead of down).

If your Agent’s retrieval system (RAG) feeds it garbage text from a broken PDF parse, the Agent will fail to answer even simple questions. Document Processing is the unsexy but critical precursor to intelligence.

2. Extraction Strategies: The Hierarchy of Power

How do we get text out? There are three main strategies, ordered by cost and quality.

2.1 Strategy 1: Text-Based Extraction (The Cheap Way)

Tools: pypdf, PyMuPDF, pdfplumber.
Mechanism: Extracts the underlying text stream embedded in the file.
Pros: Extremely fast (milliseconds). Cheap (CPU only).
Cons:
Fails on Scanned PDFs (images wrapped in PDF).
Fails on complex layouts (merges columns).
Fails on tables (flattens them into a string mess).

2.2 Strategy 2: OCR (Optical Character Recognition)

Tools: Tesseract (Open Source), Amazon Textract, Google Document AI, Azure Layout Analysis.
Mechanism: Renders the PDF as an image, then looks for shapes of letters.
Pros: Works on scanned documents and screenshots. Can detect “Forms” and key-value pairs.
Cons: Slow. Expensive. Often misreads “0” (zero) as “O” (letter) or “1” as “l”.

2.3 Strategy 3: Vision-Language Models (The New Gold Standard)

Tools: GPT-4o Vision, LlamaParse (LlamaIndex), Unstructured.io.
Mechanism:
1. Take a screenshot of the PDF page.
2. Send it to a Multi-Modal LLM.
3. Prompt: “Transcribe this page into Markdown, carefully preserving all tables and headers.”
Pros:
Understanding: It “sees” that bold text is a header.
Tables: It reconstructs tables perfectly into Markdown syntax (| col | col |).
Charts: It can describe “Revenue is uptrending” from a bar chart.
Cons: Expensive (Vision tokens cost money). High latency (seconds per page).

3. The Table Problem: The Final Boss

Tables are the nemesis of RAG Agents.

Snippet: “Revenue 2020 10M”
Bad Parse: “Revenue 2020 10M” (Lost the relationship).
Agent Query: “What was revenue in 2020?”
Retrieved Chunk: “Revenue 2020 10M”
Agent Answer: “The revenue is 10M.” (Lucky guess, but robust agents fail).

Solution: Markdown Conversion. We must force the extractor to output Markdown Tables. LLMs are trained heavily on Markdown (from GitHub READMEs) and can reason about them exceptionally well.

LlamaParse is currently the state-of-the-art for this. It utilizes a trained model just to detect table borders and reconstruct the grid structure before generating text.

4. Chunking Strategies: Cutting the Cake

Once you have text, you must split it.

4.1 Recursive Character Splitting

The default.

Split by Paragraph \n\n.
If too big, split by Sentence ..
If too big, split by Word ` `.
- Overlap: Always keep 50-100 tokens of overlap so sentences aren’t cut in half.

4.2 Semantic Chunking

Instead of splitting by size, split by Meaning.

Embed every sentence.
Calculate cosine similarity between S1 and S2.
If similarity drops below a threshold (e.g., 0.7), start a new chunk.
- Result: You get coherent “Topics”.

What about charts? A Vector Database cannot “search” a bar chart.

5.1 Pattern: Image-to-Text Indexing

Extraction: Detect images in the PDF. Crop them.
Captioning: Send the image to a Vision Model. “Describe this chart in detail, including data points.”
- Result: “A bar chart showing Q3 Sales rising by 20% to $12M.”
Embedding: Embed the Caption (Text) into the Vector DB.
Storage: Store the original Image path in metadata.

5.2 Retrieval Flow

User: “Did sales go up?”
Search matches the Caption (“Sales rising…”).
Agent retrieves the caption and says “Yes, sales rose 20%.”
Optional: Agent displays the original image to the user.

6. Code: Modern Parsing Pipeline (Conceptual)

How an “Agentic Ingestion Pipeline” looks in pseudocode.

def ingest_document(file_path):
    # 1. Routing
    if is_scanned(file_path):
        mode = "vision"
    else:
        mode = "text"

        # 2. Parsing (LlamaParse or Unstructured)
        text = parser.parse(file_path, mode=mode, output_format="markdown")

        # 3. Image Extraction
        images = extract_images(file_path)
        for img in images:
            caption = vision_model.caption(img)
            text += f"\n\n![Image]({caption})"

            # 4. Semantic Chunking
            chunks = semantic_chunker(text)

            # 5. Indexing
            vector_db.add(chunks)

7. Summary

Document Processing determines the Ceiling of your agent’s intelligence.

Text Extraction: Use pypdf for simple text, LlamaParse for everything else.
Tables: Must be converted to Markdown.
Charts: Must be captioned by Vision models.

To scale this to millions of documents, we must master Vector Search Algorithms like HNSW and quantization.

FAQ

What is the best way to extract text from PDFs for RAG?

For simple text PDFs, use pypdf or PyMuPDF. For scanned documents, use OCR tools like Tesseract or Amazon Textract. For complex layouts with tables and charts, Vision-Language Models like GPT-4o Vision or LlamaParse are the gold standard — they reconstruct tables into Markdown and describe charts.

Why do tables break RAG systems?

PDFs store tables as grids of lines and floating text with no semantic structure. Standard text extractors flatten tables into string messes like “Revenue 2020 10M”, losing the row-column relationships. Converting tables to Markdown format solves this because LLMs reason about Markdown tables extremely well.

What is semantic chunking?

Instead of splitting documents by fixed character count, semantic chunking embeds every sentence and calculates cosine similarity between adjacent sentences. When similarity drops below a threshold (e.g., 0.7), it starts a new chunk. This produces coherent topic-based chunks rather than arbitrary splits.

How does multi-modal RAG handle charts and images?

Detect and crop images from documents, send them to a Vision model for captioning, embed the caption text into the vector DB, and store the original image path in metadata. At retrieval time, the text caption matches the query semantically, and the agent can optionally display the original image.

Originally published at: arunbaby.com/ai-agents/0010-document-processing

Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch

Document Processing for Agents

TL;DR

1. Introduction: The PDF Trap

2. Extraction Strategies: The Hierarchy of Power

2.1 Strategy 1: Text-Based Extraction (The Cheap Way)

2.2 Strategy 2: OCR (Optical Character Recognition)

2.3 Strategy 3: Vision-Language Models (The New Gold Standard)

3. The Table Problem: The Final Boss

4. Chunking Strategies: Cutting the Cake

4.1 Recursive Character Splitting

4.2 Semantic Chunking

5.1 Pattern: Image-to-Text Indexing

5.2 Retrieval Flow

6. Code: Modern Parsing Pipeline (Conceptual)

7. Summary

FAQ

Related across topics

Share on

TL;DR

1. Introduction: The PDF Trap

2. Extraction Strategies: The Hierarchy of Power

2.1 Strategy 1: Text-Based Extraction (The Cheap Way)

2.2 Strategy 2: OCR (Optical Character Recognition)

2.3 Strategy 3: Vision-Language Models (The New Gold Standard)

3. The Table Problem: The Final Boss

4. Chunking Strategies: Cutting the Cake

4.1 Recursive Character Splitting

4.2 Semantic Chunking

5. Multi-Modal RAG

5.1 Pattern: Image-to-Text Indexing

5.2 Retrieval Flow

6. Code: Modern Parsing Pipeline (Conceptual)

7. Summary

FAQ

Related across topics

Reverse Linked List

Caching Strategies for ML Systems

Voice Enhancement & Noise Reduction

Share on