Document Processing for Agents
“Garbage In, Garbage Out. The Art of Reading Messy Data.”
TL;DR
80% of enterprise knowledge lives in unstructured documents — PDFs, slides, spreadsheets, screenshots. Text-based extraction (pypdf) is fast but breaks on scans, tables, and complex layouts. OCR handles scans but is slow and error-prone. Vision-Language Models (GPT-4o Vision, LlamaParse) are the new gold standard: they see layout, reconstruct tables into Markdown, and describe charts. For chunking, recursive character splitting is the default; semantic chunking splits by meaning shifts. Document processing determines the ceiling of your RAG system’s intelligence.

1. Introduction: The PDF Trap
We like to think of enterprise data as clean rows in a SQL database. In reality, 80% of enterprise knowledge is locked in “Unstructured Documents”: PDF contracts, PowerPoint slides, Excel financial models, and PNG screenshots of dashboards.
For a human, a PDF is easy to read. For an LLM, a PDF is a nightmare.
- No Semantic Structure: A PDF doesn’t know what a “Paragraph” or a “Header” is. It only knows “Place character ‘A’ at coordinates x=10, y=20.”
- Tables: A table in a PDF is just a grid of lines and floating text. Reconstructing the row/column relationship is an NP-hard problem for traditional parsers.
- Layout: Multi-column layouts confuse standard text extractors (reading across columns instead of down).
If your Agent’s retrieval system (RAG) feeds it garbage text from a broken PDF parse, the Agent will fail to answer even simple questions. Document Processing is the unsexy but critical precursor to intelligence.
2. Extraction Strategies: The Hierarchy of Power
How do we get text out? There are three main strategies, ordered by cost and quality.
2.1 Strategy 1: Text-Based Extraction (The Cheap Way)
- Tools:
pypdf,PyMuPDF,pdfplumber. - Mechanism: Extracts the underlying text stream embedded in the file.
- Pros: Extremely fast (milliseconds). Cheap (CPU only).
- Cons:
- Fails on Scanned PDFs (images wrapped in PDF).
- Fails on complex layouts (merges columns).
- Fails on tables (flattens them into a string mess).
2.2 Strategy 2: OCR (Optical Character Recognition)
- Tools:
Tesseract(Open Source), Amazon Textract, Google Document AI, Azure Layout Analysis. - Mechanism: Renders the PDF as an image, then looks for shapes of letters.
- Pros: Works on scanned documents and screenshots. Can detect “Forms” and key-value pairs.
- Cons: Slow. Expensive. Often misreads “0” (zero) as “O” (letter) or “1” as “l”.
2.3 Strategy 3: Vision-Language Models (The New Gold Standard)
- Tools: GPT-4o Vision, LlamaParse (LlamaIndex), Unstructured.io.
- Mechanism:
- Take a screenshot of the PDF page.
- Send it to a Multi-Modal LLM.
- Prompt: “Transcribe this page into Markdown, carefully preserving all tables and headers.”
- Pros:
- Understanding: It “sees” that bold text is a header.
- Tables: It reconstructs tables perfectly into Markdown syntax (
| col | col |). - Charts: It can describe “Revenue is uptrending” from a bar chart.
- Cons: Expensive (Vision tokens cost money). High latency (seconds per page).
3. The Table Problem: The Final Boss
Tables are the nemesis of RAG Agents.
-
Snippet: “Revenue 2020 10M” - Bad Parse: “Revenue 2020 10M” (Lost the relationship).
- Agent Query: “What was revenue in 2020?”
- Retrieved Chunk: “Revenue 2020 10M”
- Agent Answer: “The revenue is 10M.” (Lucky guess, but robust agents fail).
Solution: Markdown Conversion. We must force the extractor to output Markdown Tables. LLMs are trained heavily on Markdown (from GitHub READMEs) and can reason about them exceptionally well.
- LlamaParse is currently the state-of-the-art for this. It utilizes a trained model just to detect table borders and reconstruct the grid structure before generating text.
4. Chunking Strategies: Cutting the Cake
Once you have text, you must split it.
4.1 Recursive Character Splitting
The default.
- Split by Paragraph
\n\n. - If too big, split by Sentence
.. - If too big, split by Word ` `.
- Overlap: Always keep 50-100 tokens of overlap so sentences aren’t cut in half.
4.2 Semantic Chunking
Instead of splitting by size, split by Meaning.
- Embed every sentence.
- Calculate cosine similarity between S1 and S2.
- If similarity drops below a threshold (e.g., 0.7), start a new chunk.
- Result: You get coherent “Topics”.
5. Multi-Modal RAG
What about charts? A Vector Database cannot “search” a bar chart.
5.1 Pattern: Image-to-Text Indexing
- Extraction: Detect images in the PDF. Crop them.
- Captioning: Send the image to a Vision Model. “Describe this chart in detail, including data points.”
- Result: “A bar chart showing Q3 Sales rising by 20% to $12M.”
- Embedding: Embed the Caption (Text) into the Vector DB.
- Storage: Store the original Image path in metadata.
5.2 Retrieval Flow
- User: “Did sales go up?”
- Search matches the Caption (“Sales rising…”).
- Agent retrieves the caption and says “Yes, sales rose 20%.”
- Optional: Agent displays the original image to the user.
6. Code: Modern Parsing Pipeline (Conceptual)
How an “Agentic Ingestion Pipeline” looks in pseudocode.
def ingest_document(file_path):
# 1. Routing
if is_scanned(file_path):
mode = "vision"
else:
mode = "text"
# 2. Parsing (LlamaParse or Unstructured)
text = parser.parse(file_path, mode=mode, output_format="markdown")
# 3. Image Extraction
images = extract_images(file_path)
for img in images:
caption = vision_model.caption(img)
text += f"\n\n"
# 4. Semantic Chunking
chunks = semantic_chunker(text)
# 5. Indexing
vector_db.add(chunks)
7. Summary
Document Processing determines the Ceiling of your agent’s intelligence.
- Text Extraction: Use
pypdffor simple text,LlamaParsefor everything else. - Tables: Must be converted to Markdown.
- Charts: Must be captioned by Vision models.
To scale this to millions of documents, we must master Vector Search Algorithms like HNSW and quantization.
FAQ
What is the best way to extract text from PDFs for RAG?
For simple text PDFs, use pypdf or PyMuPDF. For scanned documents, use OCR tools like Tesseract or Amazon Textract. For complex layouts with tables and charts, Vision-Language Models like GPT-4o Vision or LlamaParse are the gold standard — they reconstruct tables into Markdown and describe charts.
Why do tables break RAG systems?
PDFs store tables as grids of lines and floating text with no semantic structure. Standard text extractors flatten tables into string messes like “Revenue 2020 10M”, losing the row-column relationships. Converting tables to Markdown format solves this because LLMs reason about Markdown tables extremely well.
What is semantic chunking?
Instead of splitting documents by fixed character count, semantic chunking embeds every sentence and calculates cosine similarity between adjacent sentences. When similarity drops below a threshold (e.g., 0.7), it starts a new chunk. This produces coherent topic-based chunks rather than arbitrary splits.
How does multi-modal RAG handle charts and images?
Detect and crop images from documents, send them to a Vision model for captioning, embed the caption text into the vector DB, and store the original image path in metadata. At retrieval time, the text caption matches the query semantically, and the agent can optionally display the original image.
Originally published at: arunbaby.com/ai-agents/0010-document-processing
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch