Document Processing for Agents
“Garbage In, Garbage Out. The Art of Reading Messy Data.”
1. Introduction: The PDF Trap
We like to think of enterprise data as clean rows in a SQL database. In reality, the bulk of enterprise knowledge (commonly estimated at around 80%) is locked in “Unstructured Documents”: PDF contracts, PowerPoint slides, Excel financial models, and PNG screenshots of dashboards.
For a human, a PDF is easy to read. For an LLM, a PDF is a nightmare.
- No Semantic Structure: A PDF doesn’t know what a “Paragraph” or a “Header” is. It only knows “Place character ‘A’ at coordinates x=10, y=20.”
- Tables: A table in a PDF is just a grid of lines and floating text. Reconstructing the row/column relationships is notoriously error-prone for traditional parsers.
- Layout: Multi-column layouts confuse standard text extractors (reading across columns instead of down).
If your Agent’s retrieval system (RAG) feeds it garbage text from a broken PDF parse, the Agent will fail to answer even simple questions. Document Processing is the unsexy but critical precursor to intelligence.
2. Extraction Strategies: The Hierarchy of Power
How do we get text out? There are three main strategies, ordered by cost and quality.
2.1 Strategy 1: Text-Based Extraction (The Cheap Way)
- Tools: `pypdf`, `PyMuPDF`, `pdfplumber`.
- Mechanism: Extracts the underlying text stream embedded in the file.
- Pros: Extremely fast (milliseconds). Cheap (CPU only).
- Cons:
- Fails on Scanned PDFs (images wrapped in PDF).
- Fails on complex layouts (merges columns).
- Fails on tables (flattens them into a string mess).
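A minimal sketch with `pypdf` (the file name is hypothetical; note that a scanned PDF will simply come back as empty strings here):

```python
# Minimal text-based extraction with pypdf.
# Works only if the PDF contains an embedded text stream;
# a scanned PDF (images wrapped in PDF) yields empty strings.
from pypdf import PdfReader

def extract_text(path: str) -> str:
    reader = PdfReader(path)
    pages = [page.extract_text() or "" for page in reader.pages]
    return "\n\n".join(pages)

print(extract_text("contract.pdf")[:500])  # hypothetical file
```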
2.2 Strategy 2: OCR (Optical Character Recognition)
- Tools:
Tesseract(Open Source), Amazon Textract, Google Document AI, Azure Layout Analysis. - Mechanism: Renders the PDF as an image, then looks for shapes of letters.
- Pros: Works on scanned documents and screenshots. Can detect “Forms” and key-value pairs.
- Cons: Slow. Expensive. Often misreads “0” (zero) as “O” (letter) or “1” as “l”.
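A sketch of the open-source route using the `pytesseract` and `pdf2image` wrappers (both require system binaries installed: Tesseract and Poppler). The cloud services expose their own SDKs instead:

```python
# OCR a scanned PDF: render each page to an image, then run Tesseract on it.
import pytesseract
from pdf2image import convert_from_path

def ocr_pdf(path: str) -> str:
    pages = convert_from_path(path, dpi=300)  # one PIL image per page
    return "\n\n".join(pytesseract.image_to_string(img) for img in pages)
```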
2.3 Strategy 3: Vision-Language Models (The New Gold Standard)
- Tools: GPT-4o Vision, LlamaParse (LlamaIndex), Unstructured.io.
- Mechanism:
- Take a screenshot of the PDF page.
- Send it to a Multi-Modal LLM.
- Prompt: “Transcribe this page into Markdown, carefully preserving all tables and headers.”
- Pros:
- Understanding: It “sees” that bold text is a header.
- Tables: It reconstructs tables into clean Markdown syntax (`| col | col |`).
- Charts: It can describe “Revenue is uptrending” from a bar chart.
- Cons: Expensive (Vision tokens cost money). High latency (seconds per page).
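A sketch of the vision approach against the OpenAI API; the model name and prompt are illustrative, and `png_bytes` is assumed to be one PDF page already rendered as a PNG:

```python
# Transcribe one rendered PDF page into Markdown with a vision-language model.
import base64
from openai import OpenAI

client = OpenAI()

def transcribe_page(png_bytes: bytes) -> str:
    b64 = base64.b64encode(png_bytes).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe this page into Markdown, carefully "
                         "preserving all tables and headers."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```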
3. The Table Problem: The Final Boss
Tables are the nemesis of RAG Agents.
Consider a simple financial table:

| Year | Revenue |
|------|---------|
| 2020 | 10M     |

- Bad Parse: “Revenue 2020 10M” (the row/column relationship is lost).
- Agent Query: “What was revenue in 2020?”
- Retrieved Chunk: “Revenue 2020 10M”
- Agent Answer: “The revenue is 10M.” (A lucky guess here; on a table with many rows and columns, the flattened string gives the agent no way to tell which value belongs to which year.)
Solution: Markdown Conversion. We must force the extractor to output Markdown Tables. LLMs are trained heavily on Markdown (from GitHub READMEs) and can reason about them exceptionally well.
- LlamaParse is currently the state-of-the-art for this. It utilizes a trained model just to detect table borders and reconstruct the grid structure before generating text.
4. Chunking Strategies: Cutting the Cake
Once you have text, you must split it.
4.1 Recursive Character Splitting
The default.
- Split by Paragraph (`\n\n`).
- If too big, split by Sentence (`. `).
- If too big, split by Word (` `).
- Overlap: Always keep 50-100 tokens of overlap so sentences aren’t cut in half.
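A sketch using LangChain's `RecursiveCharacterTextSplitter`, configured to follow exactly this hierarchy (sizes are illustrative, and note that this splitter measures overlap in characters rather than tokens):

```python
# Recursive character splitting; separators mirror the hierarchy above:
# paragraph -> sentence -> word.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,     # illustrative; tune per embedding model
    chunk_overlap=100,   # keeps sentences from being cut in half
    separators=["\n\n", ". ", " "],
)

markdown_text = open("parsed.md").read()  # hypothetical parser output
chunks = splitter.split_text(markdown_text)
```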
4.2 Semantic Chunking
Instead of splitting by size, split by Meaning.
- Embed every sentence.
- Calculate cosine similarity between each sentence and the next (S1 vs. S2, S2 vs. S3, and so on).
- If similarity drops below a threshold (e.g., 0.7), start a new chunk.
- Result: You get coherent “Topics”.
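A sketch of the loop; `embed` is a placeholder for any sentence-embedding model, and the 0.7 threshold is the example value from above:

```python
# Semantic chunking: start a new chunk whenever consecutive sentences diverge.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_chunks(sentences: list[str], embed, threshold: float = 0.7) -> list[str]:
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for v_prev, v_next, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(v_prev, v_next) < threshold:
            # Similarity dropped: topic shift, so close the current chunk.
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```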
5. Multi-Modal RAG
What about charts? A Vector Database cannot “search” a bar chart.
5.1 Pattern: Image-to-Text Indexing
- Extraction: Detect images in the PDF. Crop them.
- Captioning: Send the image to a Vision Model. “Describe this chart in detail, including data points.”
- Result: “A bar chart showing Q3 Sales rising by 20% to $12M.”
- Embedding: Embed the Caption (Text) into the Vector DB.
- Storage: Store the original Image path in metadata.
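A sketch of the indexing half using a Chroma-style vector store (`chromadb` is an assumption; the caption would come from the vision call described in Section 2.3):

```python
# Index chart images by their text captions; keep the image path in metadata.
import chromadb

client = chromadb.Client()
collection = client.create_collection("doc_images")

def index_image(image_path: str, caption: str) -> None:
    collection.add(
        documents=[caption],                      # the caption is what gets embedded
        metadatas=[{"image_path": image_path}],   # lets the agent show the original
        ids=[image_path],
    )
```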
5.2 Retrieval Flow
- User: “Did sales go up?”
- Search matches the Caption (“Sales rising…”).
- Agent retrieves the caption and says “Yes, sales rose 20%.”
- Optional: Agent displays the original image to the user.
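Continuing the sketch above, retrieval is a plain text query; the metadata hop is what lets the agent surface the original image:

```python
# The query matches the caption text; metadata points back to the image file.
results = collection.query(query_texts=["Did sales go up?"], n_results=1)
caption = results["documents"][0][0]                    # "A bar chart showing Q3 Sales rising..."
image_path = results["metadatas"][0][0]["image_path"]   # display this to the user
```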
6. Code: Modern Parsing Pipeline (Conceptual)
How an “Agentic Ingestion Pipeline” looks in pseudocode.
```python
def ingest_document(file_path):
    # 1. Routing: scanned PDFs need the vision path; digital PDFs can use cheap text extraction
    if is_scanned(file_path):
        mode = "vision"
    else:
        mode = "text"

    # 2. Parsing (LlamaParse or Unstructured): Markdown output preserves tables
    text = parser.parse(file_path, mode=mode, output_format="markdown")

    # 3. Image Extraction: caption each figure and inline it into the text
    images = extract_images(file_path)
    for img in images:
        caption = vision_model.caption(img)
        text += f"\n\n[Image: {caption}]"

    # 4. Semantic Chunking
    chunks = semantic_chunker(text)

    # 5. Indexing
    vector_db.add(chunks)
```
7. Summary
Document Processing determines the Ceiling of your agent’s intelligence.
- Text Extraction: Use `pypdf` for simple digital text, `LlamaParse` for everything else.
- Tables: Must be converted to Markdown.
- Charts: Must be captioned by Vision models.
To scale this to millions of documents, we must master Vector Search Algorithms like HNSW and quantization.