1. Document Ingestion and Parsing: Garbage In, Garbage Out
Trace almost any RAG pipeline failure far enough back and you arrive at ingestion. PDFs rendered from scanned images without OCR, HTML scraped with boilerplate navigation intact, DOCX files with tables flattened into strings: these are the inputs your chunker and embedding model will see, and they produce poor representations that no amount of downstream tuning can fix.
For PDFs, use Unstructured.io or LlamaParse for layout-aware parsing. Both identify headers, tables, figures, and footnotes separately rather than dumping everything into a single text stream. LlamaParse's premium tier (as of June 2026, $3/1,000 pages) is worth the cost for financial or legal documents where table fidelity matters for accuracy.
For HTML, strip navigation, footers, ads, and script tags before passing to the chunker. A simple BeautifulSoup pass targeting the main content element reduces token waste by 30-60% on most web sources. For structured data like databases or spreadsheets, render each row or record as a self-contained natural-language sentence before embedding — 'Customer Acme Corp placed order #4492 for $12,400 on 2026-05-15' retrieves better than raw CSV cells.
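Both steps can be sketched with Python's standard library alone. The `html.parser` subclass here is a stand-in for the BeautifulSoup pass, not the BeautifulSoup API. The skipped-tag list, the `order_to_sentence` helper, and the record field names are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumed list of boilerplate containers; adjust per site.
SKIP_TAGS = {"nav", "footer", "header", "aside", "script", "style"}


class ContentExtractor(HTMLParser):
    """Collects text that is not nested inside a boilerplate tag."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside any SKIP_TAGS element
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._parts.append(text)

    def text(self):
        return " ".join(self._parts)


def strip_boilerplate(html):
    """Return the visible text of `html` minus navigation, footers, and scripts."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def order_to_sentence(row):
    """Render one order record (hypothetical schema) as a self-contained sentence."""
    return (
        f"Customer {row['customer']} placed order #{row['order_id']} "
        f"for ${row['amount']:,.0f} on {row['date']}"
    )


page = (
    "<html><body><nav>Home | About</nav>"
    "<main><h1>Title</h1><p>Body text.</p></main>"
    "<script>var x = 1;</script><footer>Copyright</footer></body></html>"
)
clean_text = strip_boilerplate(page)

record = {"customer": "Acme Corp", "order_id": 4492, "amount": 12400, "date": "2026-05-15"}
sentence = order_to_sentence(record)
```

This sketch assumes reasonably well-formed markup: an unclosed `<nav>` would suppress everything after it, which is one reason a forgiving parser such as BeautifulSoup is the usual choice in production.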
Metadata extraction at parse time is often skipped and always regretted. Store document title, source URL, section header, date, author, and content type as filterable fields in your vector store. This lets retrieval apply hard filters before running vector similarity, for example 'only search documents from the legal department published after 2025-01-01'. Shrinking the candidate set this way improves precision without touching the embedding layer.
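To make the filter-then-rank order concrete, here is a small in-memory sketch. It is not any particular vector store's API: the `Chunk` fields, the `search` function, and the toy two-dimensional embeddings are all assumptions. A real store would take the filter as part of its query.

```python
import math
from dataclasses import dataclass
from datetime import date


@dataclass
class Chunk:
    text: str
    embedding: list  # toy vector; a real one comes from your embedding model
    department: str  # metadata captured at parse time
    published: date


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(chunks, query_embedding, department, published_after, top_k=3):
    """Apply the hard metadata filter first, then rank the survivors by similarity."""
    candidates = [
        c for c in chunks
        if c.department == department and c.published > published_after
    ]
    ranked = sorted(
        candidates,
        key=lambda c: cosine(c.embedding, query_embedding),
        reverse=True,
    )
    return ranked[:top_k]


chunks = [
    Chunk("NDA template, 2025 revision", [1.0, 0.0], "legal", date(2025, 6, 1)),
    Chunk("NDA template, 2024 revision", [1.0, 0.0], "legal", date(2024, 3, 1)),
    Chunk("Onboarding checklist", [1.0, 0.0], "hr", date(2025, 7, 1)),
    Chunk("Data retention policy", [0.0, 1.0], "legal", date(2025, 9, 1)),
]

results = search(chunks, [1.0, 0.0], department="legal", published_after=date(2025, 1, 1))
```

The 2024 revision and the HR document have embeddings identical to the top hit, so similarity alone could not have excluded them. Only the metadata filter does.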