Parsing Best Practices | Extend Documentation

This guide covers practical tips for getting the best results from Parse: tuning for retrieval quality, accuracy, speed, and cost.

Use section-based chunking for RAG

By default, Parse returns one chunk per page. For retrieval, you usually want chunks that are complete semantic units — so each one can be embedded and retrieved on its own without splitting a table or cutting off a paragraph mid-thought.

Setting `chunkingStrategy.type` to `section` splits the document at logical boundaries (headings, tables, figures) instead of arbitrary page breaks. Use `minCharacters` and `maxCharacters` under `options` to keep chunks within your embedding model’s ideal range.

{
  "config": {
    "target": "markdown",
    "chunkingStrategy": {
      "type": "section",
      "options": { "minCharacters": 500, "maxCharacters": 2000 }
    }
  }
}

Section chunking requires `target: "markdown"`. See Chunking strategy for the full options.
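As a minimal sketch, a request body can be sanity-checked in Python against the constraint stated above before it is sent. The function name `validate_chunking` is hypothetical, and the checks are not the API's own validation. Treating a missing `target` as an error is an assumption made here to stay explicit; the API may apply a default instead.

```python
def validate_chunking(body: dict) -> dict:
    """Check a Parse request body for the section-chunking rules in this guide.

    Hypothetical helper: only the field names come from the guide.
    Returns the body unchanged when it passes, raises ValueError otherwise.
    """
    config = body.get("config", {})
    strategy = config.get("chunkingStrategy", {})

    if strategy.get("type") == "section":
        # Section chunking requires the markdown target.
        # Assumption: a missing target is rejected rather than defaulted.
        if config.get("target") != "markdown":
            raise ValueError('section chunking requires target: "markdown"')

        options = strategy.get("options", {})
        low = options.get("minCharacters")
        high = options.get("maxCharacters")
        if low is not None and high is not None and low > high:
            raise ValueError("minCharacters must not exceed maxCharacters")

    return body


if __name__ == "__main__":
    body = {
        "config": {
            "target": "markdown",
            "chunkingStrategy": {
                "type": "section",
                "options": {"minCharacters": 500, "maxCharacters": 2000},
            },
        }
    }
    validate_chunking(body)
    print("config looks consistent")
```

Running a check like this in your own client catches a mismatched `target` before the request is made.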

Use HTML for complex tables

Markdown tables can’t represent merged cells, nested headers, or multi-row cells, so complex tables often come out misaligned. Set `blockOptions.tables.targetFormat` to `html` to preserve the original table structure.

{
  "config": {
    "blockOptions": { "tables": { "targetFormat": "html" } }
  }
}

If your tables look broken, switching to `html` usually fixes it. For tables that span multiple pages, also enable `blockOptions.tables.tableHeaderContinuationEnabled` so headers repeat on each page. See Tables.
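To illustrate what HTML keeps that Markdown cannot, here is a small sketch that reads merged-cell information (`colspan` and `rowspan`) from an HTML table using only Python's standard library. The sample table is made up for illustration and is not actual Parse output; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TableCells(HTMLParser):
    """Collect (text, colspan, rowspan) for every cell, grouped by row.

    Illustrative only: nested tables are not handled.
    """

    def __init__(self):
        super().__init__()
        self.rows = []      # one list of cell tuples per <tr>
        self._span = None   # (colspan, rowspan) while inside a cell
        self._text = []     # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            attributes = dict(attrs)
            colspan = int(attributes.get("colspan") or 1)
            rowspan = int(attributes.get("rowspan") or 1)
            self._span = (colspan, rowspan)
            self._text = []

    def handle_data(self, data):
        if self._span is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._span is not None:
            if not self.rows:  # tolerate a cell that appears before any <tr>
                self.rows.append([])
            text = "".join(self._text).strip()
            self.rows[-1].append((text, self._span[0], self._span[1]))
            self._span = None


def read_cells(html):
    """Return the table's cells as rows of (text, colspan, rowspan)."""
    parser = TableCells()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample with a merged header, the kind Markdown tables cannot express.
SAMPLE = """
<table>
  <tr><th colspan="2">Revenue</th><th rowspan="2">Notes</th></tr>
  <tr><th>Q1</th><th>Q2</th></tr>
  <tr><td>10</td><td>12</td><td>audited</td></tr>
</table>
"""

if __name__ == "__main__":
    for row in read_cells(SAMPLE):
        print(row)
```

The span values are what let a downstream consumer rebuild the header hierarchy; a Markdown table has nowhere to record them.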

Enable agentic processing only when needed

Agentic processing uses a vision language model to review and correct parsing output. It meaningfully improves accuracy on hard documents, but it adds latency and consumes more credits — so it’s a deliberate accuracy-vs-speed-and-cost tradeoff, not an always-on setting. It’s off by default; enable it only where it helps:

`text.agentic`: corrects low-confidence OCR. Enable for handwriting, faded or skewed scans, unusual fonts, or when you see garbled characters in the output.
`tables.agentic`: reviews and fixes table structure. Enable for tables with misaligned columns, merged cells, or values landing in the wrong column.

Clean, simple PDFs may parse accurately without it — test your document types with it off first, then enable it selectively.

{
  "config": {
    "blockOptions": {
      "text": { "agentic": { "enabled": true } },
      "tables": { "agentic": { "enabled": true } }
    }
  }
}

See Text and Tables.
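One way to keep agentic processing selective is to decide per document type which flags to turn on, and to send nothing extra otherwise. The sketch below assumes you already know, from testing with agentic off, which of your document types have hard text or hard tables. The helper names and the two boolean flags are hypothetical; only the config field names come from this guide.

```python
def agentic_block_options(*, hard_text=False, hard_tables=False):
    """Return blockOptions that enable agentic processing only where asked.

    hard_text:   handwriting, faded or skewed scans, garbled OCR output.
    hard_tables: misaligned columns, merged cells, values in the wrong column.
    """
    options = {}
    if hard_text:
        options["text"] = {"agentic": {"enabled": True}}
    if hard_tables:
        options["tables"] = {"agentic": {"enabled": True}}
    return options


def build_request_body(*, hard_text=False, hard_tables=False):
    """Wrap the selected options in a request body; omit blockOptions if empty."""
    config = {}
    block_options = agentic_block_options(hard_text=hard_text, hard_tables=hard_tables)
    if block_options:
        config["blockOptions"] = block_options
    return {"config": config}


if __name__ == "__main__":
    # Clean digital PDF: leave agentic processing at its default (off).
    print(build_request_body())
    # Scanned form with handwriting: correct text only.
    print(build_request_body(hard_text=True))
```

Leaving the keys out entirely, rather than sending `false`, relies on agentic processing being off by default, as stated above.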

Performance optimization

Parse defaults favor accuracy. Adjust these settings to trade some accuracy for speed and cost, or the reverse.

For fastest, lowest-cost processing:

Setting	Why it helps
`blockOptions.text.agentic.enabled: false`	Skips VLM-based OCR correction — the single biggest latency saver.
`blockOptions.figures.enabled: false`	Skips figure classification and summarization.
`advancedOptions.pageRotationEnabled: false`	Skips rotation detection when pages are already upright.
`chunkingStrategy.type: "page"` or `"document"`	Avoids the extra work of computing semantic sections.
`engine: "parse_light"`	Faster, lower-cost engine for high-volume ingestion; still handles layout well, with some accuracy trade-off on hard scans, handwriting, large tables, and dense checkboxes.

For highest accuracy:

Setting	Why it helps
`engine: "parse_performance"`	Best handling of complex layouts, tables, and scans (default).
`target: "markdown"` + `chunkingStrategy.type: "section"`	Clean reading order with complete semantic chunks.
`blockOptions.text.agentic.enabled: true`	Corrects low-confidence OCR and handwriting.
`blockOptions.tables.tableHeaderContinuationEnabled: true`	Repeats headers across multi-page tables.
`blockOptions.text.signatureDetectionEnabled: true`	Detects signatures in legal documents.
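The two tables above can be treated as override sets layered onto a base config. The following is a sketch of that idea with a small recursive merge. The names `FAST_OVERRIDES`, `ACCURATE_OVERRIDES`, and `deep_merge` are hypothetical, and placing `engine` directly inside `config` next to `target` is an assumption; check the Configuration reference for where it belongs.

```python
import copy

# Settings from the "fastest, lowest-cost" table.
FAST_OVERRIDES = {
    "engine": "parse_light",  # assumption: engine sits alongside target
    "chunkingStrategy": {"type": "page"},
    "blockOptions": {
        "text": {"agentic": {"enabled": False}},
        "figures": {"enabled": False},
    },
    "advancedOptions": {"pageRotationEnabled": False},
}

# Settings from the "highest accuracy" table.
ACCURATE_OVERRIDES = {
    "engine": "parse_performance",  # assumption: engine sits alongside target
    "target": "markdown",
    "chunkingStrategy": {"type": "section"},
    "blockOptions": {
        "text": {"agentic": {"enabled": True}, "signatureDetectionEnabled": True},
        "tables": {"tableHeaderContinuationEnabled": True},
    },
}


def deep_merge(base, overrides):
    """Return a new dict: base with overrides applied, merging nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


if __name__ == "__main__":
    base = {"target": "markdown", "blockOptions": {"tables": {"targetFormat": "html"}}}
    print({"config": deep_merge(base, FAST_OVERRIDES)})
```

Because the merge keeps keys it is not told to change, section-specific `options` in a base config would survive a switch to `page` chunking; drop them yourself if you change the chunking type.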

Troubleshooting

Symptom	What to try
Poor OCR or garbled text	Enable `blockOptions.text.agentic.enabled`; try `target: "spatial"` for very messy or skewed scans.
Chunks are too large or too small	Tune `minCharacters` / `maxCharacters` on `section` chunking, or switch the chunking `type` (`page` / `section` / `document`).
Tables look broken	Set `blockOptions.tables.targetFormat: "html"`; enable `tableHeaderContinuationEnabled` for multi-page tables.
Processing is too slow	See Performance optimization.
Large document times out	Move off the synchronous endpoint — see Move to production with async processing.

Recipes

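Ready-to-run config blocks for common scenarios. The `//` comments explain each setting; standard JSON does not allow comments, so remove them before sending if your client or tooling rejects them. For field-level options and defaults, see Configuration.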

Optimized for a RAG pipeline

{
  "config": {
    "target": "markdown",
    "chunkingStrategy": {
      "type": "section",  // complete semantic units to embed and retrieve
      "options": { "minCharacters": 500, "maxCharacters": 10000 }
    },
    "blockOptions": {
      "figures": { "enabled": true, "figureImageClippingEnabled": false },  // summarize charts/diagrams, skip image exports
      "tables": { "targetFormat": "html" }  // preserve complex table structure
    }
  }
}

Low-latency, cost-optimized

{
  "config": {
    "target": "markdown",
    "chunkingStrategy": { "type": "page" },  // skip semantic section computation
    "blockOptions": {
      "figures": { "enabled": false },  // skip figure analysis (latency + credits)
      "tables": { "targetFormat": "markdown" },  // lighter than html
      "text": { "signatureDetectionEnabled": false, "agentic": { "enabled": false } }  // skip VLM OCR correction
    },
    "advancedOptions": { "pageRotationEnabled": false }  // skip rotation detection when pages are upright
  }
}

Complex legal documents

{
  "config": {
    "target": "markdown",
    "chunkingStrategy": { "type": "section" },
    "blockOptions": {
      "figures": { "enabled": true, "figureImageClippingEnabled": true },
      "tables": { "targetFormat": "html", "tableHeaderContinuationEnabled": true },  // keep headers on multi-page tables
      "text": { "signatureDetectionEnabled": true, "agentic": { "enabled": true } }  // catch signatures + fix tricky text
    }
  }
}

Handwritten forms

{
  "config": {
    "target": "spatial",  // preserve layout when reading order is unreliable
    "chunkingStrategy": { "type": "page" },
    "blockOptions": {
      "figures": { "enabled": false },
      "text": { "signatureDetectionEnabled": true, "agentic": { "enabled": true } }  // VLM correction for handwriting
    }
  }
}
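The recipes above carry `//` comments, which standard JSON parsers reject. If you paste a recipe into code, a sketch like the following strips the comments before parsing, using only Python's standard library. The function name is hypothetical, and this guide does not state whether the Parse API itself tolerates comments, so stripping them is the conservative choice.

```python
import json


def strip_line_comments(text):
    """Remove // comments from JSON-like text, leaving string contents intact."""
    out = []
    in_string = False
    escaped = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False      # this character was escaped; take it as-is
            elif ch == "\\":
                escaped = True       # next character is escaped
            elif ch == '"':
                in_string = False    # closing quote
        elif ch == '"':
            in_string = True
            out.append(ch)
        elif text.startswith("//", i):
            # Skip to the end of the line but keep the newline itself.
            newline = text.find("\n", i)
            if newline == -1:
                break
            i = newline
            continue
        else:
            out.append(ch)
        i += 1
    return "".join(out)


RECIPE = """{
  "config": {
    "target": "spatial",  // preserve layout when reading order is unreliable
    "chunkingStrategy": { "type": "page" },
    "blockOptions": {
      "figures": { "enabled": false },
      "text": { "signatureDetectionEnabled": true, "agentic": { "enabled": true } }  // VLM correction for handwriting
    }
  }
}"""

if __name__ == "__main__":
    body = json.loads(strip_line_comments(RECIPE))
    print(json.dumps(body, indent=2))
```

Tracking whether the scanner is inside a string matters because a value such as a URL contains `//` that must not be treated as a comment.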

Move to production with async processing

The quick start and the examples above use the synchronous `/parse` endpoint, the fastest way to try Parse and iterate on config. When you’re ready to run Parse in production, switch to the asynchronous `/parse_runs` endpoint.

Why the sync `/parse` endpoint isn’t built for production:

It has a 5-minute timeout — large or complex documents can exceed it and fail.
It holds a connection open for the entire parse, which is brittle for big files and bursty traffic.
There’s no delivery mechanism — you can’t receive a webhook when the run finishes; you only get the result if the request stays open.
It’s intended for onboarding and testing, not sustained workloads.

What async (`/parse_runs`) gives you:

Reliable handling of large documents without timeouts.
Results via polling (`GET /parse_runs/{id}`) or webhooks, so you’re not holding connections open.
A better fit for batch and high-volume pipelines.

See Async Processing for the full comparison, polling options, and webhook setup.
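As a rough sketch of the polling option, the loop below checks a run repeatedly with exponential backoff until it leaves a pending state. It deliberately takes the HTTP call as a parameter: `fetch_run` is a hypothetical callable you supply that performs `GET /parse_runs/{id}` and returns the decoded JSON. The `status` field name and the pending status values are assumptions, not taken from this guide; replace them with the values documented in Async Processing.

```python
import time


def poll_parse_run(
    fetch_run,
    run_id,
    *,
    pending_statuses=("PENDING", "PROCESSING"),  # assumption: placeholder values
    initial_delay=1.0,
    max_delay=30.0,
    timeout=900.0,
    sleep=time.sleep,
):
    """Poll a parse run until its status is no longer pending.

    fetch_run: callable taking a run id and returning the run as a dict.
    Any status outside pending_statuses is treated as terminal and returned;
    the caller decides whether that means success or failure.
    """
    pending = set(pending_statuses)
    deadline = time.monotonic() + timeout
    delay = initial_delay

    while True:
        run = fetch_run(run_id)
        if run.get("status") not in pending:
            return run
        if time.monotonic() + delay > deadline:
            raise TimeoutError(f"parse run {run_id} still pending after {timeout}s")
        sleep(delay)
        delay = min(delay * 2, max_delay)  # back off, capped at max_delay


if __name__ == "__main__":
    # Stand-in for a real HTTP call, so the sketch runs without network access.
    fake_responses = iter(
        [{"status": "PROCESSING"}, {"status": "PROCESSED", "id": "run_123"}]
    )
    result = poll_parse_run(lambda _id: next(fake_responses), "run_123", sleep=lambda s: None)
    print(result)
```

Backoff keeps request volume low for long-running documents. For high-volume pipelines, webhooks avoid polling altogether.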