Parsing Configuration | Extend Documentation

The Parse API accepts a config object that controls how documents are parsed and processed. Configuration options are organized into several categories:

Engine: The base processor that powers parsing (accuracy vs. speed and cost)
Chunking Strategy: How the document is divided into chunks
Block Options: Fine-grained control over parsing specific layout types
Advanced Options: OCR enhancements and page filtering

For more information on default values and the full schema, see the Create Parse Run API reference.

Prefer a UI? Extend Studio lets you configure the parser visually and export the config JSON.

Engine

`engine`

Type: "parse_performance" | "parse_light" (default: "parse_performance")

Selects the base processor that powers parsing.

Processor	When to use
`parse_performance`	The default, and the best choice when accuracy matters most. Prefer Performance for strong checkbox support, more complex tables, better handwriting handling, and robust multilingual handling.
`parse_light`	Best for high-volume, cost-sensitive ingestion. Faster and lower cost per page, and still handles layout fine — but performs worse on things like lower-quality scans, harder handwriting, larger tables, non-Latin-based languages, and dense checkbox regions.

1 {
2   "config": {
3     "engine": "parse_performance"
4   }
5 }

`engineVersion`

Type: string (default: "latest")

Pins the run to a specific version of the selected engine. If omitted, the latest stable version is used. Some features require a minimum version — for example, the new layout model’s block types require parse_performance ≥ 2.0.0. See the Parse Performance and Parse Light version pages for the changelogs.

1 {
2   "config": {
3     "engine": "parse_performance",
4     "engineVersion": "2.0.0"
5   }
6 }

Chunking strategy

`chunkingStrategy.type`

Type: "page" | "section" | "document"

Determines the granularity of document chunking.

"page" — Creates a separate chunk for each page of the document.

"section" — Chunks the document into logical sections based on markdown structure (headings, subheadings). The parser ensures logical groups of content are preserved by never breaking markdown elements across chunks. This is ideal for RAG systems where each chunk should be a complete semantic unit.

"document" — Treats the entire document as a single chunk. Use this for small documents or when you have custom downstream chunking requirements.

1 {
2   "config": {
3     "chunkingStrategy": {
4       "type": "section"
5     }
6   }
7 }

`chunkingStrategy.options.minCharacters`

Type: number

The minimum number of characters per chunk. Small sections may be combined to meet this minimum.

`chunkingStrategy.options.maxCharacters`

Type: number

The maximum number of characters per chunk. Long sections will be split at natural boundaries when possible.

1 {
2   "config": {
3     "chunkingStrategy": {
4       "type": "section",
5       "options": {
6         "minCharacters": 500,
7         "maxCharacters": 2000
8       }
9     }
10   }
11 }

Block options

Fine-grained control over how specific content types are detected and formatted.

Figures

`blockOptions.figures.enabled`

Type: boolean

Enables or disables figure detection and parsing. When enabled, the visual languages models will do additional processing, classification and summarization of images, charts, diagrams, and more to analyze and extract content from each figure.
Note: This adds processing latency, especially for documents with many figures. Disable for fastest processing.

`blockOptions.figures.figureImageClippingEnabled`

Type: boolean

When enabled, extracts figure images from the document and uploads them to blob storage, providing presigned URLs in the output. Each figure is cropped from the page and saved as a PNG.

`blockOptions.figures.advancedChartExtractionEnabled`

Type: boolean

When enabled, runs a specialized vision model on any figure classified as a chart and converts it into a structured data table. The figure block’s content is replaced with the extracted table (formatted according to blockOptions.tables.targetFormat), making chart values available to downstream LLMs and extraction pipelines.

Requires blockOptions.figures.enabled: true (chart classification happens during figure parsing).

Note: Requires engine: "parse_performance" with engineVersion >= "2.0.0". Adds processing latency proportional to the number of charts in the document.

`blockOptions.figures.customInstructions`

Type: string (max 2,000 characters)

Custom instructions injected into the vision model prompt that analyzes and summarizes each figure. Use this to steer figure descriptions toward what matters for your use case — for example, domain-specific terminology, details to always capture from diagrams, or how to describe particular image types.

Requires blockOptions.figures.enabled: true (the instructions are applied during figure parsing).

Note: Requires parse_performance with engineVersion >= "2.0.0" or parse_light with engineVersion >= "1.0.0".

1 {
2   "config": {
3     "blockOptions": {
4       "figures": {
5         "enabled": true,
6         "figureImageClippingEnabled": true,
7         "advancedChartExtractionEnabled": true,
8         "customInstructions": "Describe all medical imaging figures using radiology terminology. Always note any visible measurement scales or annotations."
9       }
10     }
11   }
12 }

Tables

`blockOptions.tables.targetFormat`

Type: "markdown" | "html" — defaults to "html".

Controls the output format for tables.

markdown: Human-readable pipe syntax, works well with LLMs
html: Preserves complex table structure (merged cells, rowspan, colspan)

`blockOptions.tables.tableHeaderContinuationEnabled`

Type: boolean

When enabled, automatically propagates table headers across page breaks. Useful for long tables spanning multiple pages where headers are only present on the first page.

`blockOptions.tables.cellBlocksEnabled`

Type: boolean

When enabled, each table block in the response includes a children array containing per-cell table_cell blocks, with rowIndex, columnIndex, content, and bounding boxes. Useful for building cell-level overlays, targeted highlighting, or custom downstream parsing of table contents.

Note: Requires engine: "parse_performance" with engineVersion >= "2.0.0".

1 {
2   "config": {
3     "blockOptions": {
4       "tables": {
5         "targetFormat": "html",
6         "tableHeaderContinuationEnabled": true,
7         "cellBlocksEnabled": true
8       }
9     }
10   }
11 }

Text

`blockOptions.text.signatureDetectionEnabled`

Type: boolean

Enables advanced signature detection. When enabled, identifies handwritten signatures, initials, and signature blocks in the document.

`blockOptions.text.agentic.enabled`

Type: boolean

Enables OCR corrections using vision language models to review and correct low OCR confidence and difficult to parse handwriting. The system automatically identifies text regions with low OCR confidence and applies VLM-based corrections.

When to enable:

Handwritten documents or forms
Poor quality scans
Historical documents or faded text
Mixed print and handwritten content

Note: Increases latency, especially when every page is scanned.

1 {
2   "config": {
3     "blockOptions": {
4       "text": {
5         "signatureDetectionEnabled": true,
6         "agentic": {
7           "enabled": true
8         }
9       }
10     }
11   }
12 }

Formulas

`blockOptions.formulas.enabled`

Type: boolean

Enables formula detection and parsing. When enabled, mathematical formulas and equations are detected as formula blocks, with their LaTeX representations included in the block details. When disabled, formulas will be detected as text blocks and may not be accurately parsed.

Note: Formula parsing is available on parse_performance v2.0.0 and parse_light v1.0.0.

1 {
2   "config": {
3     "blockOptions": {
4       "formulas": {
5         "enabled": true
6       }
7     }
8   }
9 }

Forms (key-value)

The new layout model in parse_performance v2.0.0 detects multi-column form regions as dedicated key_value blocks. Each form region is then enriched by a VLM into properly ordered markdown of label/value pairs (including checkboxes and signatures).

`blockOptions.keyValue.blankFieldFormattingEnabled`

Type: boolean

When enabled, empty form fields are explicitly annotated with a <blank /> marker in the formatted form content, making it easy to distinguish “field present but unfilled” from “field not present” in downstream extraction or RAG pipelines.

For example, an unchecked field in the source document would render as Date of Birth: <blank /> instead of Date of Birth:.

Note: Requires engine: "parse_performance" with engineVersion >= "2.0.0".

1 {
2   "config": {
3     "blockOptions": {
4       "keyValue": {
5         "blankFieldFormattingEnabled": true
6       }
7     }
8   }
9 }

Barcodes

Barcodes and QR codes are detected as dedicated barcode blocks by the new layout model in parse_performance v2.0.0.

`blockOptions.barcodes.readingEnabled`

Type: boolean

When enabled, decodes detected barcodes and QR codes into their underlying values (e.g. URLs, invoice codes, tracking numbers). The decoded value and barcode format (e.g. "QRCode", "Code128", "DataMatrix", "EAN13") are returned in the block’s details.decodedValue and details.format fields, and the content field contains the decoded value.

`blockOptions.barcodes.imageClippingEnabled`

Type: boolean

When enabled, crops each barcode region from the page, uploads it to blob storage, and returns a presigned URL in details.imageUrl. Useful for displaying or re-processing the original barcode image.

Note: Both options require engine: "parse_performance" with engineVersion >= "2.0.0".

1 {
2   "config": {
3     "blockOptions": {
4       "barcodes": {
5         "readingEnabled": true,
6         "imageClippingEnabled": true
7       }
8     }
9   }
10 }

Advanced options

`advancedOptions.pageRotationEnabled`

Type: boolean

Enables automatic page rotation detection and correction. The system detects if pages are rotated and automatically rotates them to the correct orientation before parsing.

1 {
2   "config": {
3     "advancedOptions": {
4       "pageRotationEnabled": true
5     }
6   }
7 }

`advancedOptions.pageRanges`

Type: Array<{ start: number, end: number }>

Specifies which pages of the document to process. Page numbers are 1-based and ranges are inclusive. Ranges can overlap and be in any order—the system automatically merges and sorts them.

1 {
2   "config": {
3     "advancedOptions": {
4       "pageRanges": [
5         { "start": 1, "end": 3 },
6         { "start": 10, "end": 10 }
7       ]
8     }
9   }
10 }

1-based, inclusive page numbers
Ranges can overlap or arrive out of order; the platform merges and sorts them automatically
Omit the field or pass [] to process the full document (subject to global limits)
You are only billed for pages actually processed

Excel

`advancedOptions.excelParsingMode`

Type: "basic" | "advanced"

Controls how Excel files are parsed.

basic: Fast, deterministic parsing.
advanced: Enable layout block detection for complex spreadsheets.

For .xls files, basic mode is always used.

1 {
2   "config": {
3     "advancedOptions": {
4       "excelParsingMode": "advanced"
5     }
6   }
7 }

`advancedOptions.excelSkipHiddenContent`

Type: boolean

When enabled, hidden rows, columns, and sheets are excluded from parsed Excel output.

1 {
2   "config": {
3     "advancedOptions": {
4       "excelSkipHiddenContent": true
5     }
6   }
7 }

`advancedOptions.excelUseRawCellValues`

Type: boolean

When enabled, returns raw calculated cell values instead of locale-formatted values. For example, a date stored as 45672 would be returned as 45672 instead of 01/15/2025. Useful when downstream processing needs the underlying numeric or unformatted data.

1 {
2   "config": {
3     "advancedOptions": {
4       "excelUseRawCellValues": true
5     }
6   }
7 }

`advancedOptions.excelSkipCalculation`

Type: boolean (default: true)

When enabled, skips formula recalculation when opening the workbook. This significantly improves parsing speed for formula-heavy spreadsheets. Disable if you need formulas to be recalculated before parsing (e.g., when cell values depend on volatile functions like NOW() or TODAY()).

1 {
2   "config": {
3     "advancedOptions": {
4       "excelSkipCalculation": false
5     }
6   }
7 }

`advancedOptions.excelIncludeCellMetadata`

Type: boolean

When enabled, table cell blocks include their source spreadsheet coordinate in details.cellReference (for example, "B2", or "A1:C1" for a merged cell). Formula cells also include details.formula with a leading =. Heading or text blocks derived from spreadsheet cells can include a source range in details.cellReference.

For HTML table output, the same provenance is also preserved as data-cell and data-formula attributes. Markdown output does not include HTML attributes, but the structured block details are still available.

Note: Requires advancedOptions.excelParsingMode: "advanced".

1 {
2   "config": {
3     "advancedOptions": {
4       "excelParsingMode": "advanced",
5       "excelIncludeCellMetadata": true
6     }
7   }
8 }

`advancedOptions.excelIncludeCellFormatting`

Type: boolean

When enabled, table cell blocks include structured formatting (bold, italic, fontColor, backgroundColor) in details.formatting when the source cell has formatting. For HTML table output, the cell’s inline style is also preserved, including colors, bold or italic text, and borders.

Useful when spreadsheet formatting carries meaning for downstream extraction, review, or display.

Note: Requires advancedOptions.excelParsingMode: "advanced". Can increase response size for heavily formatted workbooks.

1 {
2   "config": {
3     "advancedOptions": {
4       "excelParsingMode": "advanced",
5       "excelIncludeCellFormatting": true
6     }
7   }
8 }

`advancedOptions.returnOcr`

Options for returning raw OCR data in the response.

`advancedOptions.returnOcr.words`

Type: boolean

When enabled, returns word-level bounding boxes in the response under output.ocr.words. Each word includes content, boundingBox, confidence (0-1), and pageNumber. Coordinates are in points (1/72 inch) from the top-left corner.

Useful for building document viewers with precise text selection, word-level search highlighting, or OCR quality assessment.

Note: This meaningfully impacts the response size. Consider using the signed URL return format (responseType=url) for large documents.

1 {
2   "config": {
3     "advancedOptions": {
4       "returnOcr": {
5         "words": true
6       }
7     }
8   }
9 }

Response structure:

1 {
2   "object": "parse_run",
3   "output": {
4     "chunks": [...],
5     "ocr": {
6       "words": [
7         {
8           "content": "Invoice",
9           "boundingBox": {
10             "left": 72.5,
11             "top": 50.2,
12             "right": 120.3,
13             "bottom": 65.8
14           },
15           "confidence": 0.99,
16           "pageNumber": 1
17         }
18       ]
19     }
20   }
21 }

`advancedOptions.alwaysConvertToPdf`

Type: boolean

When enabled, supported file types (images, Word documents, PowerPoint, Excel, HTML) are converted to PDF before parsing. This can improve parsing quality for some file types and ensures spatial output with bounding boxes.

1 {
2   "config": {
3     "advancedOptions": {
4       "alwaysConvertToPdf": true
5     }
6   }
7 }

`advancedOptions.formattingDetection`

Type: Array<{ type: "change_tracking" }>

Enables detection of tracked changes (insertions, deletions, substitutions) indicated by strikethroughs, colored text, or underlines in the source document. Detected changes are annotated inline in the content field of applicable blocks (text, heading, section_heading, header, footer).

Changes are represented using semantic HTML elements (<ins> for insertions and <del> for deletions) grouped inside a <change> wrapper. We use <ins>/<del> rather than <s> because they carry explicit change semantics, and the <change> wrapper makes it easy to identify substitutions (a <del> paired with an <ins>).

For example: The total is <change><del>$500.00</del><ins>$450.00</ins></change>.

Note: Requires engine: "parse_performance" with engineVersion >= "2.0.0". This feature is new. If you have feedback on the output format or edge cases, drop us a note in Slack or email support@extend.app.

1 {
2   "config": {
3     "engine": "parse_performance",
4     "engineVersion": "2.0.0",
5     "advancedOptions": {
6       "formattingDetection": [{ "type": "change_tracking" }]
7     }
8   }
9 }

Response type

You can specify the response type in the responseType query parameter for the Create Parse Run and Get Parse Run endpoints.

json returns parsed outputs directly in the response body.
url returns a presigned URL to the parsed content in outputUrl.

Use url for large documents to reduce response payload size, especially when returnOcr.words is enabled.