The Parse API accepts a config object that controls how documents are parsed and processed. Configuration options are organized into several categories, including target, chunkingStrategy, blockOptions, and advancedOptions.
For more information on default values and the full schema, see the Parse File API reference.
Prefer a UI? Extend Studio lets you configure the parser visually and export the config JSON.
target
Type: "markdown" | "spatial"
Determines how content is extracted and formatted from the document.
When to choose markdown:
When to choose spatial:
Tip: If unsure, prefer markdown. Only consider spatial in special cases.
chunkingStrategy.type
Type: "page" | "section" | "document"
Determines the granularity of document chunking.
"page" — Creates a separate chunk for each page of the document. Compatible with both markdown and spatial targets.
"section" — Chunks the document into logical sections based on markdown structure (headings, subheadings). The parser ensures logical groups of content are preserved by never breaking markdown elements across chunks. Only works with target: "markdown". This is ideal for RAG systems where each chunk should be a complete semantic unit.
"document" — Treats the entire document as a single chunk. Use this for small documents or when you have custom downstream chunking requirements.
chunkingStrategy.options.minCharacters
Type: number
The minimum number of characters per chunk. Small sections may be combined to meet this minimum.
chunkingStrategy.options.maxCharacters
Type: number
The maximum number of characters per chunk. Long sections will be split at natural boundaries when possible.
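The chunking rules above can be checked before a request is sent. The following is a minimal sketch, not part of the API: validate_chunking is a hypothetical helper, the nesting of the config keys is inferred from the dotted option names, and the check that minCharacters does not exceed maxCharacters is an assumption rather than documented behavior.

```python
from typing import Any

VALID_CHUNK_TYPES = {"page", "section", "document"}


def validate_chunking(config: dict[str, Any]) -> list[str]:
    """Return a list of problems found in the chunking-related settings.

    Hypothetical client-side helper; the service may validate differently.
    """
    problems: list[str] = []
    target = config.get("target")
    strategy = config.get("chunkingStrategy", {})
    chunk_type = strategy.get("type")
    options = strategy.get("options", {})

    if chunk_type is not None and chunk_type not in VALID_CHUNK_TYPES:
        problems.append(f"unknown chunkingStrategy.type: {chunk_type!r}")

    # Documented rule: "section" chunking only works with the markdown target.
    if chunk_type == "section" and target == "spatial":
        problems.append('"section" chunking requires target "markdown"')

    # Assumption: a minimum larger than the maximum is treated as a mistake.
    min_chars = options.get("minCharacters")
    max_chars = options.get("maxCharacters")
    if min_chars is not None and max_chars is not None and min_chars > max_chars:
        problems.append("minCharacters exceeds maxCharacters")

    return problems


# Example values are illustrative only, not recommended defaults.
example_config = {
    "target": "markdown",
    "chunkingStrategy": {
        "type": "section",
        "options": {"minCharacters": 500, "maxCharacters": 3000},
    },
}
```

Calling validate_chunking(example_config) returns an empty list, while the same config with target set to "spatial" reports the section/markdown conflict.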
Fine-grained control over how specific content types are detected and formatted.
blockOptions.figures.enabled
Type: boolean
Enables or disables figure detection and parsing. When enabled, vision language models perform additional processing, classification, and summarization of images, charts, diagrams, and other figures to analyze and extract their content.
Note: This adds processing latency, especially for documents with many figures. Disable for fastest processing.
blockOptions.figures.figureImageClippingEnabled
Type: boolean
When enabled, extracts figure images from the document and uploads them to blob storage, providing presigned URLs in the output. Each figure is cropped from the page and saved as a PNG.
blockOptions.tables.targetFormat
Type: "markdown" | "html"
Controls the output format for tables.
blockOptions.tables.tableHeaderContinuationEnabled
Type: boolean
When enabled, automatically propagates table headers across page breaks. Useful for long tables spanning multiple pages where headers are only present on the first page.
blockOptions.text.signatureDetectionEnabled
Type: boolean
Enables advanced signature detection. When enabled, identifies handwritten signatures, initials, and signature blocks in the document.
blockOptions.text.agentic.enabled
Type: boolean
Enables OCR correction using vision language models (VLMs), which review and correct text with low OCR confidence and difficult-to-parse handwriting. The system automatically identifies text regions with low OCR confidence and applies VLM-based corrections.
When to enable:
Note: Increases latency, especially when every page is scanned.
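The dotted option names in this reference appear to describe nested keys in the config JSON. The sketch below shows one way to expand such dotted paths into a nested object. It assumes that reading of the notation, build_config is a hypothetical helper, and the chosen values are examples rather than defaults.

```python
import json
from typing import Any


def build_config(options: dict[str, Any]) -> dict[str, Any]:
    """Expand dotted option paths into a nested config dictionary."""
    config: dict[str, Any] = {}
    for path, value in options.items():
        *parents, leaf = path.split(".")
        node = config
        for key in parents:
            # Create intermediate objects on demand.
            node = node.setdefault(key, {})
        node[leaf] = value
    return config


# Example values only; consult the API reference for actual defaults.
block_options = build_config({
    "blockOptions.figures.enabled": False,
    "blockOptions.tables.targetFormat": "html",
    "blockOptions.tables.tableHeaderContinuationEnabled": True,
    "blockOptions.text.signatureDetectionEnabled": True,
})

print(json.dumps(block_options, indent=2))
```

Building the config this way keeps each option on one line, which makes it easier to compare against the option names listed here.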
advancedOptions.pageRotationEnabled
Type: boolean
Enables automatic page rotation detection and correction. The system detects if pages are rotated and automatically rotates them to the correct orientation before parsing.
advancedOptions.pageRanges
Type: Array<{ start: number, end: number }>
Specifies which pages of the document to process. Page numbers are 1-based and ranges are inclusive. Ranges can overlap and be in any order—the system automatically merges and sorts them.
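To illustrate the merge-and-sort behavior described above, here is a rough client-side sketch. It is not the service's implementation: merge_page_ranges is a hypothetical helper, and it only merges ranges that actually overlap, leaving adjacent ranges (such as 1-3 and 4-6) separate because the documentation does not say how those are treated.

```python
def merge_page_ranges(ranges: list[dict[str, int]]) -> list[dict[str, int]]:
    """Sort 1-based inclusive page ranges and merge the overlapping ones."""
    spans = sorted((r["start"], r["end"]) for r in ranges)
    merged: list[list[int]] = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous range: extend it if this one reaches further.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [{"start": s, "end": e} for s, e in merged]


page_ranges = merge_page_ranges([
    {"start": 5, "end": 8},
    {"start": 1, "end": 3},
    {"start": 2, "end": 6},
])
```

Normalizing ranges locally is optional, since the ranges may be submitted in any order, but it can make logs and debugging output easier to read.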
Pass [] to process the full document (subject to global limits).
advancedOptions.excelParsingMode
Type: "basic" | "advanced"
Controls how Excel files are parsed.
For .xls files, basic mode is always used.
advancedOptions.excelSkipHiddenContent
Type: boolean
When enabled, hidden rows, columns, and sheets are excluded from parsed Excel output.
advancedOptions.excelUseRawCellValues
Type: boolean
When enabled, returns raw calculated cell values instead of locale-formatted values. For example, a date stored as 45672 would be returned as 45672 instead of 01/15/2025. Useful when downstream processing needs the underlying numeric or unformatted data.
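When raw values are returned, downstream code has to interpret date serials itself. The sketch below converts a whole-number serial to a calendar date. It assumes the workbook uses Excel's 1900 date system (the usual default) and that the serial falls after February 1900; excel_serial_to_date is a hypothetical helper, and fractional serials, which carry a time of day, are not handled.

```python
from datetime import date, timedelta

# In the 1900 date system, counting from 1899-12-30 compensates for Excel's
# fictitious 1900-02-29, so serials from March 1900 onward map correctly.
EXCEL_EPOCH = date(1899, 12, 30)


def excel_serial_to_date(serial: int) -> date:
    """Convert a whole-number Excel date serial to a calendar date."""
    return EXCEL_EPOCH + timedelta(days=serial)


converted = excel_serial_to_date(45672)
print(converted.isoformat())  # prints 2025-01-15
```

This matches the example above, where the serial 45672 corresponds to 01/15/2025.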
advancedOptions.excelSkipCalculation
Type: boolean (default: true)
When enabled, skips formula recalculation when opening the workbook. This significantly improves parsing speed for formula-heavy spreadsheets. Disable if you need formulas to be recalculated before parsing (e.g., when cell values depend on volatile functions like NOW() or TODAY()).
advancedOptions.returnOcr
Options for returning raw OCR data in the response.
advancedOptions.returnOcr.words
Type: boolean
When enabled, returns word-level bounding boxes in the response under ocr.words. Each word includes content, boundingBox, confidence (0-1), and pageNumber. Coordinates are in points (1/72 inch) from the top-left corner.
Useful for building document viewers with precise text selection, word-level search highlighting, or OCR quality assessment.
Note: This significantly increases the response size. Consider using the signed URL return format (responseType=url) for large documents.
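As an illustration of working with word-level OCR data, the sketch below converts point coordinates to pixels for a viewer and filters words by confidence for quality assessment. It relies only on the field names mentioned above (content, confidence, pageNumber). The helper names and the 0.8 threshold are hypothetical, and boundingBox is left out of the sample data because its exact shape is not specified here.

```python
from typing import Any

POINTS_PER_INCH = 72  # coordinates are reported in points (1/72 inch)


def points_to_pixels(value_pt: float, dpi: float) -> float:
    """Convert a coordinate in points to pixels at the given rendering DPI."""
    return value_pt * dpi / POINTS_PER_INCH


def low_confidence_words(
    words: list[dict[str, Any]], threshold: float = 0.8
) -> list[dict[str, Any]]:
    """Return words whose confidence (0-1) falls below the threshold."""
    return [w for w in words if w["confidence"] < threshold]


# Hand-written sample data, not real API output.
sample_words = [
    {"content": "Invoice", "confidence": 0.99, "pageNumber": 1},
    {"content": "t0tal", "confidence": 0.42, "pageNumber": 1},
]

suspect = low_confidence_words(sample_words)
```

A viewer rendering pages at 150 DPI would scale each coordinate with points_to_pixels(value, 150) before drawing highlights.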
Response structure:
advancedOptions.alwaysConvertToPdf
Type: boolean
When enabled, supported file types (images, Word documents, PowerPoint, Excel, HTML) are converted to PDF before parsing. This can improve parsing quality for some file types and ensures spatial output with bounding boxes.
You can specify the response type in the responseType query parameter for the Parse File and Get Parser Run endpoints.
json returns parsed outputs directly in the response body.
url returns a presigned URL to the parsed content.
Use url for large documents to reduce response payload size, especially when returnOcr.words is enabled.
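One way to apply this guidance is to pick the response type from the config before building the request. This is a sketch under assumptions: choose_response_type and its expect_large_document flag are hypothetical, and only the query string is built because the endpoint URL is not given here.

```python
from typing import Any
from urllib.parse import urlencode


def choose_response_type(
    config: dict[str, Any], expect_large_document: bool = False
) -> str:
    """Pick "url" when word-level OCR is requested or the document is large."""
    return_ocr = config.get("advancedOptions", {}).get("returnOcr", {})
    if return_ocr.get("words") or expect_large_document:
        return "url"
    return "json"


ocr_config = {"advancedOptions": {"returnOcr": {"words": True}}}

# Query string to append to the Parse File or Get Parser Run request.
query = urlencode({"responseType": choose_response_type(ocr_config)})
```

Here query evaluates to responseType=url, so the parsed content is retrieved through a presigned URL instead of the response body.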