The Extract API accepts a config object that controls how documents are processed and how values are returned. Configuration options are organized into several categories.
For default values and the full schema, see the Create Extract Run API reference.
Prefer a UI? Extend Studio lets you configure an extractor visually and export the config JSON.
schema
Type: JSON Schema object (required)
Defines the fields to extract and their shape. The root must be an object; each property describes a field you want returned in output.value. Add array properties for repeating rows, nest object properties for grouped data, and use extend:type for typed fields like dates, currency, and signatures.
For the full reference — objects, arrays, enums, and custom types — see Schema.
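As a rough illustration of the shape described above, the sketch below builds a schema with a scalar field, a typed field, a nested object, and an array of rows. The field names are hypothetical, and the "extend:type" value "date" and its pairing with "type": "string" are assumptions; check the Schema reference for the exact accepted values.

```python
import json

# Hypothetical invoice schema. The root is an object, and each property is a
# field expected back in output.value.
schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        # Assumption: "date" is a valid extend:type value for typed date fields.
        "invoice_date": {"type": "string", "extend:type": "date"},
        # Nested object for grouped data.
        "vendor": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "address": {"type": "string"},
            },
        },
        # Array property for repeating rows.
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "total": {"type": "number"},
                },
            },
        },
    },
}

config = {"schema": schema}
print(json.dumps(config, indent=2))
```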
baseProcessor
Type: "extraction_performance" | "extraction_light" (default: "extraction_performance")
Selects the model family that powers extraction.
baseVersion
Type: string
Pins the run to a specific version of the selected processor. If omitted, the latest stable version is used. See the Extraction Performance versions page for the changelog.
extractionRules
Type: string
Plain-language rules that steer the model — useful for disambiguating fields, setting formats, or encoding business logic. Applied across the whole extraction.
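The top-level options above might be assembled as in the following sketch. The helper build_config is hypothetical, the version string format "4.4.0" is an assumption, and the rules text is only an example; optional keys are left out when unset so that the documented defaults apply.

```python
import json

# Values accepted for baseProcessor, as listed above.
BASE_PROCESSORS = {"extraction_performance", "extraction_light"}


def build_config(schema, base_processor="extraction_performance",
                 base_version=None, extraction_rules=None):
    """Assemble a config dict, omitting optional keys that are not set."""
    if base_processor not in BASE_PROCESSORS:
        raise ValueError(f"unknown baseProcessor: {base_processor!r}")
    config = {"schema": schema, "baseProcessor": base_processor}
    if base_version is not None:
        # Omitting baseVersion means the latest stable version is used.
        config["baseVersion"] = base_version
    if extraction_rules is not None:
        config["extractionRules"] = extraction_rules
    return config


config = build_config(
    schema={"type": "object", "properties": {"total": {"type": "number"}}},
    base_version="4.4.0",  # assumed version string format
    extraction_rules="Return dates as YYYY-MM-DD. Use the grand total, not the subtotal.",
)
print(json.dumps(config, indent=2))
```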
advancedOptions.citationsEnabled
Type: boolean
Returns spatial (bounding-box) references and source text for each extracted value. Useful for highlighting and validation in review interfaces, but adds processing overhead. See Citations for the response shape.
Generating citations uses an additional citation-focused model, which moderately increases latency. Disable citations in latency-critical pipelines that don’t need spatial references.
advancedOptions.citationMode
Type: "line" | "word" | "block" (default: "line")
Controls the granularity of each citation. Requires citationsEnabled: true and a base processor version that supports bounding-box citations.
advancedOptions.arrayCitationStrategy
Type: "item" | "property"
Granularity for citations on array fields. Requires citationsEnabled: true; property-level citations also require extraction_performance ≥ 4.4.0.
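Because the citation granularity options depend on citationsEnabled, a small client-side check can catch a misconfigured object before it is sent. The helper check_citation_options below is a hypothetical sketch, not part of the API; it checks only the citationsEnabled dependency and does not verify the processor version requirement.

```python
def check_citation_options(advanced_options):
    """Return a list of problems with the citation-related options."""
    problems = []
    enabled = advanced_options.get("citationsEnabled") is True
    # Both granularity options require citationsEnabled: true.
    for key in ("citationMode", "arrayCitationStrategy"):
        if key in advanced_options and not enabled:
            problems.append(f"{key} is set but citationsEnabled is not true")
    return problems


ok_options = {
    "citationsEnabled": True,
    "citationMode": "word",
    "arrayCitationStrategy": "property",
}
bad_options = {"citationMode": "block"}

print(check_citation_options(ok_options))
print(check_citation_options(bad_options))
```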
advancedOptions.advancedMultimodalEnabled
Type: boolean
Uses vision-language models to better understand visual elements in the document. Essential for scanned documents, handwritten content, checks and forms, and poor-quality images. It adds latency, so disable it for clean digital PDFs, text-only documents, and latency-critical workflows where visual understanding isn’t required.
advancedOptions.modelReasoningInsightsEnabled
Type: boolean
Returns the model’s reasoning for each field as reasoning entries in the metadata insights array. Useful for debugging and validation during development; consider disabling it in production to reduce overhead. See Insights.
advancedOptions.reviewAgent.enabled
Type: boolean
When enabled, an automated agent reviews each extracted value and adds a reviewAgentScore (1–5) to the field’s metadata, plus issue and review_summary insights that flag fields needing manual review. See Review Agent.
advancedOptions.currentDateEnabled
Type: boolean (default: false)
Includes the current date as context for the model during extraction.
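The trade-offs described for these toggles can be captured in a small helper that derives the flags from the kind of workload. This is a sketch under assumptions: advanced_options_for and its parameters are hypothetical names, and the nesting of reviewAgent.enabled is inferred from the dotted option path.

```python
def advanced_options_for(visual_content, needs_citations, development,
                         review=False):
    """Pick advancedOptions flags for a workload.

    visual_content: scans, handwriting, checks, forms, or poor-quality images.
    needs_citations: bounding boxes and source text are needed downstream.
    development: reasoning insights are wanted for debugging.
    review: run the review agent to score each extracted value.
    """
    return {
        "advancedMultimodalEnabled": visual_content,
        "citationsEnabled": needs_citations,
        "modelReasoningInsightsEnabled": development,
        "reviewAgent": {"enabled": review},
    }


# Latency-critical production pipeline over clean digital PDFs.
production = advanced_options_for(
    visual_content=False, needs_citations=False, development=False)

# Development setup for scanned forms that are checked in a review interface.
dev = advanced_options_for(
    visual_content=True, needs_citations=True, development=True, review=True)

print({"advancedOptions": production})
print({"advancedOptions": dev})
```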
advancedOptions.arrayStrategy.type
Type: "large_array_heuristics" | "large_array_max_context" | "large_array_overlap_context"
Controls how very large arrays (for example, hundreds of line items across many pages) are extracted and merged. Omit arrayStrategy for the default behavior; set it only for large-array use cases. If you’re unsure which to use, reach out to the Extend team.
Extract breaks large documents into chunks, extracts from each, and merges the results. These options tune that process.
advancedOptions.chunkingOptions.chunkingStrategy
Type: "standard" | "semantic"
advancedOptions.chunkingOptions.pageChunkSize
Type: integer
The number of pages per chunk (25 by default). Larger chunks mean fewer processing calls and less overhead; smaller chunks can lower latency for large-array extraction.
advancedOptions.chunkingOptions.chunkSelectionStrategy
Type: "intelligent" | "confidence" | "take_first" | "take_last"
When the same field is found in multiple chunks, this decides which value wins.
advancedOptions.chunkingOptions.customSemanticChunkingRules
Type: string
Custom rules to guide semantic chunking.
Large tables can shrink the effective chunk size when chunking by page. To preserve context across a long table, try intelligent merging (chunkSelectionStrategy: "intelligent") and enable table header continuation in parseConfig (see Parse config).
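To get a feel for how pageChunkSize affects the number of processing calls, the sketch below estimates the chunk count for a document. It assumes chunks are cut purely by page count, so treat the result as a rough figure: as noted above, large tables can shrink the effective chunk size. The helper estimate_chunk_count is hypothetical.

```python
import math


def estimate_chunk_count(page_count, page_chunk_size=25):
    """Estimate how many chunks a document splits into when chunked by page."""
    if page_count < 0 or page_chunk_size < 1:
        raise ValueError("page_count must be >= 0 and page_chunk_size >= 1")
    return math.ceil(page_count / page_chunk_size)


# Smaller chunks, with intelligent merging when a field appears in several chunks.
chunking_options = {
    "pageChunkSize": 10,
    "chunkSelectionStrategy": "intelligent",
}

default_chunks = estimate_chunk_count(120)
small_chunks = estimate_chunk_count(120, chunking_options["pageChunkSize"])
print(default_chunks, small_chunks)  # prints: 5 12
```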
advancedOptions.pageRanges
Type: Array<{ start: number, end: number }>
Limits extraction to specific pages. Page numbers are 1-based and inclusive; ranges can overlap or arrive out of order (the platform merges and sorts them). Use it when the relevant data is consistently on known pages of a long document — it reduces processing time and cost.
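The platform performs the merge and sort itself, so the following is only a client-side sketch of what that normalization could look like, which is handy for logging or previewing the pages a run will cover. The helper normalize_page_ranges is hypothetical, and treating adjacent ranges (for example 1-3 and 4-5) as mergeable is an assumption.

```python
def normalize_page_ranges(ranges):
    """Sort 1-based inclusive page ranges and merge overlapping or adjacent ones."""
    for r in ranges:
        if r["start"] < 1 or r["end"] < r["start"]:
            raise ValueError(f"invalid page range: {r}")
    merged = []
    for r in sorted(ranges, key=lambda r: (r["start"], r["end"])):
        if merged and r["start"] <= merged[-1]["end"] + 1:
            # Overlaps or touches the previous range: extend it.
            merged[-1]["end"] = max(merged[-1]["end"], r["end"])
        else:
            merged.append({"start": r["start"], "end": r["end"]})
    return merged


page_ranges = [
    {"start": 10, "end": 12},
    {"start": 1, "end": 3},
    {"start": 2, "end": 5},
]
normalized = normalize_page_ranges(page_ranges)
print(normalized)
# prints: [{'start': 1, 'end': 5}, {'start': 10, 'end': 12}]
```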
advancedOptions.excelSheetSelectionStrategy
Type: "intelligent" | "all" | "first" | "last"
Chooses which sheets to extract from a workbook.
advancedOptions.excelSheetRanges
Type: Array<ExcelSheetRange>
Restricts extraction to specific sheet-index ranges.
parseConfig
Type: Parse config object
Because Extract runs Parse under the hood, you can tune how the document is parsed before extraction with parseConfig. It accepts the same options as the Parse API — figure parsing, signature detection, agentic OCR, formula parsing, table formatting, and the parse engine. Reach for this when a value isn’t being read correctly (for example, enabling agentic OCR for messy scans).
For every parse option, see the Parse Configuration reference.
To reuse a configuration across runs and workflows, create an Extractor and reference it by id instead of inlining config each time. You can override specific fields per run with extractor.overrideConfig.
An extractor is a kind of processor; see that page for how saving a configuration lets you version, evaluate, and optimize it.