
Configuration


The Extract API accepts a config object that controls how documents are processed and how values are returned. Configuration options are organized into several categories:

  • Schema: The JSON Schema describing the fields to extract (required).
  • Base processor: The model family that powers extraction (accuracy vs. speed and cost).
  • Extraction rules: Natural-language guidance for the model.
  • Advanced options: Citations, multimodal processing, large-array handling, chunking, the Review Agent, and Excel.
  • Parse config: How the document is parsed before extraction.
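
As a rough sketch of how the categories above fit together, the snippet below assembles one config with a key per category and serializes it to JSON. The key names come from the sections on this page; the values are illustrative placeholders, and the outer "config" wrapper simply mirrors the JSON examples shown here.

```python
import json

# One top-level key per category listed above. Values are placeholders.
config = {
    "schema": {
        "type": "object",
        "properties": {
            "invoice_number": {
                "type": ["string", "null"],
                "description": "The invoice number.",
            }
        },
    },
    "baseProcessor": "extraction_performance",
    "extractionRules": "If multiple totals appear, use the grand total.",
    "advancedOptions": {"citationsEnabled": True},
    "parseConfig": {
        "blockOptions": {"text": {"agentic": {"enabled": True}}}
    },
}

# Wrap in the "config" key used by the JSON examples on this page.
request_fragment = json.dumps({"config": config}, indent=2)
print(request_fragment)
```

Only schema is marked as required on this page; the other keys can be left out to fall back to defaults.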

For default values and the full schema, see the Create Extract Run API reference.

Prefer a UI? Extend Studio lets you configure an extractor visually and export the config JSON.


Schema

schema

Type: JSON Schema object (required)

Defines the fields to extract and their shape. The root must be an object; each property describes a field you want returned in output.value. Add array properties for repeating rows, nest object properties for grouped data, and use extend:type for typed fields like dates, currency, and signatures.

{
  "config": {
    "schema": {
      "type": "object",
      "properties": {
        "invoice_number": { "type": ["string", "null"], "description": "The invoice number." }
      }
    }
  }
}

For the full reference — objects, arrays, enums, and custom types — see Schema.
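
To illustrate the array and nested-object shapes described above, here is a minimal sketch of a larger schema written as a Python dict. The field names (vendor, line_items, and their properties) are hypothetical, the extend:type keyword is deliberately left out because its exact syntax is covered in the Schema reference, and the helper only checks the one rule stated here: the root must be an object.

```python
import json

# Hypothetical invoice schema: one scalar, one nested object, one array of rows.
schema = {
    "type": "object",
    "properties": {
        "invoice_number": {
            "type": ["string", "null"],
            "description": "The invoice number.",
        },
        # Nested object for grouped data.
        "vendor": {
            "type": "object",
            "properties": {
                "name": {"type": ["string", "null"], "description": "Vendor name."},
                "address": {"type": ["string", "null"], "description": "Vendor address."},
            },
        },
        # Array property for repeating rows.
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": ["string", "null"], "description": "Row description."},
                    "total": {"type": ["number", "null"], "description": "Row total."},
                },
            },
        },
    },
}


def check_root_is_object(candidate: dict) -> None:
    """Raise ValueError unless the schema root is an object with properties."""
    if candidate.get("type") != "object" or "properties" not in candidate:
        raise ValueError("schema root must be an object with properties")


check_root_is_object(schema)
print(json.dumps({"config": {"schema": schema}}, indent=2))
```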


Base processor

baseProcessor

Type: "extraction_performance" | "extraction_light" (default: "extraction_performance")

Selects the model family that powers extraction.

| Processor | When to use |
| --- | --- |
| extraction_performance | Best for complex documents, high accuracy requirements, and multimodal content. Offers higher accuracy on complex layouts, better handling of handwritten content, and more sophisticated reasoning, and parses documents as markdown for better performance. This is the default. |
| extraction_light | Best for high-volume processing, cost-sensitive applications, and simple document types. Offers faster processing and lower cost per run with good accuracy for straightforward extractions, but does not support advanced visual features (figure parsing, signature detection, page rotation). |

{
  "config": {
    "baseProcessor": "extraction_performance"
  }
}

baseVersion

Type: string

Pins the run to a specific version of the selected processor. If omitted, the latest stable version is used. See the Extraction Performance versions page for the changelog.

{
  "config": {
    "baseProcessor": "extraction_performance",
    "baseVersion": "4.6.0"
  }
}

Extraction rules

extractionRules

Type: string

Plain-language rules that steer the model — useful for disambiguating fields, setting formats, or encoding business logic. Applied across the whole extraction.

{
  "config": {
    "extractionRules": "If multiple totals appear, use the grand total. Return all dates in ISO 8601 format."
  }
}

Advanced options

Citations

advancedOptions.citationsEnabled

Type: boolean

Returns spatial (bounding-box) references and source text for each extracted value. Useful for highlighting and validation in review interfaces, but adds processing overhead. See Citations for the response shape.

Generating citations uses an additional citation-focused model, which moderately increases latency. Disable citations in latency-critical pipelines that don’t need spatial references.

advancedOptions.citationMode

Type: "line" | "word" | "block" (default: "line")

Controls the granularity of each citation. Requires citationsEnabled: true and a base processor version that supports bounding-box citations.

  • line — returns one or more relevant OCR lines per citation (default).
  • word — narrows to the relevant OCR word span when possible. Useful for precise citations from a table cell to an array property (e.g. line_items.total).
  • block — returns block-level polygons (paragraphs, key-value regions, tables). Highest recall, lowest granularity.

advancedOptions.arrayCitationStrategy

Type: "item" | "property"

Granularity for citations on array fields. Requires citationsEnabled: true; property-level citations also require extraction_performance ≥ 4.4.0.

{
  "config": {
    "advancedOptions": {
      "citationsEnabled": true,
      "citationMode": "line",
      "arrayCitationStrategy": "property"
    }
  }
}
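
The dependencies between the citation options can be checked on the client before submitting a run. The helper below is a hypothetical pre-flight check, not part of any SDK. It assumes that baseVersion is a dotted numeric string such as "4.6.0", and that an omitted baseVersion resolves to a version new enough for property-level citations.

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted numeric version such as "4.6.0" into (4, 6, 0)."""
    return tuple(int(part) for part in version.split("."))


def citation_config_problems(config: dict) -> list:
    """Return human-readable problems with the citation options (empty if none)."""
    problems = []
    advanced = config.get("advancedOptions", {})
    enabled = advanced.get("citationsEnabled", False)

    if "citationMode" in advanced and not enabled:
        problems.append("citationMode requires citationsEnabled: true")

    strategy = advanced.get("arrayCitationStrategy")
    if strategy is not None and not enabled:
        problems.append("arrayCitationStrategy requires citationsEnabled: true")

    if strategy == "property":
        processor = config.get("baseProcessor", "extraction_performance")
        version = config.get("baseVersion")
        if processor != "extraction_performance":
            problems.append("property-level citations need extraction_performance")
        elif version is not None and parse_version(version) < (4, 4, 0):
            problems.append("property-level citations need baseVersion >= 4.4.0")

    return problems


config = {
    "baseProcessor": "extraction_performance",
    "baseVersion": "4.6.0",
    "advancedOptions": {
        "citationsEnabled": True,
        "citationMode": "line",
        "arrayCitationStrategy": "property",
    },
}
print(citation_config_problems(config))  # []
```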

Multimodal

advancedOptions.advancedMultimodalEnabled

Type: boolean

Uses vision-language models to better understand visual elements in the document. Essential for scanned documents, handwritten content, checks and forms, and poor-quality images. It adds latency, so disable it for clean digital PDFs, text-only documents, and latency-critical workflows where visual understanding isn’t required.

{
  "config": {
    "advancedOptions": {
      "advancedMultimodalEnabled": true
    }
  }
}

Reasoning insights

advancedOptions.modelReasoningInsightsEnabled

Type: boolean

Returns the model’s reasoning for each field as reasoning entries in the metadata insights array. Useful for debugging and validation during development; consider disabling it in production to reduce overhead. See Insights.

{
  "config": {
    "advancedOptions": {
      "modelReasoningInsightsEnabled": true
    }
  }
}

Review Agent

advancedOptions.reviewAgent.enabled

Type: boolean

When enabled, an automated agent reviews each extracted value and adds a reviewAgentScore (1–5) to the field’s metadata, plus issue and review_summary insights that flag fields needing manual review. See Review Agent.

{
  "config": {
    "advancedOptions": {
      "reviewAgent": { "enabled": true }
    }
  }
}

Current date

advancedOptions.currentDateEnabled

Type: boolean (default: false)

Includes the current date as context for the model during extraction.

{
  "config": {
    "advancedOptions": {
      "currentDateEnabled": true
    }
  }
}

Large arrays

advancedOptions.arrayStrategy.type

Type: "large_array_heuristics" | "large_array_max_context" | "large_array_overlap_context"

Controls how very large arrays (for example, hundreds of line items across many pages) are extracted and merged. Omit arrayStrategy for the default behavior; set it only for large-array use cases. If you’re unsure which to use, reach out to the Extend team.

| Strategy | Latency / cost | Description |
| --- | --- | --- |
| (omit) | Standard | Default. Arrays are merged using intelligent (Performance) or confidence (Light) merging. |
| large_array_heuristics | Lower | Optimized for very large arrays where latency matters, using simpler chunking and merging heuristics. |
| large_array_max_context | Higher (≈2× credits) | Multiple passes through the document for maximum accuracy. |
| large_array_overlap_context | Medium | Keeps surrounding page context for each chunk to eliminate context loss at chunk boundaries. |

{
  "config": {
    "advancedOptions": {
      "arrayStrategy": { "type": "large_array_heuristics" }
    }
  }
}

Chunking and merging

Extract breaks large documents into chunks, extracts from each, and merges the results. These options tune that process.

advancedOptions.chunkingOptions.chunkingStrategy

Type: "standard" | "semantic"

  • standard — page-based chunking with heuristics (e.g. reduces chunk size for large tables). Works for most documents.
  • semantic — uses AI to intelligently determine whether pages can be split without breaking content relationships.

advancedOptions.chunkingOptions.pageChunkSize

Type: integer

The number of pages per chunk (25 by default). Larger chunks mean fewer processing calls and less overhead; smaller chunks can lower latency for large-array extraction.
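
To make the effect of pageChunkSize concrete, here is a naive sketch of fixed-size page chunking. It is an illustration only: the function name is hypothetical, and the platform's standard strategy also applies heuristics (such as shrinking chunks around large tables) that this sketch ignores.

```python
def page_chunks(page_count: int, page_chunk_size: int = 25) -> list:
    """Split pages 1..page_count into inclusive (start, end) chunks."""
    if page_count < 0 or page_chunk_size < 1:
        raise ValueError("page_count must be >= 0 and page_chunk_size >= 1")
    chunks = []
    for start in range(1, page_count + 1, page_chunk_size):
        end = min(start + page_chunk_size - 1, page_count)
        chunks.append((start, end))
    return chunks


print(page_chunks(60))           # [(1, 25), (26, 50), (51, 60)]
print(len(page_chunks(60, 10)))  # 6
```

With the default size a 60-page document yields three chunks; dropping the size to 10 yields six smaller ones, which is the trade-off between per-chunk overhead and per-chunk latency described above.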

advancedOptions.chunkingOptions.chunkSelectionStrategy

Type: "intelligent" | "confidence" | "take_first" | "take_last"

When the same field is found in multiple chunks, this decides which value wins.

| Strategy | Speed | Description |
| --- | --- | --- |
| intelligent | Slowest | Uses an additional LLM call to pick the most accurate value from document context. |
| confidence | Fast | Selects the value with the highest confidence score. |
| take_first | Fastest | Takes the first non-null value (earliest page). Best when authoritative values appear at the start. |
| take_last | Fastest | Takes the last non-null value (latest page). Best when authoritative values appear at the end. |
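
The three non-LLM strategies can be sketched in a few lines. This is a conceptual illustration rather than the platform's implementation: the candidate shape (a dict with "value" and "confidence", ordered by page) is an assumption, and intelligent is left out because it depends on an additional LLM call.

```python
from typing import Any, Optional


def select_value(candidates: list, strategy: str) -> Optional[Any]:
    """Pick one value from per-chunk candidates, ordered from earliest to latest page."""
    non_null = [c for c in candidates if c["value"] is not None]
    if not non_null:
        return None
    if strategy == "take_first":
        return non_null[0]["value"]
    if strategy == "take_last":
        return non_null[-1]["value"]
    if strategy == "confidence":
        return max(non_null, key=lambda c: c["confidence"])["value"]
    raise ValueError(f"unsupported strategy in this sketch: {strategy}")


# Hypothetical values for one field found in four chunks.
candidates = [
    {"value": None, "confidence": 0.0},
    {"value": "A", "confidence": 0.6},
    {"value": "B", "confidence": 0.9},
    {"value": "C", "confidence": 0.7},
]
print(select_value(candidates, "take_first"))  # A
print(select_value(candidates, "take_last"))   # C
print(select_value(candidates, "confidence"))  # B
```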

advancedOptions.chunkingOptions.customSemanticChunkingRules

Type: string

Custom rules to guide semantic chunking.

{
  "config": {
    "advancedOptions": {
      "chunkingOptions": {
        "chunkingStrategy": "standard",
        "pageChunkSize": 25,
        "chunkSelectionStrategy": "confidence"
      }
    }
  }
}

Large tables can shrink the effective chunk size when chunking by page. To preserve context across a long table, try intelligent merging (chunkSelectionStrategy: "intelligent") and enable table header continuation in parseConfig (see Parse config).

Page ranges

advancedOptions.pageRanges

Type: Array<{ start: number, end: number }>

Limits extraction to specific pages. Page numbers are 1-based and inclusive; ranges can overlap or arrive out of order (the platform merges and sorts them). Use it when the relevant data is consistently on known pages of a long document — it reduces processing time and cost.

{
  "config": {
    "advancedOptions": {
      "pageRanges": [
        { "start": 1, "end": 5 }
      ]
    }
  }
}
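
Because overlapping and out-of-order ranges are accepted, it can help to see what the normalized result looks like. The function below is a hypothetical client-side sketch of sorting and merging 1-based inclusive ranges. The platform's own merging may differ in details, for example in whether adjacent ranges such as 1-5 and 6-8 are joined (this sketch keeps them separate).

```python
def normalize_page_ranges(ranges: list) -> list:
    """Sort 1-based inclusive page ranges and merge any that overlap."""
    ordered = sorted(ranges, key=lambda r: (r["start"], r["end"]))
    merged = []
    for r in ordered:
        if r["start"] < 1 or r["end"] < r["start"]:
            raise ValueError(f"invalid page range: {r}")
        if merged and r["start"] <= merged[-1]["end"]:
            # Overlaps the previous range: extend it instead of adding a new one.
            merged[-1]["end"] = max(merged[-1]["end"], r["end"])
        else:
            merged.append({"start": r["start"], "end": r["end"]})
    return merged


page_ranges = [
    {"start": 10, "end": 12},
    {"start": 1, "end": 5},
    {"start": 4, "end": 8},
]
print(normalize_page_ranges(page_ranges))
# [{'start': 1, 'end': 8}, {'start': 10, 'end': 12}]
```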

Excel

advancedOptions.excelSheetSelectionStrategy

Type: "intelligent" | "all" | "first" | "last"

Chooses which sheets to extract from a workbook.

advancedOptions.excelSheetRanges

Type: Array<ExcelSheetRange>

Restricts extraction to specific sheet-index ranges.

{
  "config": {
    "advancedOptions": {
      "excelSheetSelectionStrategy": "intelligent"
    }
  }
}

Parse config

parseConfig

Type: Parse config object

Because Extract runs Parse under the hood, you can tune how the document is parsed before extraction with parseConfig. It accepts the same options as the Parse API — figure parsing, signature detection, agentic OCR, formula parsing, table formatting, and the parse engine. Reach for this when a value isn’t being read correctly (for example, enabling agentic OCR for messy scans).

{
  "config": {
    "parseConfig": {
      "blockOptions": {
        "text": { "agentic": { "enabled": true } }
      }
    }
  }
}

For every parse option, see the Parse Configuration reference.


Using a saved extractor

To reuse a configuration across runs and workflows, create an Extractor and reference it by id instead of inlining config each time. You can override specific fields per run with overrideConfig.

An extractor is a kind of processor — see that page for how saving a configuration lets you version, evaluate, and optimize it.

  • Create an extractor — set up a new extractor with your configuration.
  • Update an extractor — modify an existing extractor’s configuration.
  • Run an extractor — execute an extractor, optionally with extractor.overrideConfig.
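
As a rough sketch of referencing a saved extractor with a per-run override, the helper below builds the extractor portion of a request body. The nesting ("extractor" containing "id" and "overrideConfig") is inferred from the extractor.overrideConfig path mentioned above and should be confirmed against the Run an extractor reference. The helper name and the placeholder id are hypothetical, and other fields a real request needs (such as the input file) are omitted.

```python
import json
from typing import Optional


def extractor_reference(extractor_id: str, override_config: Optional[dict] = None) -> dict:
    """Build the assumed "extractor" portion of a run request."""
    extractor = {"id": extractor_id}
    if override_config:
        # Only the fields listed here are meant to differ from the saved config.
        extractor["overrideConfig"] = override_config
    return {"extractor": extractor}


body = extractor_reference(
    "YOUR_EXTRACTOR_ID",  # placeholder, not a real id format
    {"advancedOptions": {"citationsEnabled": True}},
)
print(json.dumps(body, indent=2))
```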