Best Practices: Advanced Options

Extend offers several advanced configuration options to optimize extraction performance, accuracy, and cost. This guide provides detailed explanations of each option.

For a quick guide to reducing latency, see Latency Optimization.

Extraction Performance vs. Light

Extraction Performance vs. Light options

Extraction Performance

Best for: Complex documents, high accuracy requirements, multimodal content

Characteristics:

  • Higher accuracy for complex layouts
  • Better handling of handwritten content
  • More sophisticated reasoning capabilities
  • Parses documents as markdown for better performance

Configuration:

config: {
  "type": "EXTRACT",
  "baseProcessor": "extraction_performance"
}

Extraction Light

Best for: High-volume processing, cost-sensitive applications, simple document types

Characteristics:

  • Faster processing
  • Lower cost per processor run
  • Good accuracy for straightforward extractions
  • Removes support for advanced visual features (e.g. figure parsing, signature detection, page rotation)

Configuration:

config: {
  "type": "EXTRACT",
  "baseProcessor": "extraction_light"
}
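The trade-off between the two base processors can be sketched as a small selection helper. This is only an illustration: the function names and the decision rule are hypothetical, and only the `type` and `baseProcessor` keys and their values come from the examples above.

```python
import json

# Base processor identifiers as they appear in the config examples above.
PERFORMANCE = "extraction_performance"
LIGHT = "extraction_light"


def choose_base_processor(has_handwriting: bool,
                          complex_layout: bool,
                          needs_visual_features: bool) -> str:
    """Hypothetical rule of thumb: use Light only for simple documents."""
    if has_handwriting or complex_layout or needs_visual_features:
        return PERFORMANCE
    return LIGHT


def build_config(base_processor: str) -> dict:
    """Build the config fragment shown in this section."""
    return {"type": "EXTRACT", "baseProcessor": base_processor}


if __name__ == "__main__":
    processor = choose_base_processor(
        has_handwriting=False,
        complex_layout=False,
        needs_visual_features=False,
    )
    print(json.dumps(build_config(processor), indent=2))
```

In practice the choice is best validated against a sample of your own documents rather than decided by a fixed rule like this one.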

Core Performance Settings

Core performance settings in Extend Studio

Bounding Box Citations

Bounding box citations provide spatial location references for extracted values. While useful for highlighting and validation in review interfaces, they add processing overhead.

For more details, see the Citations documentation.

Configuration:

config: {
  "type": "EXTRACT",
  "advancedOptions": {
    "citationsEnabled": false
  }
}

Advanced Multimodal

Advanced multimodal processing uses vision-language models to better understand visual elements in documents. While this adds latency, it is essential for:

  • Scanned documents
  • Handwritten content
  • Checks and forms
  • Poor quality images

Disable for:

  • Clean, digital PDFs
  • Documents containing primarily text
  • Latency-critical workflows where visual understanding is not required

Configuration:

config: {
  "type": "EXTRACT",
  "advancedOptions": {
    "advancedMultimodalEnabled": false
  }
}

Model Reasoning Insights

Model reasoning insights provide explanations for the model’s decision-making process, which adds processing overhead. These are primarily useful for debugging and validation during development, and can be disabled in production.

Configuration:

config: {
  "type": "EXTRACT",
  "advancedOptions": {
    "modelReasoningInsightsEnabled": false
  }
}
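The three core settings above are shown one at a time. A minimal sketch of a latency-focused config that sets all of them together might look like the following. It assumes the three flags can share a single `advancedOptions` object, which the examples above suggest but do not state. The function name and its parameters are hypothetical.

```python
import json


def latency_focused_config(needs_citations: bool = False,
                           needs_visual_understanding: bool = False,
                           debugging: bool = False) -> dict:
    """Enable each overhead-adding feature only when it is actually needed.

    Assumption: the three flags can be combined in one advancedOptions object.
    """
    return {
        "type": "EXTRACT",
        "advancedOptions": {
            "citationsEnabled": needs_citations,
            "advancedMultimodalEnabled": needs_visual_understanding,
            "modelReasoningInsightsEnabled": debugging,
        },
    }


if __name__ == "__main__":
    # Clean digital PDFs in production: everything optional is switched off.
    print(json.dumps(latency_focused_config(), indent=2))
```

For scanned or handwritten documents, pass `needs_visual_understanding=True` so that advanced multimodal processing stays enabled.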

Array Strategies

Array strategies control how large arrays are extracted and merged across document chunks. The default (none) uses standard extraction behavior.

Array strategy options in Extend Studio

| Strategy | Latency | Description |
|---|---|---|
| none | Standard | Default behavior. Arrays merged using intelligent (Performance) or confidence (Light) merging. |
| large_array_heuristics | Lower | Optimized for very large arrays where latency matters. Uses simpler merging logic. |
| large_array_max_context | Higher | Multiple passes through the document for maximum accuracy. Doubles credit cost. |
| large_array_overlap_context | Medium | Maintains surrounding page context for each chunk to eliminate context loss at boundaries. |
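One way to read the array strategy options is as a decision over competing priorities. The sketch below encodes that reading. The function name, its parameters, and the priority order are assumptions made for illustration; only the strategy names come from the options listed above.

```python
def choose_array_strategy(latency_critical: bool,
                          accuracy_critical: bool,
                          loses_context_at_boundaries: bool) -> str:
    """Hypothetical priority order: accuracy, then boundary context, then latency."""
    if accuracy_critical:
        # Highest latency; the options above note that it doubles credit cost.
        return "large_array_max_context"
    if loses_context_at_boundaries:
        # Medium latency; keeps surrounding page context for each chunk.
        return "large_array_overlap_context"
    if latency_critical:
        # Lower latency; uses simpler merging logic.
        return "large_array_heuristics"
    # Default behavior.
    return "none"


if __name__ == "__main__":
    print(choose_array_strategy(latency_critical=True,
                                accuracy_critical=False,
                                loses_context_at_boundaries=False))
```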

Page Ranges

Page ranges settings in Extend Studio

For documents where you only need to extract from specific pages, limiting the page range reduces processing time by skipping unnecessary content.

Configuration:

config: {
  "type": "EXTRACT",
  "advancedOptions": {
    "pageRanges": [
      { "start": 1, "end": 5 }
    ]
  }
}

Use when:

  • Relevant data is consistently located on specific pages
  • Processing long documents where only a portion is needed
  • Standardized document formats with predictable layouts
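When page ranges are generated programmatically, a small validating helper can catch malformed ranges before they reach the config. This sketch assumes page numbers are 1-based and that `end` is inclusive, which the example above suggests but does not confirm. The helper name is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def build_page_ranges(ranges: Iterable[Tuple[int, int]]) -> List[Dict[str, int]]:
    """Convert (start, end) pairs into the pageRanges structure shown above.

    Assumption: pages are 1-based and both bounds are inclusive.
    """
    result = []
    for start, end in ranges:
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: ({start}, {end})")
        result.append({"start": start, "end": end})
    return result


if __name__ == "__main__":
    config = {
        "type": "EXTRACT",
        "advancedOptions": {"pageRanges": build_page_ranges([(1, 5)])},
    }
    print(config)
```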

Chunking & Merging

Chunking and merging are essential pre-processing steps that optimize document processing by breaking large documents into manageable pieces and intelligently combining related content.

Chunking Strategy

Chunking options in Extend Studio

  • Standard: Page-based chunking with heuristics (reduces chunk size for large tables, etc.)
  • Semantic: Uses AI to intelligently determine if pages can be split without breaking content relationships

Configuration:

config: {
  "type": "EXTRACT",
  "advancedOptions": {
    "chunkingOptions": {
      "chunkingStrategy": "standard"
    }
  }
}

Chunk Types

  • Section: Splits by logical sections (headings, subheadings). Preserves content structure.
  • Page: Groups by pages (25 by default*). This is the standard chunking setting and works for most documents.
  • Document: Treats the entire document as a single chunk. Fastest for non-array extraction.

* The default page chunk size can be smaller than 25 if large tables are present.

Chunk Size

Chunk size settings in Extend Studio

The optimal chunk size depends on your extraction type:

  • For large array extraction: Decreasing chunk size lowers latency by reducing intelligent chunking/merging overhead
  • For non-array extraction: Setting chunk type to document is fastest as it skips intelligent merging entirely
  • General rule: Larger chunks (20-25 pages) mean fewer processing calls and less overhead

Configuration:

config: {
  "type": "EXTRACT",
  "advancedOptions": {
    "chunkingOptions": {
      "pageChunkSize": 25
    }
  }
}
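To see how chunk size affects the number of processing calls, the chunk count for page-based chunking can be estimated with simple arithmetic. This is a rough sketch: it ignores the heuristics that shrink chunks when large tables are present, and the function name is hypothetical.

```python
import math


def estimate_chunk_count(total_pages: int, page_chunk_size: int = 25) -> int:
    """Estimate how many page-based chunks a document produces.

    Simplification: assumes every chunk holds exactly page_chunk_size pages,
    except possibly the last one.
    """
    if total_pages < 0 or page_chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and page_chunk_size >= 1")
    return math.ceil(total_pages / page_chunk_size)


if __name__ == "__main__":
    for size in (10, 25):
        print(size, estimate_chunk_count(120, size))
```

Under these assumptions, a 120-page document yields 5 chunks at a chunk size of 25 and 12 chunks at a chunk size of 10.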

Merging Strategy

Merging strategy settings in Extend Studio

When the same field is extracted from multiple chunks, the merging strategy determines which value to use.

| Strategy | Speed | Description |
|---|---|---|
| intelligent | Slowest | Uses an additional LLM call to analyze all extracted values and select the most accurate one based on document context. |
| confidence | Fast | Selects the value with the highest confidence score. |
| take_first | Fastest | Takes the first non-null value found (earliest page). Best when authoritative values appear at the document start. |
| take_last | Fastest | Takes the last non-null value found (latest page). Best when authoritative values appear at the document end. |

Configuration:

config: {
  "type": "EXTRACT",
  "advancedOptions": {
    "chunkingOptions": {
      "chunkSelectionStrategy": "confidence"
    }
  }
}
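The selection rules for the three non-LLM merging strategies can be illustrated with a short sketch. This is a conceptual model and not Extend's implementation: the `ChunkValue` type and the `merge_field` function are hypothetical, and `intelligent` is left out because it depends on an additional LLM call.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ChunkValue:
    """Hypothetical record of one chunk's extracted value for a field."""
    page: int
    value: Optional[str]
    confidence: float


def merge_field(values: List[ChunkValue], strategy: str) -> Optional[str]:
    """Pick one value for a field that was extracted from several chunks."""
    # Null values are never selected; order the rest by page number.
    candidates = sorted(
        (v for v in values if v.value is not None),
        key=lambda v: v.page,
    )
    if not candidates:
        return None
    if strategy == "take_first":
        return candidates[0].value   # earliest page
    if strategy == "take_last":
        return candidates[-1].value  # latest page
    if strategy == "confidence":
        return max(candidates, key=lambda v: v.confidence).value
    raise ValueError(f"unsupported strategy in this sketch: {strategy}")


if __name__ == "__main__":
    extracted = [
        ChunkValue(page=1, value=None, confidence=0.0),
        ChunkValue(page=3, value="A", confidence=0.6),
        ChunkValue(page=30, value="B", confidence=0.9),
        ChunkValue(page=55, value="C", confidence=0.7),
    ]
    for name in ("take_first", "take_last", "confidence"):
        print(name, merge_field(extracted, name))
```

With the sample data, `take_first` selects "A", `take_last` selects "C", and `confidence` selects "B".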

Chunking Tips & Common Issues

Table Splitting: Large tables can reduce the default chunk size, especially when chunking by page. Test these options to preserve context across large tables:

| Option | Setting |
|---|---|
| Merging strategy | Intelligent |
| Chunk type | Section |
| Parsing options | Enable table header continuation |

Parser Configuration

Parser Block Options

Parser block options in Extend Studio

Figure Parsing: Converts charts, diagrams, and images into text descriptions that extraction can read. Disable if your documents don’t contain important visual elements.

Signature Detection: Detects signatures on documents and determines whether they’re signed. Disable if signature verification is not needed.

Agentic OCR: Uses AI to fix OCR mistakes, especially for handwritten text or poor-quality scans. Adds processing time and cost. Keep disabled unless you are processing handwriting or poor-quality scans.

Parse Engine

Parse engine options in Extend Studio

The parse engine controls how documents are processed. Performance is the default and recommended for most use cases. Light is cheaper and slightly faster, but does not support all parsing features such as markdown and advanced table parsing.

Use Light when:

  • Processing standard digital documents
  • Tables are simple or not critical
  • Cost and speed are priorities

Avoid Light when:

  • Documents require advanced table parsing
  • Documents have complex multi-column layouts
  • Markdown output is needed

Parallel Extractors

For large extractions or schemas, consider breaking a single extractor into multiple extractors that run in parallel. This is particularly effective when you have both simple top-level fields and complex array extractions.

Workflow showing a financial document split into two parallel extractors

The workflow above shows an example where a financial document is split into two parallel extractors: one for high-level fields, and one for the financial line item details. These run in parallel and are later combined in the workflow output.

Use when:

  • Documents have both simple fields and complex arrays
  • Array extraction is significantly slower than other fields
  • Total latency is critical to your use case
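The parallel pattern itself can be sketched outside of a workflow. The example below illustrates the fan-out and merge idea only and does not use Extend's workflow engine or API: both extractor functions are hypothetical stand-ins that return placeholder data, and in a real setup each would call a separately configured extractor.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_header_fields(document_id: str) -> dict:
    """Hypothetical stand-in for the extractor handling simple top-level fields."""
    return {"document_id": document_id, "header_fields": "placeholder"}


def extract_line_items(document_id: str) -> dict:
    """Hypothetical stand-in for the slower array (line item) extractor."""
    return {"line_items": ["placeholder"]}


def run_parallel_extractors(document_id: str) -> dict:
    """Run both extractors concurrently and combine their outputs."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        header_future = pool.submit(extract_header_fields, document_id)
        items_future = pool.submit(extract_line_items, document_id)
        # Total latency is roughly that of the slower extractor, not the sum.
        return {**header_future.result(), **items_future.result()}


if __name__ == "__main__":
    print(run_parallel_extractors("doc-1"))
```

Threads suit this sketch because each extractor call would be I/O-bound (waiting on a remote service) rather than CPU-bound.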