Extraction Overview | Extend Documentation

Extraction pulls the exact fields you need out of a document and returns them as structured JSON. You define a schema describing the fields you want, and Extend returns each value with confidence scores and citations back to the source document. Use it to turn receipts, invoices, statements, contracts, and forms into reliable, machine-readable data.

How Extract works

Extract runs Parse under the hood: it first turns the document into clean, structured content, then uses AI to locate and return only the fields your schema defines, even across complex tables and layouts.

That targeting is what separates the two endpoints. Parse gives you everything in the document as structured chunks with positions and types — the right choice for RAG, document viewers, or feeding full context to an LLM. Extract gives you back only the specific fields you asked for, shaped like your schema.

Because Extract reads from what Parse produces, it can only return what Parse sees. If a value never makes it into the parsed content (an OCR miss or a mangled table, for example), no amount of schema tweaking will pull it out. When a field comes back empty or incorrect, first confirm that the value appears correctly in the Parse output. If it does not, adjust parseConfig; if it does, refine the schema.
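The triage order described above can be sketched as a small check. This is a minimal sketch and not part of the Extend SDK: `diagnose_field` is a hypothetical helper, and it assumes you have already flattened the Parse output to a plain string and know the value a human would read on the page.

```python
def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences don't matter."""
    return " ".join(text.lower().split())


def diagnose_field(parsed_text: str, expected_value: str) -> str:
    """Suggest which setting to adjust when a field is empty or wrong.

    Hypothetical helper: `parsed_text` is the Parse output as one string,
    `expected_value` is the value you expected Extract to return.
    """
    if _normalize(expected_value) in _normalize(parsed_text):
        # Parse saw the value, so the extraction step is what missed it.
        return "schema"
    # Parse never produced the value; no schema change can recover it.
    return "parseConfig"


if __name__ == "__main__":
    parsed = "Load No.   ABC-10025521\nShipper: Acme Manufacturing Co."
    print(diagnose_field(parsed, "ABC-10025521"))
    print(diagnose_field(parsed, "Northwind Distribution LLC"))
```

A substring match is deliberately crude; it only answers whether the value reached the parsed content at all, which is the question to settle before touching the schema.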

Quick start

We’ll pull a few key fields from a bill of lading. For this quick start, the sample file is hosted at the public URL used in the request below.

Grab a key from the Developers page and store it as the EXTEND_API_KEY environment variable. If you’re using an SDK, see the installation instructions.

$ export EXTEND_API_KEY="your_api_key_here"

The /extract endpoint takes a file and a config with the schema you want to pull.

Python

from extend_ai import Extend

client = Extend()

result = client.extract(
    file={
        "url": "https://extend-public-files.s3.us-east-2.amazonaws.com/bill-of-lading.pdf",
    },
    config={
        "schema": {
            "type": "object",
            "properties": {
                "load_number": {
                    "type": ["string", "null"],
                    "description": "The load number on the bill of lading.",
                },
                "shipper_name": {
                    "type": ["string", "null"],
                    "description": "The name of the shipper or origin company.",
                },
                "consignee_name": {
                    "type": ["string", "null"],
                    "description": "The name of the consignee or destination company.",
                },
                "ship_date": {
                    "type": ["string", "null"],
                    "description": "The date the shipment was picked up.",
                    "extend:type": "date",
                },
            },
        },
        "advancedOptions": {"citationsEnabled": True},
    },
)

print(result)

Want to extract from your own document? Upload it first, then pass the returned file id instead of a url (reusing the same config).

Python

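# Assumes `client` from the quick start, and `config` bound to the same
# config dict that was passed inline there.
with open("bill_of_lading.pdf", "rb") as f:
    uploaded = client.files.upload(file=f)

result = client.extract(file={"id": uploaded.id}, config=config)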

Example response

After you run the code snippet above, you’ll see a response like this. Extend parses the document, locates each field you described, and returns an output with two halves: value (your data, shaped like the schema) and metadata (per-field confidence and citations, keyed by the same field paths).

{
  "object": "extract_run",
  "id": "exr_3f1j6I1gsw5k96xFiCnkM",
  "status": "PROCESSED",
  "output": {
    "value": {
      "load_number": "ABC-10025521",
      "shipper_name": "Acme Manufacturing Co.",
      "consignee_name": "Northwind Distribution LLC",
      "ship_date": "2026-03-14"
    },
    "metadata": {
      "load_number": {
        "logprobsConfidence": 1,
        "ocrConfidence": 0.99,
        "citations": [
          {
            "page": { "number": 1, "width": 612, "height": 792 },
            "polygon": [
              { "x": 56.8, "y": 35.2 },
              { "x": 162.2, "y": 35.2 },
              { "x": 162.2, "y": 48.1 },
              { "x": 56.8, "y": 48.1 }
            ],
            "referenceText": "Load No. ABC-10025521"
          }
        ]
      },
      "shipper_name": { "logprobsConfidence": 0.98, "ocrConfidence": 0.97, "citations": [...] },
      "consignee_name": { "logprobsConfidence": 0.98, "ocrConfidence": 0.96, "citations": [...] },
      "ship_date": { "logprobsConfidence": 0.99, "ocrConfidence": 0.98, "citations": [...] }
    }
  }
}

Key fields

Field	What it contains
`output.value`	Your extracted data, shaped exactly like the `schema` you defined.
`output.metadata`	Per-field details keyed by field path (for example `line_items[0].description`).
`metadata[field].logprobsConfidence`	Model confidence for the field, from `0` to `1`.
`metadata[field].ocrConfidence`	OCR confidence for the underlying text, from `0` to `1`.
`metadata[field].citations`	Bounding-box references back to the source document (when `citationsEnabled`).

For full request/response details, see the Create Extract Run API reference.
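To highlight a citation in a document viewer, it often helps to reduce the polygon to a bounding box scaled to the page. The sketch below works on a plain dict shaped like the citation in the example response. It assumes the polygon coordinates use the same units as the page `width` and `height`, which the example suggests but this page does not state; `citation_bbox` is a hypothetical helper, not an SDK function.

```python
def citation_bbox(citation: dict) -> dict:
    """Reduce a citation polygon to a bounding box normalized to 0..1.

    Assumption: polygon x/y values share units with page width/height.
    """
    page = citation["page"]
    xs = [point["x"] for point in citation["polygon"]]
    ys = [point["y"] for point in citation["polygon"]]
    return {
        "page": page["number"],
        "x_min": min(xs) / page["width"],
        "y_min": min(ys) / page["height"],
        "x_max": max(xs) / page["width"],
        "y_max": max(ys) / page["height"],
    }


if __name__ == "__main__":
    # Citation copied from the example response above.
    citation = {
        "page": {"number": 1, "width": 612, "height": 792},
        "polygon": [
            {"x": 56.8, "y": 35.2},
            {"x": 162.2, "y": 35.2},
            {"x": 162.2, "y": 48.1},
            {"x": 56.8, "y": 48.1},
        ],
        "referenceText": "Load No. ABC-10025521",
    }
    print(citation_bbox(citation))
```

Normalized coordinates can be multiplied by the rendered page size in pixels, so the same box works at any zoom level.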

Use the output

Read your data straight off output.value, and use output.metadata to trust high-confidence values and route the rest to review.

Python

# Read the extracted values, shaped like your schema
value = result.output.value
print("Load number:", value["load_number"])

# Use per-field OCR confidence to decide what to trust
for field, meta in result.output.metadata.items():
    if (meta.ocr_confidence or 0) < 0.9:
        print(f"Low confidence on {field}: route to review")

For the full shape, including confidence scores, citations, and insights, see Response Format.
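The routing idea can also be written against the raw JSON response rather than SDK objects. This is a minimal sketch under a few assumptions: it handles top-level fields only (nested paths such as `line_items[0].description` are not resolved), it uses the lower of the two scores, and the `0.9` threshold is an arbitrary starting point. `split_by_confidence` is a hypothetical name.

```python
def split_by_confidence(output: dict, threshold: float = 0.9) -> tuple[dict, dict]:
    """Split output.value into (trusted, needs_review) by per-field confidence.

    A field is trusted only if every score that is present meets the threshold.
    Fields with no scores at all are sent to review.
    """
    trusted, needs_review = {}, {}
    metadata = output.get("metadata") or {}
    for field, value in output["value"].items():
        meta = metadata.get(field) or {}
        # logprobsConfidence may be absent (for example with extraction_light).
        scores = [
            meta[key]
            for key in ("logprobsConfidence", "ocrConfidence")
            if meta.get(key) is not None
        ]
        if scores and min(scores) >= threshold:
            trusted[field] = value
        else:
            needs_review[field] = value
    return trusted, needs_review


if __name__ == "__main__":
    output = {
        "value": {"load_number": "ABC-10025521", "ship_date": "2026-03-14"},
        "metadata": {
            "load_number": {"logprobsConfidence": 1, "ocrConfidence": 0.99},
            "ship_date": {"logprobsConfidence": 0.99, "ocrConfidence": 0.62},
        },
    }
    trusted, needs_review = split_by_confidence(output)
    print("trusted:", trusted)
    print("review:", needs_review)
```

Taking the minimum is a conservative choice: a value the model is sure about can still be wrong if the underlying text was read poorly.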

Schema-less extraction (infer schema at runtime)

Don’t have a schema yet? You can omit config entirely (or pass config without a schema) and Extend will automatically infer a schema from the document before extracting. This is useful for exploring unfamiliar document types, prototyping, or any case where you want structured data without defining the shape upfront.

$ curl -X POST https://api.extend.ai/extract_runs \
>   -H "x-extend-api-version: 2026-02-09" \
>   -H "Authorization: Bearer $EXTEND_API_KEY" \
>   -H "Content-Type: application/json" \
>   -d '{
>     "file": { "url": "https://extend-public-files.s3.us-east-2.amazonaws.com/bill-of-lading.pdf" }
>   }'

Use extractionRules to guide what the inferred schema focuses on:

Python

result = client.extract_runs.create(
    file={
        "url": "https://extend-public-files.s3.us-east-2.amazonaws.com/bill-of-lading.pdf",
    },
    config={
        "extractionRules": "This is a freight bill of lading. We are most concerned with shipment tracking and logistics fields: shipper name and address, consignee name and address, carrier name, PRO number, load number, pickup date, delivery date, and total weight.",
    },
)

Once complete, config.schema in the response reflects the inferred schema so you can inspect or save it.

file.base64 is not supported when using an inferred schema. Use a url, a previously-uploaded file id, or raw text instead.
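A client-side guard can catch the unsupported combination before a request is sent. The sketch below checks a plain request body; it assumes the inline file content lives under a `base64` key of `file` (inferred from the `file.base64` name above), and `check_extract_request` is a hypothetical helper rather than part of the SDK.

```python
def check_extract_request(body: dict) -> None:
    """Raise ValueError if the body pairs an inferred schema with file.base64.

    The schema is treated as inferred when `config` is missing or has no
    `schema` key, matching the behavior described above.
    """
    config = body.get("config") or {}
    file_ref = body.get("file") or {}
    schema_is_inferred = "schema" not in config
    if schema_is_inferred and "base64" in file_ref:
        raise ValueError(
            "file.base64 is not supported with an inferred schema; "
            "pass a url or a previously uploaded file id, or add config.schema"
        )


if __name__ == "__main__":
    # Allowed: URL input with an inferred schema.
    check_extract_request({"file": {"url": "https://example.com/doc.pdf"}})
    # Rejected: base64 input with no schema.
    try:
        check_extract_request({"file": {"base64": "JVBERi0x"}})
    except ValueError as err:
        print("rejected:", err)
```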

Sync vs async

The quick start above calls the synchronous /extract endpoint. For large files and high-volume workloads, use the asynchronous /extract_runs endpoint instead.

See Async Processing for the full comparison, polling options, and webhook setup.
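If you poll rather than use webhooks, the loop usually looks something like the sketch below. It is deliberately independent of the SDK: `fetch_run` is any callable you supply that returns the current run as a dict. Only the `PROCESSED` status appears on this page; the `FAILED` name in `failure_statuses` is an assumption, so check Async Processing for the real status values and recommended intervals.

```python
import time
from typing import Callable, Iterable


def wait_for_run(
    fetch_run: Callable[[], dict],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    failure_statuses: Iterable[str] = ("FAILED",),  # assumed status name
) -> dict:
    """Poll until the run reports PROCESSED, fails, or the timeout passes."""
    failures = set(failure_statuses)
    deadline = time.monotonic() + timeout_s
    while True:
        run = fetch_run()
        status = run.get("status")
        if status == "PROCESSED":
            return run
        if status in failures:
            raise RuntimeError(f"extract run ended with status {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError("extract run did not finish before the timeout")
        time.sleep(interval_s)


if __name__ == "__main__":
    # Simulated run that finishes on the third poll.
    responses = iter(
        [{"status": "PROCESSING"}, {"status": "PROCESSING"}, {"status": "PROCESSED"}]
    )
    print(wait_for_run(lambda: next(responses), interval_s=0.01))
```

Any status that is neither `PROCESSED` nor a listed failure is treated as still in progress, so an unexpected status leads to a timeout rather than a silent early return.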

Save it as a processor

The quick start runs with an inline config, which is perfect for getting started. To reuse a configuration across runs — and to version it, measure its accuracy, and optimize it — save it as an extractor, a kind of processor. Processors are the saved entities you iterate on in the dashboard, run evaluation sets against, and improve with Composer.

Configuration

The quick start sends just file and config.schema. To control how extraction runs, pass more options inside config. Here are the most commonly used ones; for the full reference, see Configuration.

Schema

The schema is the heart of every extraction — a JSON Schema describing the fields you want. Add array properties to pull repeating rows like line items, nest object properties for grouped data, and use extend:type for typed fields like dates and currency.

{
  "config": {
    "schema": {
      "type": "object",
      "properties": {
        "invoice_number": { "type": ["string", "null"], "description": "The invoice number." }
      }
    }
  }
}

Full schema reference →
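To make the array and nested-object patterns concrete, here is a sketch that builds a larger schema as a Python dict. The field names (`invoice_date`, `vendor`, `line_items`) are illustrative, and the helper `nullable_string` is hypothetical. It sticks to nullable strings and the `date` value of `extend:type` because those are the only forms shown on this page; see the schema reference for other types.

```python
from typing import Optional


def nullable_string(description: str, extend_type: Optional[str] = None) -> dict:
    """Build a nullable string property, optionally tagged with extend:type."""
    field = {"type": ["string", "null"], "description": description}
    if extend_type is not None:
        field["extend:type"] = extend_type
    return field


def build_invoice_schema() -> dict:
    """Illustrative schema with a typed field, a nested object, and an array."""
    return {
        "type": "object",
        "properties": {
            "invoice_number": nullable_string("The invoice number."),
            "invoice_date": nullable_string("The date the invoice was issued.", "date"),
            # Nested object for grouped data.
            "vendor": {
                "type": "object",
                "properties": {
                    "name": nullable_string("The vendor's company name."),
                    "address": nullable_string("The vendor's mailing address."),
                },
            },
            # Array of objects for repeating rows.
            "line_items": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "description": nullable_string("The line item description."),
                        "sku": nullable_string("The product code for the line item."),
                    },
                },
            },
        },
    }


if __name__ == "__main__":
    import json

    print(json.dumps({"config": {"schema": build_invoice_schema()}}, indent=2))
```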

Base processor

Choose the processor based on your accuracy and latency needs.

{ "config": { "baseProcessor": "extraction_performance" } }

Processor	When to use
`extraction_performance`	Highest accuracy across complex layouts and tables (default).
`extraction_light`	Faster, cheaper extraction for simple documents. Does not return `logprobsConfidence`.

Extraction rules

Steer the model with plain-language rules — useful for disambiguating fields, setting formats, or encoding business logic.

{
  "config": {
    "extractionRules": "If multiple totals appear, use the grand total. Return all dates in ISO 8601 format."
  }
}

Citations

Return a bounding-box citation and source text for each field, so you can highlight where every value came from.

{ "config": { "advancedOptions": { "citationsEnabled": true } } }

Generating citations uses an additional citation-focused model, which moderately increases latency. See Response Format.

Parse config

Because Extract runs Parse under the hood, you can tune how the document is parsed before extraction with parseConfig. Reach for this when a value isn’t being read correctly — for example, enabling agentic OCR for messy scans.

{
  "config": {
    "parseConfig": {
      "blockOptions": { "text": { "agentic": { "enabled": true } } }
    }
  }
}

For every option, including advanced options, the review agent, and Excel settings, see the Configuration reference.
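The options in this section all live side by side in one `config` object. The sketch below assembles them from keyword arguments, using only the keys shown above. `build_config` and its parameter names are hypothetical conveniences, not part of the SDK, and the processor check covers only the two processors listed on this page.

```python
from typing import Optional

# The two base processors listed in the table above.
KNOWN_BASE_PROCESSORS = {"extraction_performance", "extraction_light"}


def build_config(
    schema: Optional[dict] = None,
    base_processor: Optional[str] = None,
    extraction_rules: Optional[str] = None,
    citations: bool = False,
    agentic_ocr: bool = False,
) -> dict:
    """Assemble a config dict, including only the options that were set."""
    config: dict = {}
    if schema is not None:
        config["schema"] = schema
    if base_processor is not None:
        if base_processor not in KNOWN_BASE_PROCESSORS:
            raise ValueError(f"unknown base processor: {base_processor}")
        config["baseProcessor"] = base_processor
    if extraction_rules:
        config["extractionRules"] = extraction_rules
    if citations:
        config["advancedOptions"] = {"citationsEnabled": True}
    if agentic_ocr:
        config["parseConfig"] = {
            "blockOptions": {"text": {"agentic": {"enabled": True}}}
        }
    return config


if __name__ == "__main__":
    import json

    config = build_config(
        base_processor="extraction_light",
        extraction_rules="If multiple totals appear, use the grand total.",
        citations=True,
    )
    print(json.dumps(config, indent=2))
```

Leaving unset options out of the dict, rather than sending explicit defaults, keeps the request aligned with whatever defaults the service applies.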

Multifile extraction

Both /extract and /extract_runs support extracting from multiple documents in a single run. Pass a package object instead of file:

{
  "extractor": { "id": "ex_abc123" },
  "package": {
    "files": [
      { "url": "https://example.com/doc1.pdf" },
      { "url": "https://example.com/doc2.pdf" }
    ]
  }
}

The model sees all files together and returns one output.value covering the full corpus. The response includes files (an ordered array of file summaries) instead of file.

See Multifile Extraction for the full guide, including constraints, response shape, and how it compares to batch processing.
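A request body like the one above can be generated from a list of URLs. This is a minimal sketch that mirrors the JSON shown in this section; `build_package_request` is a hypothetical helper, and it does not enforce any of the constraints covered in the Multifile Extraction guide.

```python
import json


def build_package_request(extractor_id: str, urls: list[str]) -> dict:
    """Build a multifile request body that uses `package` instead of `file`."""
    if not urls:
        raise ValueError("a package needs at least one file")
    return {
        "extractor": {"id": extractor_id},
        # Order is preserved; the response lists file summaries in order.
        "package": {"files": [{"url": url} for url in urls]},
    }


if __name__ == "__main__":
    body = build_package_request(
        "ex_abc123",
        ["https://example.com/doc1.pdf", "https://example.com/doc2.pdf"],
    )
    print(json.dumps(body, indent=2))
```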

Next steps

Configuration

Property types, schema generation, and the full configuration reference.

Response Format

The value and metadata objects, confidence scores, and citations.

Schema

Define objects, arrays, enums, and typed fields.

Best Practices

Field naming, prompt crafting, and accuracy tuning.