Overview


Extraction pulls the exact fields you need out of a document and returns them as structured JSON. You define a schema describing the fields you want, and Extend returns each value with confidence scores and citations back to the source document. Use it to turn receipts, invoices, statements, contracts, and forms into reliable, machine-readable data.

How Extract works

Extract runs Parse under the hood: it first turns the document into clean, structured content, then uses AI to locate and return only the fields your schema defines, even across complex tables and layouts.

That targeting is what separates the two endpoints. Parse gives you everything in the document as structured chunks with positions and types — the right choice for RAG, document viewers, or feeding full context to an LLM. Extract gives you back only the specific fields you asked for, shaped like your schema.

Because Extract reads from what Parse produces, it can only return what Parse sees. If a value never makes it into the parsed content (an OCR miss or a mangled table, for example), no amount of schema tweaking will pull it out. When a field comes back empty or incorrect, first confirm that the value appears correctly in the Parse output. If it does not, the parse step is the problem, so adjust parseConfig; if it does, refine the schema.
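
The debugging order described above can be sketched as a small triage helper. This is only an illustration, not part of the Extend SDK: triage_field and its arguments are hypothetical names, and it assumes you have already joined the text of the Parse output into a single string.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def triage_field(expected_value: str, parsed_text: str) -> str:
    """Suggest where to look when a field comes back empty or wrong.

    Hypothetical helper: `parsed_text` is assumed to be the text of the
    Parse output, joined into one string by the caller.
    """
    if _normalize(expected_value) in _normalize(parsed_text):
        # Parse saw the value, so the schema (field names, descriptions)
        # is the more likely place to improve.
        return "refine schema"
    # Parse never produced the value, so no schema change can recover it.
    return "adjust parseConfig"


parsed = "Load No.   ABC-10025521\nShipper: Acme Manufacturing Co."
print(triage_field("ABC-10025521", parsed))  # value is present in the parsed text
print(triage_field("Northwind", parsed))     # value is missing from the parsed text
```

A plain substring check is deliberately crude; it will miss values that Parse read with small OCR differences, so treat a "missing" result as a prompt to inspect the parsed content by eye.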

Quick start

We’ll pull a few key fields from a bill of lading. For this quick start, we’ve uploaded the file to the public URL used in the request below.

A bill of lading document

Grab a key from the Developers page and store it as the EXTEND_API_KEY environment variable. If you’re using an SDK, see the installation instructions.

export EXTEND_API_KEY="your_api_key_here"

The /extract endpoint takes a file and a config with the schema you want to pull.

Python

import os
from extend_ai import Extend

client = Extend(token=os.environ["EXTEND_API_KEY"])

result = client.extract(
    file={
        "url": "https://extend-public-files.s3.us-east-2.amazonaws.com/bill-of-lading.pdf",
    },
    config={
        "schema": {
            "type": "object",
            "properties": {
                "load_number": {
                    "type": ["string", "null"],
                    "description": "The load number on the bill of lading.",
                },
                "shipper_name": {
                    "type": ["string", "null"],
                    "description": "The name of the shipper or origin company.",
                },
                "consignee_name": {
                    "type": ["string", "null"],
                    "description": "The name of the consignee or destination company.",
                },
                "ship_date": {
                    "type": ["string", "null"],
                    "description": "The date the shipment was picked up.",
                    "extend:type": "date",
                },
            },
        },
        "advancedOptions": {"citationsEnabled": True},
    },
)

print(result)

Want to extract from your own document? Upload it first, then pass the returned file id instead of a url (reusing the same config).

Python

# `client` is the client created in the quick start, and `config` is the
# same dict that was passed as config= there.
with open("bill_of_lading.pdf", "rb") as f:
    uploaded = client.files.upload(file=f)

result = client.extract(file={"id": uploaded.id}, config=config)

Example response

After you run the code snippet above, you’ll see a response like this. Extend parses the document, locates each field you described, and returns an output with two halves: value (your data, shaped like the schema) and metadata (per-field confidence and citations, keyed by the same field paths).

{
  "object": "extract_run",
  "id": "exr_3f1j6I1gsw5k96xFiCnkM",
  "status": "PROCESSED",
  "output": {
    "value": {
      "load_number": "ABC-10025521",
      "shipper_name": "Acme Manufacturing Co.",
      "consignee_name": "Northwind Distribution LLC",
      "ship_date": "2026-03-14"
    },
    "metadata": {
      "load_number": {
        "logprobsConfidence": 1,
        "ocrConfidence": 0.99,
        "citations": [
          {
            "page": { "number": 1, "width": 612, "height": 792 },
            "polygon": [
              { "x": 56.8, "y": 35.2 },
              { "x": 162.2, "y": 35.2 },
              { "x": 162.2, "y": 48.1 },
              { "x": 56.8, "y": 48.1 }
            ],
            "referenceText": "Load No. ABC-10025521"
          }
        ]
      },
      "shipper_name": { "logprobsConfidence": 0.98, "ocrConfidence": 0.97, "citations": [...] },
      "consignee_name": { "logprobsConfidence": 0.98, "ocrConfidence": 0.96, "citations": [...] },
      "ship_date": { "logprobsConfidence": 0.99, "ocrConfidence": 0.98, "citations": [...] }
    }
  }
}
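
To highlight a citation in a viewer, the polygon usually has to be turned into a rectangle relative to the page. The sketch below shows one way to do that with the citation from the example response. The polygon_to_bbox helper is hypothetical, and it assumes the polygon coordinates use the same units as the page width and height, as they appear to in the example.

```python
def polygon_to_bbox(citation: dict) -> dict:
    """Convert a citation polygon into a bounding box normalized to 0..1.

    Assumption: polygon points and page width/height share the same units.
    """
    page = citation["page"]
    xs = [point["x"] for point in citation["polygon"]]
    ys = [point["y"] for point in citation["polygon"]]
    return {
        "page": page["number"],
        "x_min": min(xs) / page["width"],
        "y_min": min(ys) / page["height"],
        "x_max": max(xs) / page["width"],
        "y_max": max(ys) / page["height"],
    }


# Citation copied from the example response.
citation = {
    "page": {"number": 1, "width": 612, "height": 792},
    "polygon": [
        {"x": 56.8, "y": 35.2},
        {"x": 162.2, "y": 35.2},
        {"x": 162.2, "y": 48.1},
        {"x": 56.8, "y": 48.1},
    ],
    "referenceText": "Load No. ABC-10025521",
}

bbox = polygon_to_bbox(citation)
print(bbox)
```

Normalized coordinates can be multiplied by the rendered page size in your viewer, so the same box works at any zoom level.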

Key fields

| Field | What it contains |
| --- | --- |
| output.value | Your extracted data, shaped exactly like the schema you defined. |
| output.metadata | Per-field details keyed by field path (for example line_items[0].description). |
| metadata[field].logprobsConfidence | Model confidence for the field, from 0 to 1. |
| metadata[field].ocrConfidence | OCR confidence for the underlying text, from 0 to 1. |
| metadata[field].citations | Bounding-box references back to the source document (when citationsEnabled). |

For full request/response details, see the Create Extract Run API reference.

Use the output

Read your data straight off output.value, and use output.metadata to trust high-confidence values and route the rest to review.

Python

# Read the extracted values, shaped like your schema
value = result.output.value
print("Load number:", value["load_number"])

# Use per-field OCR confidence to decide what to trust
for field, meta in result.output.metadata.items():
    if (meta.ocr_confidence or 0) < 0.9:
        print(f"Low confidence on {field}: route to review")

For the full shape, including confidence scores, citations, and insights, see Response Format.
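
If you work with the raw JSON response rather than SDK objects, the same routing idea can combine both confidence scores. The following is a minimal sketch under a few assumptions: fields_needing_review is a hypothetical helper, the 0.9 threshold is only a starting point, and the sample scores are made up for illustration.

```python
def fields_needing_review(metadata: dict, threshold: float = 0.9) -> list:
    """Return field paths whose lowest available confidence is below threshold.

    Works on the raw JSON `output.metadata` object (camelCase keys). A score
    that is absent is skipped rather than treated as zero, because
    logprobsConfidence is not always returned.
    """
    flagged = []
    for field, meta in metadata.items():
        scores = [
            meta[key]
            for key in ("logprobsConfidence", "ocrConfidence")
            if meta.get(key) is not None
        ]
        # No scores at all: be conservative and send the field to review.
        if not scores or min(scores) < threshold:
            flagged.append(field)
    return flagged


# Illustrative metadata; the values are invented for this example.
metadata = {
    "load_number": {"logprobsConfidence": 1, "ocrConfidence": 0.99},
    "shipper_name": {"logprobsConfidence": 0.98, "ocrConfidence": 0.85},
    "ship_date": {"ocrConfidence": 0.98},  # no logprobsConfidence returned
}
print(fields_needing_review(metadata))  # ['shipper_name']
```

Using the minimum of the two scores means a field is trusted only when both the model and the OCR are confident. Whether a missing score should pass or fail is a policy choice; this sketch lets it pass as long as at least one score is present.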

Sync vs async

The example above calls the synchronous /extract endpoint. For large files and high-volume use cases, use the asynchronous /extract_runs endpoint instead.

See Async Processing for the full comparison, polling options, and webhook setup.
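
As a rough illustration of the polling side of async processing, here is a generic wait loop. It does not call the Extend API: fetch_status is a caller-supplied function that you would back with your own status lookup, and the pending status names ("PENDING", "PROCESSING") are assumptions. Only "PROCESSED" appears in the example response on this page.

```python
import time
from typing import Callable


def wait_for_run(
    fetch_status: Callable[[], str],
    interval_s: float = 1.0,
    timeout_s: float = 300.0,
    pending: frozenset = frozenset({"PENDING", "PROCESSING"}),
) -> str:
    """Poll until the run leaves a pending state or the timeout expires.

    `fetch_status` returns the run's current status string. The names in
    `pending` are assumed, not taken from the API reference.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status not in pending:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still {status} after {timeout_s}s")
        time.sleep(interval_s)


# Stand-in for a real status lookup: PROCESSING twice, then PROCESSED.
fake_statuses = iter(["PROCESSING", "PROCESSING", "PROCESSED"])
print(wait_for_run(lambda: next(fake_statuses), interval_s=0.01))
```

For high volume, webhooks avoid the repeated requests that polling makes; the loop above is mainly useful for scripts and tests.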

Save it as a processor

The quick start runs with an inline config, which is perfect for getting started. To reuse a configuration across runs — and to version it, measure its accuracy, and optimize it — save it as an extractor, a kind of processor. Processors are the saved entities you iterate on in the dashboard, run evaluation sets against, and improve with Composer.

Configuration

The quick start sends just file and config.schema. To control how extraction runs, pass more options inside config. Here are the most commonly used ones; for the full reference, see Configuration.

Schema

The schema is the heart of every extraction — a JSON Schema describing the fields you want. Add array properties to pull repeating rows like line items, nest object properties for grouped data, and use extend:type for typed fields like dates and currency.

{
  "config": {
    "schema": {
      "type": "object",
      "properties": {
        "invoice_number": { "type": ["string", "null"], "description": "The invoice number." }
      }
    }
  }
}

Full schema reference →
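
To make the array and nested-object cases concrete, here is a sketch of a larger schema built as a Python dict. The field names and the nullable helper are hypothetical, and the use of the standard JSON Schema "number" type is an assumption; check the schema reference for the types Extend supports. The "date" value for extend:type is the one used in the quick start.

```python
import json
from typing import Optional


def nullable(json_type: str, description: str, extend_type: Optional[str] = None) -> dict:
    """Build a nullable field definition in the style of the quick start."""
    field = {"type": [json_type, "null"], "description": description}
    if extend_type is not None:
        field["extend:type"] = extend_type
    return field


schema = {
    "type": "object",
    "properties": {
        "invoice_number": nullable("string", "The invoice number."),
        # Typed field, using the extend:type value shown in the quick start.
        "invoice_date": nullable("string", "The date the invoice was issued.", "date"),
        # Nested object for grouped data.
        "vendor": {
            "type": "object",
            "properties": {
                "name": nullable("string", "The vendor's legal name."),
                "address": nullable("string", "The vendor's mailing address."),
            },
        },
        # Array of objects for repeating rows such as line items.
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": nullable("string", "The line item description."),
                    "quantity": nullable("number", "The quantity billed."),
                },
            },
        },
    },
}

print(json.dumps(schema, indent=2))
```

With a schema like this, per-field metadata would be keyed by paths such as line_items[0].description, matching the path format described under Key fields.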

Base processor

Choose the processor based on your accuracy and latency needs.

{ "config": { "baseProcessor": "extraction_performance" } }

| Processor | When to use |
| --- | --- |
| extraction_performance | Highest accuracy across complex layouts and tables (default). |
| extraction_light | Faster, cheaper extraction for simple documents. Does not return logprobsConfidence. |

Extraction rules

Steer the model with plain-language rules — useful for disambiguating fields, setting formats, or encoding business logic.

{
  "config": {
    "extractionRules": "If multiple totals appear, use the grand total. Return all dates in ISO 8601 format."
  }
}

Citations

Return a bounding-box citation and source text for each field, so you can highlight where every value came from.

{ "config": { "advancedOptions": { "citationsEnabled": true } } }

Generating citations uses an additional citation-focused model, which moderately increases latency. See Response Format.

Parse config

Because Extract runs Parse under the hood, you can tune how the document is parsed before extraction with parseConfig. Reach for this when a value isn’t being read correctly — for example, enabling agentic OCR for messy scans.

{
  "config": {
    "parseConfig": {
      "blockOptions": { "text": { "agentic": { "enabled": true } } }
    }
  }
}

For every option, including advanced options, the review agent, and Excel settings, see the Configuration reference.
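
The options in this section can be combined in one config. The sketch below assembles them with a small builder; build_config is a hypothetical helper, not an SDK function, and it only uses the keys shown in the examples on this page.

```python
from typing import Optional


def build_config(
    schema: dict,
    base_processor: Optional[str] = None,
    extraction_rules: Optional[list] = None,
    citations: bool = False,
    agentic_ocr: bool = False,
) -> dict:
    """Assemble a config dict from the options covered on this page."""
    config = {"schema": schema}
    if base_processor:
        config["baseProcessor"] = base_processor
    if extraction_rules:
        # extractionRules is a single plain-language string in the example,
        # so individual rules are joined with spaces.
        config["extractionRules"] = " ".join(extraction_rules)
    if citations:
        config["advancedOptions"] = {"citationsEnabled": True}
    if agentic_ocr:
        config["parseConfig"] = {
            "blockOptions": {"text": {"agentic": {"enabled": True}}}
        }
    return config


invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": ["string", "null"], "description": "The invoice number."}
    },
}

config = build_config(
    invoice_schema,
    base_processor="extraction_light",
    extraction_rules=[
        "If multiple totals appear, use the grand total.",
        "Return all dates in ISO 8601 format.",
    ],
    citations=True,
)
print(config)
```

The resulting dict has the same shape as the config argument in the quick start, so it could be passed to client.extract in the same way.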


Next steps

Configuration

Property types, schema generation, and the full configuration reference.

Response Format

The value and metadata objects, confidence scores, and citations.

Schema

Define objects, arrays, enums, and typed fields.

Best Practices

Field naming, prompt crafting, and accuracy tuning.