# Overview

> Extract turns any document into structured JSON defined by a schema and backed by citations. Learn how it works, then run it in minutes.

**Extraction** pulls the exact fields you need out of a document and returns them as structured JSON. You define a `schema` describing the fields you want, and Extend returns each value with confidence scores and citations back to the source document. Use it to turn receipts, invoices, statements, contracts, and forms into reliable, machine-readable data.

## How Extract works

Extract runs [Parse](/parsing/overview) under the hood: it first turns the document into clean, structured content, then uses AI to locate and return only the fields your schema defines, even across complex tables and layouts.

That targeting is what separates the two endpoints. Parse gives you *everything* in the document as structured chunks with positions and types — the right choice for RAG, document viewers, or feeding full context to an LLM. Extract gives you back *only* the specific fields you asked for, shaped like your schema.

Because Extract reads from what Parse produces, **it can only return what Parse sees.** If a value never makes it into the parsed content (an OCR miss, say, or a mangled table), no amount of schema tweaking will pull it out. When a field comes back empty or incorrect, first confirm that the value appears correctly in the [Parse](/parsing/overview) output. If it does not, adjust `parseConfig`; if it does, refine the schema.

## Quick start

We'll pull a few key fields from a bill of lading. For this quick start, we've uploaded the file [here](https://extend-public-files.s3.us-east-2.amazonaws.com/bill-of-lading.pdf).

<img src="https://files.buildwithfern.com/extendconfig.docs.buildwithfern.com/fb1cab6f04d9964ae6b1411bd2bad11036ff2f7ca6bd7d316dffde967971e2f4/assets/images/quickstart/bill_of_lading_page_1.png" alt="A bill of lading document" decoding="async" />

Grab a key from the [Developers](https://dashboard.extend.ai/developers) page and store it as the `EXTEND_API_KEY` environment variable. If you're using an SDK, see the [installation instructions](/sdks).

```bash
export EXTEND_API_KEY="your_api_key_here"
```

The `/extract` endpoint takes a `file` and a `config` containing the `schema` that describes the fields you want to pull.

```python
from extend_ai import Extend

client = Extend()

result = client.extract(
    file={
        "url": "https://extend-public-files.s3.us-east-2.amazonaws.com/bill-of-lading.pdf",
    },
    config={
        "schema": {
            "type": "object",
            "properties": {
                "load_number": {
                    "type": ["string", "null"],
                    "description": "The load number on the bill of lading.",
                },
                "shipper_name": {
                    "type": ["string", "null"],
                    "description": "The name of the shipper or origin company.",
                },
                "consignee_name": {
                    "type": ["string", "null"],
                    "description": "The name of the consignee or destination company.",
                },
                "ship_date": {
                    "type": ["string", "null"],
                    "description": "The date the shipment was picked up.",
                    "extend:type": "date",
                },
            },
        },
        "advancedOptions": {"citationsEnabled": True},
    },
)

print(result)
```

```typescript
import { ExtendClient } from "extend-ai";

const client = new ExtendClient();

const result = await client.extract({
  file: {
    url: "https://extend-public-files.s3.us-east-2.amazonaws.com/bill-of-lading.pdf",
  },
  config: {
    schema: {
      type: "object",
      properties: {
        load_number: {
          type: ["string", "null"],
          description: "The load number on the bill of lading.",
        },
        shipper_name: {
          type: ["string", "null"],
          description: "The name of the shipper or origin company.",
        },
        consignee_name: {
          type: ["string", "null"],
          description: "The name of the consignee or destination company.",
        },
        ship_date: {
          type: ["string", "null"],
          description: "The date the shipment was picked up.",
          "extend:type": "date",
        },
      },
    },
    advancedOptions: { citationsEnabled: true },
  },
});

console.log(result);
```

```java
import ai.extend.ExtendClient;
import ai.extend.requests.ExtractRequest;
import ai.extend.types.ExtractAdvancedOptions;
import ai.extend.types.ExtractConfigJson;
import ai.extend.types.ExtractRequestFile;
import ai.extend.types.ExtractRun;
import ai.extend.types.FileFromUrl;
import java.util.List;
import java.util.Map;

ExtendClient client = ExtendClient.builder().build();

Map<String, Object> schema = Map.of(
    "type", "object",
    "properties", Map.of(
        "load_number", Map.of("type", List.of("string", "null"), "description", "The load number on the bill of lading."),
        "shipper_name", Map.of("type", List.of("string", "null"), "description", "The name of the shipper or origin company."),
        "consignee_name", Map.of("type", List.of("string", "null"), "description", "The name of the consignee or destination company."),
        "ship_date", Map.of("type", List.of("string", "null"), "description", "The date the shipment was picked up.", "extend:type", "date")));

ExtractRun result = client.extract(ExtractRequest.builder()
    .file(ExtractRequestFile.of(FileFromUrl.builder()
        .url("https://extend-public-files.s3.us-east-2.amazonaws.com/bill-of-lading.pdf")
        .build()))
    .config(ExtractConfigJson.builder()
        .schema(schema)
        .advancedOptions(ExtractAdvancedOptions.builder().citationsEnabled(true).build())
        .build())
    .build());

System.out.println(result);
```

```go
package main

import (
	"context"
	"fmt"
	"log"

	extend "github.com/extend-hq/extend-go-sdk"
	client "github.com/extend-hq/extend-go-sdk/client"
)

func main() {
	c := client.NewClient()

	config := &extend.ExtractConfigJSON{
		Schema: map[string]any{
			"type": "object",
			"properties": map[string]any{
				"load_number":    map[string]any{"type": []string{"string", "null"}, "description": "The load number on the bill of lading."},
				"shipper_name":   map[string]any{"type": []string{"string", "null"}, "description": "The name of the shipper or origin company."},
				"consignee_name": map[string]any{"type": []string{"string", "null"}, "description": "The name of the consignee or destination company."},
				"ship_date":      map[string]any{"type": []string{"string", "null"}, "description": "The date the shipment was picked up.", "extend:type": "date"},
			},
		},
		AdvancedOptions: &extend.ExtractAdvancedOptions{
			CitationsEnabled: extend.Bool(true),
		},
	}

	result, err := c.Extract(context.TODO(), &extend.ExtractRequest{
		File: &extend.ExtractRequestFile{
			FileFromURL: &extend.FileFromURL{
				URL: "https://extend-public-files.s3.us-east-2.amazonaws.com/bill-of-lading.pdf",
			},
		},
		Config: config,
	})
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(result)
}
```

```bash
curl -X POST https://api.extend.ai/extract \
  -H "x-extend-api-version: 2026-02-09" \
  -H "Authorization: Bearer $EXTEND_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file": {
      "url": "https://extend-public-files.s3.us-east-2.amazonaws.com/bill-of-lading.pdf"
    },
    "config": {
      "schema": {
        "type": "object",
        "properties": {
          "load_number": { "type": ["string", "null"], "description": "The load number on the bill of lading." },
          "shipper_name": { "type": ["string", "null"], "description": "The name of the shipper or origin company." },
          "consignee_name": { "type": ["string", "null"], "description": "The name of the consignee or destination company." },
          "ship_date": { "type": ["string", "null"], "description": "The date the shipment was picked up.", "extend:type": "date" }
        }
      },
      "advancedOptions": { "citationsEnabled": true }
    }
  }'
```

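Want to extract from your own document? [Upload it](/api-reference/endpoints/file/upload-file) first, then pass the returned file `id` instead of a `url`. The snippets below reuse the `client` from the quick start and assume the quick start's `config` is stored in a variable named `config`.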

```python
with open("bill_of_lading.pdf", "rb") as f:
    uploaded = client.files.upload(file=f)

result = client.extract(file={"id": uploaded.id}, config=config)
```

```typescript
import { createReadStream } from "fs";

const uploaded = await client.files.upload(createReadStream("bill_of_lading.pdf"), {});

const result = await client.extract({ file: { id: uploaded.id }, config });
```

```java
import ai.extend.requests.FilesUploadRequest;
import ai.extend.types.File;
import ai.extend.types.FileFromId;

File uploaded = client.files().upload(
    new java.io.File("bill_of_lading.pdf"),
    FilesUploadRequest.builder().build());

ExtractRun result = client.extract(ExtractRequest.builder()
    .file(ExtractRequestFile.of(FileFromId.builder().id(uploaded.getId()).build()))
    .config(config) // the ExtractConfigJson from above
    .build());
```

```go
f, err := os.Open("bill_of_lading.pdf")
if err != nil {
	log.Fatal(err)
}
defer f.Close()

uploaded, err := c.Files.Upload(context.TODO(), f, &extend.FilesUploadRequest{})
if err != nil {
	log.Fatal(err)
}

result, err := c.Extract(context.TODO(), &extend.ExtractRequest{
	File:   &extend.ExtractRequestFile{FileFromID: &extend.FileFromID{ID: uploaded.ID}},
	Config: config, // the *extend.ExtractConfigJSON from above
})
```

```bash
# Upload the file to get an id
curl -X POST https://api.extend.ai/files/upload \
  -H "Authorization: Bearer $EXTEND_API_KEY" \
  -H "x-extend-api-version: 2026-02-09" \
  -F "file=@bill_of_lading.pdf"

# Then extract using the returned id (fill in "properties" with the quick-start schema)
curl -X POST https://api.extend.ai/extract \
  -H "Authorization: Bearer $EXTEND_API_KEY" \
  -H "x-extend-api-version: 2026-02-09" \
  -H "Content-Type: application/json" \
  -d '{ "file": { "id": "file_xK9mLPqRtN3vS8wF5hB2cQ" }, "config": { "schema": { "type": "object", "properties": {} } } }'
```

### Example response

After you run the code snippet above, you'll see a response like this. Extend parses the document, locates each field you described, and returns an `output` with two halves: `value` (your data, shaped like the schema) and `metadata` (per-field confidence and citations, keyed by the same field paths).

```json
{
  "object": "extract_run",
  "id": "exr_3f1j6I1gsw5k96xFiCnkM",
  "status": "PROCESSED",
  "output": {
    "value": {
      "load_number": "ABC-10025521",
      "shipper_name": "Acme Manufacturing Co.",
      "consignee_name": "Northwind Distribution LLC",
      "ship_date": "2026-03-14"
    },
    "metadata": {
      "load_number": {
        "logprobsConfidence": 1,
        "ocrConfidence": 0.99,
        "citations": [
          {
            "page": { "number": 1, "width": 612, "height": 792 },
            "polygon": [
              { "x": 56.8, "y": 35.2 },
              { "x": 162.2, "y": 35.2 },
              { "x": 162.2, "y": 48.1 },
              { "x": 56.8, "y": 48.1 }
            ],
            "referenceText": "Load No. ABC-10025521"
          }
        ]
      },
      "shipper_name": { "logprobsConfidence": 0.98, "ocrConfidence": 0.97, "citations": [...] },
      "consignee_name": { "logprobsConfidence": 0.98, "ocrConfidence": 0.96, "citations": [...] },
      "ship_date": { "logprobsConfidence": 0.99, "ocrConfidence": 0.98, "citations": [...] }
    }
  }
}
```

### Key fields

| Field                                | What it contains                                                                 |
| ------------------------------------ | -------------------------------------------------------------------------------- |
| `output.value`                       | Your extracted data, shaped exactly like the `schema` you defined.               |
| `output.metadata`                    | Per-field details keyed by field path (for example `line_items[0].description`). |
| `metadata[field].logprobsConfidence` | Model confidence for the field, from `0` to `1`.                                 |
| `metadata[field].ocrConfidence`      | OCR confidence for the underlying text, from `0` to `1`.                         |
| `metadata[field].citations`          | Bounding-box references back to the source document (when `citationsEnabled`).   |

For full request/response details, see the [Create Extract Run API reference](/api-reference/endpoints/extract/create-extract-run).
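
Because `output.metadata` is keyed by field path, pairing a metadata entry with its value means resolving a path such as `line_items[0].description` against `output.value`. The following is a minimal sketch, assuming paths contain only dot-separated keys and `[n]` list indexes, as in the table above. `get_by_path` is a hypothetical helper rather than part of the SDK, it works on the plain-dict form of the response, and the `line_items` sample data is made up for illustration.

```python
import re

# Matches either a key segment or a [n] index segment of a field path.
_SEGMENT = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def get_by_path(value, path):
    """Resolve a field path such as 'line_items[0].description' against output.value."""
    current = value
    for key, index in _SEGMENT.findall(path):
        # An index segment selects a list element; a key segment selects a dict entry.
        current = current[int(index)] if index else current[key]
    return current


# Made-up sample data in the shape of output.value.
output_value = {
    "load_number": "ABC-10025521",
    "line_items": [{"description": "Pallets of widgets"}],
}

print(get_by_path(output_value, "load_number"))
print(get_by_path(output_value, "line_items[0].description"))
```

Iterating over `output.metadata` and calling the helper with each key gives you the value, confidence scores, and citations for a field side by side.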

### Use the output

Read your data straight off `output.value`, and use `output.metadata` to trust high-confidence values and route the rest to review.

```python
# Read the extracted values, shaped like your schema
value = result.output.value
print("Load number:", value["load_number"])

# Use per-field OCR confidence to decide what to trust
for field, meta in result.output.metadata.items():
    if (meta.ocr_confidence or 0) < 0.9:
        print(f"Low confidence on {field} — route to review")
```

```typescript
import { Extend } from "extend-ai";

const output = result.output as Extend.ExtractOutputJson;

// Read the extracted values, shaped like your schema
console.log("Load number:", output.value.load_number);

// Use per-field OCR confidence to decide what to trust
for (const [field, meta] of Object.entries(output.metadata)) {
  if ((meta?.ocrConfidence ?? 0) < 0.9) {
    console.log(`Low confidence on ${field} — route to review`);
  }
}
```

```java
import ai.extend.types.ExtractOutputJson;

ExtractOutputJson output = (ExtractOutputJson) result.getOutput().get().get();

// Read the extracted values, shaped like your schema
System.out.println("Load number: " + output.getValue().get("load_number"));

// Use per-field OCR confidence to decide what to trust
output.getMetadata().forEach((field, meta) -> {
    if (meta.getOcrConfidence().orElse(0.0) < 0.9) {
        System.out.println("Low confidence on " + field + " — route to review");
    }
});
```

```go
output := result.Output.GetExtractOutputJSON()

// Read the extracted values, shaped like your schema
fmt.Println("Load number:", output.Value["load_number"])

// Use per-field OCR confidence to decide what to trust
for field, meta := range output.Metadata {
	if meta.OcrConfidence == nil || *meta.OcrConfidence < 0.9 {
		fmt.Printf("Low confidence on %s — route to review\n", field)
	}
}
```

For the full shape, including confidence scores, citations, and insights, see [Response Format](/extraction/response-format).

## Schema-less extraction (infer schema at runtime)

Don't have a schema yet? You can omit `config` entirely (or pass `config` without a `schema`) and Extend will automatically infer a schema from the document before extracting. This is useful for exploring unfamiliar document types, prototyping, or any case where you want structured data without defining the shape upfront.

```bash
curl -X POST https://api.extend.ai/extract_runs \
  -H "x-extend-api-version: 2026-02-09" \
  -H "Authorization: Bearer $EXTEND_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file": { "url": "https://extend-public-files.s3.us-east-2.amazonaws.com/bill-of-lading.pdf" }
  }'
```

Use `extractionRules` to guide what the inferred schema focuses on:

```python
result = client.extract_runs.create(
    file={
        "url": "https://extend-public-files.s3.us-east-2.amazonaws.com/bill-of-lading.pdf",
    },
    config={
        "extractionRules": "This is a freight bill of lading. We are most concerned with shipment tracking and logistics fields: shipper name and address, consignee name and address, carrier name, PRO number, load number, pickup date, delivery date, and total weight.",
    },
)
```

```typescript
const run = await client.extractRuns.create({
  file: {
    url: "https://extend-public-files.s3.us-east-2.amazonaws.com/bill-of-lading.pdf",
  },
  config: {
    extractionRules: "This is a freight bill of lading. We are most concerned with shipment tracking and logistics fields: shipper name and address, consignee name and address, carrier name, PRO number, load number, pickup date, delivery date, and total weight.",
  },
});
```

```java
ExtractRun run = client.extractRuns().create(ExtractRunsCreateRequest.builder()
    .file(ExtractRunsCreateRequestFile.of(FileFromUrl.builder()
        .url("https://extend-public-files.s3.us-east-2.amazonaws.com/bill-of-lading.pdf")
        .build()))
    .config(ExtractConfigJson.builder()
        .extractionRules("This is a freight bill of lading. We are most concerned with shipment tracking and logistics fields: shipper name and address, consignee name and address, carrier name, PRO number, load number, pickup date, delivery date, and total weight.")
        .build())
    .build());
```

```go
run, err := c.ExtractRuns.Create(context.TODO(), &extend.ExtractRunCreateRequest{
	File: &extend.ExtractRunCreateRequestFile{
		FileFromURL: &extend.FileFromURL{
			URL: "https://extend-public-files.s3.us-east-2.amazonaws.com/bill-of-lading.pdf",
		},
	},
	Config: &extend.ExtractConfigJSON{
		ExtractionRules: extend.String("This is a freight bill of lading. We are most concerned with shipment tracking and logistics fields: shipper name and address, consignee name and address, carrier name, PRO number, load number, pickup date, delivery date, and total weight."),
	},
})
```

```bash
curl -X POST https://api.extend.ai/extract_runs \
  -H "x-extend-api-version: 2026-02-09" \
  -H "Authorization: Bearer $EXTEND_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "file": { "url": "https://extend-public-files.s3.us-east-2.amazonaws.com/bill-of-lading.pdf" },
    "config": { "extractionRules": "This is a freight bill of lading. We are most concerned with shipment tracking and logistics fields: shipper name and address, consignee name and address, carrier name, PRO number, load number, pickup date, delivery date, and total weight." }
  }'
```

Once complete, `config.schema` in the response reflects the inferred schema so you can inspect or save it.
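
To keep the inferred schema for later runs, you can write it to disk. The following is a minimal sketch, assuming the response is available as a plain dict (for example, parsed from the curl output). `save_inferred_schema` and the sample response fragment are illustrative, not SDK API, and a real inferred schema will contain different fields.

```python
import json
import tempfile
from pathlib import Path


def save_inferred_schema(run, path):
    """Write run['config']['schema'] to a JSON file and return the schema."""
    schema = run["config"]["schema"]
    Path(path).write_text(json.dumps(schema, indent=2), encoding="utf-8")
    return schema


# Illustrative response fragment; only the keys used here are shown.
example_run = {
    "status": "PROCESSED",
    "config": {
        "schema": {
            "type": "object",
            "properties": {
                "load_number": {"type": ["string", "null"]},
            },
        }
    },
}

# A temporary directory keeps the example free of side effects.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "inferred_schema.json"
    save_inferred_schema(example_run, target)
    print(target.read_text(encoding="utf-8"))
```

The saved file can then be loaded and passed as `config.schema` on later requests, so every run uses the same shape.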

`file.base64` is not supported when using an inferred schema. Use a `url`, a previously-uploaded file `id`, or raw `text` instead.

## Sync vs async

The quick start above calls the synchronous `/extract` endpoint. For large files and high-volume workloads, use the asynchronous `/extract_runs` endpoint instead.

See [Async Processing](/general/async-processing) for the full comparison, polling options, and webhook setup.
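
If you poll rather than use webhooks, the loop looks roughly like the sketch below. It rests on several assumptions: `fetch_run` is a hypothetical callable you supply that wraps whatever get-run call your SDK or HTTP client provides and returns a dict with a `status` key, and the in-progress status names `PENDING` and `PROCESSING` are guesses. Only `PROCESSED` appears in the example response on this page, so check the Async Processing page for the real status values.

```python
import time


def wait_for_run(fetch_run, run_id, interval_s=2.0, timeout_s=300.0, sleep=time.sleep):
    """Poll until the run leaves the in-progress states or the timeout elapses."""
    in_progress = {"PENDING", "PROCESSING"}  # assumed status names
    deadline = time.monotonic() + timeout_s
    while True:
        run = fetch_run(run_id)
        if run["status"] not in in_progress:
            return run  # finished, successfully or not; the caller inspects status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run {run_id} still {run['status']} after {timeout_s}s")
        sleep(interval_s)


# Stand-in for a real API call: reports PROCESSING twice, then PROCESSED.
_responses = iter(["PROCESSING", "PROCESSING", "PROCESSED"])


def fake_fetch(run_id):
    return {"id": run_id, "status": next(_responses)}


final = wait_for_run(fake_fetch, "exr_example", interval_s=0, sleep=lambda s: None)
print(final["status"])
```

Passing the fetch function and `sleep` as arguments keeps the loop independent of any particular client and easy to test without network access.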

## Save it as a processor

The quick start runs with an inline `config`, which is perfect for getting started. To reuse a configuration across runs — and to version it, measure its accuracy, and optimize it — save it as an **extractor**, a kind of [processor](/evaluation/processors). Processors are the saved entities you iterate on in the dashboard, run [evaluation sets](/evaluation/overview) against, and improve with [Composer](/optimization/composer).

## Configuration

The quick start sends `file`, `config.schema`, and a single advanced option (`citationsEnabled`). To control how extraction runs, pass more options inside `config`. Here are the most commonly used ones; for the full reference, see [Configuration](/extraction/configuring-an-extractor).

### Schema

The `schema` is the heart of every extraction — a JSON Schema describing the fields you want. Add `array` properties to pull repeating rows like line items, nest `object` properties for grouped data, and use `extend:type` for typed fields like dates and currency.

```json
{
  "config": {
    "schema": {
      "type": "object",
      "properties": {
        "invoice_number": { "type": ["string", "null"], "description": "The invoice number." }
      }
    }
  }
}
```

[Full schema reference →](/extraction/schema)
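
As a sketch of how arrays, nested objects, and typed fields combine, the snippet below builds a larger schema as a Python dict. The field names, the `nullable` helper, and the use of the `number` type are illustrative assumptions; the schema reference is the authority on which types and `extend:type` values are supported.

```python
import json


def nullable(json_type, description, extra=None):
    """Build a property that may be null when the document lacks the value."""
    prop = {"type": [json_type, "null"], "description": description}
    prop.update(extra or {})
    return prop


schema = {
    "type": "object",
    "properties": {
        "invoice_number": nullable("string", "The invoice number."),
        # Typed field: a string carrying the date annotation used in the quick start.
        "invoice_date": nullable(
            "string", "The date the invoice was issued.", {"extend:type": "date"}
        ),
        # Nested object for grouped data.
        "vendor": {
            "type": "object",
            "properties": {
                "name": nullable("string", "The vendor's name."),
                "address": nullable("string", "The vendor's mailing address."),
            },
        },
        # Array for repeating rows such as line items.
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": nullable("string", "The line item description."),
                    "quantity": nullable("number", "The quantity billed."),
                },
            },
        },
    },
}

print(json.dumps({"config": {"schema": schema}}, indent=2))
```

With a schema like this, metadata for a nested row would be keyed by paths such as `line_items[0].description`.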

### Base processor

Choose the processor based on your accuracy and latency needs.

```json
{ "config": { "baseProcessor": "extraction_performance" } }
```

| Processor                | When to use                                                                            |
| ------------------------ | -------------------------------------------------------------------------------------- |
| `extraction_performance` | Highest accuracy across complex layouts and tables (default).                          |
| `extraction_light`       | Faster, cheaper extraction for simple documents. Does not return `logprobsConfidence`. |
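
Since `extraction_light` omits `logprobsConfidence`, review-routing code should tolerate missing scores. The following is a minimal sketch, assuming metadata entries are plain dicts shaped like the example response; `needs_review` is a hypothetical helper and the `0.9` threshold is only the illustrative value used earlier on this page.

```python
def needs_review(meta, threshold=0.9):
    """Return True when any available confidence score is below the threshold.

    Missing scores are skipped; a field with no scores at all goes to review.
    """
    scores = [
        meta[key]
        for key in ("logprobsConfidence", "ocrConfidence")
        if meta.get(key) is not None
    ]
    return not scores or min(scores) < threshold


# Made-up metadata entries covering the three cases.
metadata = {
    "load_number": {"logprobsConfidence": 1, "ocrConfidence": 0.99},
    "ship_date": {"ocrConfidence": 0.82},  # no logprobsConfidence present
    "shipper_name": {},  # no scores at all
}

for field, meta in metadata.items():
    print(field, "review" if needs_review(meta) else "ok")
```

Treating a field with no scores as needing review is a conservative choice; you may prefer to trust such fields if your documents are simple.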

### Extraction rules

Steer the model with plain-language rules — useful for disambiguating fields, setting formats, or encoding business logic.

```json
{
  "config": {
    "extractionRules": "If multiple totals appear, use the grand total. Return all dates in ISO 8601 format."
  }
}
```

### Citations

Return a bounding-box citation and source text for each field, so you can highlight where every value came from.

```json
{ "config": { "advancedOptions": { "citationsEnabled": true } } }
```

Generating citations uses an additional citation-focused model, which moderately increases latency. See [Response Format](/extraction/response-format#citations) for details.
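
To draw a highlight in a viewer, it often helps to reduce a citation polygon to a bounding box expressed as fractions of the page. The following is a minimal sketch using the citation from the example response. It assumes polygon coordinates share units with the page `width` and `height` and that the origin is the top-left corner, neither of which this page confirms; `polygon_to_bbox` is a hypothetical helper.

```python
def polygon_to_bbox(citation):
    """Return (left, top, right, bottom) as fractions of the page size."""
    page = citation["page"]
    xs = [point["x"] for point in citation["polygon"]]
    ys = [point["y"] for point in citation["polygon"]]
    return (
        min(xs) / page["width"],
        min(ys) / page["height"],
        max(xs) / page["width"],
        max(ys) / page["height"],
    )


# The load_number citation from the example response.
citation = {
    "page": {"number": 1, "width": 612, "height": 792},
    "polygon": [
        {"x": 56.8, "y": 35.2},
        {"x": 162.2, "y": 35.2},
        {"x": 162.2, "y": 48.1},
        {"x": 56.8, "y": 48.1},
    ],
    "referenceText": "Load No. ABC-10025521",
}

left, top, right, bottom = polygon_to_bbox(citation)
print(f"{left:.3f} {top:.3f} {right:.3f} {bottom:.3f}")
```

Fractions scale to any rendered page size: multiply by the pixel width and height of the rendered page to position the overlay.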

### Parse config

Because Extract runs Parse under the hood, you can tune how the document is parsed before extraction with `parseConfig`. Reach for this when a value isn't being read correctly — for example, enabling agentic OCR for messy scans.

```json
{
  "config": {
    "parseConfig": {
      "blockOptions": { "text": { "agentic": { "enabled": true } } }
    }
  }
}
```

For every option, including advanced options, the review agent, and Excel settings, see the [Configuration](/extraction/configuring-an-extractor) reference.

***

## Next steps

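- [Configuration](/extraction/configuring-an-extractor): Property types, schema generation, and the full configuration reference.
- [Response Format](/extraction/response-format): The value and metadata objects, confidence scores, and citations.
- [Schema](/extraction/schema): Define objects, arrays, enums, and typed fields.
- Field naming, prompt crafting, and accuracy tuning.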