Splitting Overview | Extend Documentation

Splitting takes a single file that bundles many documents and breaks it into separate, typed sub-documents. You describe the document types you expect with splitClassifications, and Extend returns one entry per detected sub-document with its type, page range, and a standalone fileId you can feed into parse, extract, or a workflow. Use it for loan packages, claim files, closing binders, and any multi-document upload that needs to be separated before processing.

Split runs Parse under the hood, parsing the file first if it hasn’t been parsed already and reusing the existing parsed output if it has.

Quick start

We’ll split a Uniform Residential Loan Application (Form 1003) into its sections: Section 1 spans two pages, and the rest are single pages. For this quick start, the file is already hosted at the public URL used in the request below.

Grab a key from the Developers page and store it as the EXTEND_API_KEY environment variable. If you’re using an SDK, see the installation instructions.

$ export EXTEND_API_KEY="your_api_key_here"

The /split endpoint takes a file and a config with the splitClassifications you expect.

Python


from extend_ai import Extend

client = Extend()

# Keep the config in a variable so it can be reused for other files.
config = {
    "baseProcessor": "splitting_performance",
    "splitClassifications": [
        {
            "id": "section_1",
            "type": "section_1",
            "description": "Section 1, Borrower Information: personal details, current and previous employment, and income.",
        },
        {
            "id": "section_2",
            "type": "section_2",
            "description": "Section 2, Financial Information - Assets and Liabilities: bank and retirement accounts, other assets, liabilities, and expenses.",
        },
        {
            "id": "section_3",
            "type": "section_3",
            "description": "Section 3, Financial Information - Real Estate: properties owned and the mortgage loans on them.",
        },
        {
            "id": "other",
            "type": "other",
            "description": "Any other section of the loan application (loan and property information, declarations, acknowledgments, military service, demographic information, or loan originator details).",
        },
    ],
}

result = client.split(
    file={
        "url": "https://extend-public-files.s3.us-east-2.amazonaws.com/loan_application.pdf",
    },
    config=config,
)

print(result)

Want to split your own document? Upload it first, then pass the returned file id instead of a url (reusing the same config).

Python


# `client` and `config` are the client and the config dict from the quick start request.
with open("loan_application.pdf", "rb") as f:
    uploaded = client.files.upload(file=f)

result = client.split(file={"id": uploaded.id}, config=config)

Example response

After you run the code snippet above, you’ll see a response like this. Extend parses the document, finds each section, and returns an output.splits array — one entry per detected sub-document with its type, page range, and a fileId you can process further. (Truncated to the first two splits.)

{
  "object": "split_run",
  "id": "splr_Xj8mK2pL9nR4vT7qY5wZ",
  "status": "PROCESSED",
  "file": {
    "object": "file",
    "id": "file_GzKUy0VDhHscv7tweODYb",
    "name": "loan_application.pdf"
  },
  "output": {
    "splits": [
      {
        "id": "splt_xK9mLPqRtN3vS8wF5hB2cQ",
        "classificationId": "section_1",
        "type": "section_1",
        "startPage": 1,
        "endPage": 2,
        "identifier": "",
        "observation": "Pages 1-2 contain Section 1: Borrower Information.",
        "fileId": "file_8sLPqRtN3vS2wF5hB2cQ"
      },
      {
        "id": "splt_2pL9nR4vT7qY5wZj8mK2",
        "classificationId": "section_2",
        "type": "section_2",
        "startPage": 3,
        "endPage": 3,
        "identifier": "",
        "observation": "Page 3 contains Section 2: Financial Information - Assets and Liabilities.",
        "fileId": "file_R4vT7qY5wZj8mK2pL9nR"
      }
    ]
  }
}

Key fields

Field	What it contains
`output.splits`	One entry per detected sub-document.
`splits[].classificationId`	The `id` of the classification that matched. Branch your logic on this — it’s stable.
`splits[].type`	The document type, matching a classification you defined.
`splits[].startPage` / `endPage`	The 1-based page range of the sub-document.
`splits[].identifier`	The extracted identifier, when the classification set an `identifierKey`.
`splits[].fileId`	A standalone file for the sub-document, usable as input to other endpoints.

For full request/response details, see the Create Split Run API reference.
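As a rough sketch of how these fields fit together, the snippet below reads them from a response that has already been decoded into a plain Python dict (for example with `json.loads`). The sample data is copied from the truncated example response, and `summarize_splits` is a hypothetical helper name, not part of the Extend SDK.

```python
# Sample data copied from the truncated example response on this page.
# A real response would come from the API rather than being written by hand.
response = {
    "output": {
        "splits": [
            {
                "classificationId": "section_1",
                "type": "section_1",
                "startPage": 1,
                "endPage": 2,
                "identifier": "",
                "fileId": "file_8sLPqRtN3vS2wF5hB2cQ",
            },
            {
                "classificationId": "section_2",
                "type": "section_2",
                "startPage": 3,
                "endPage": 3,
                "identifier": "",
                "fileId": "file_R4vT7qY5wZj8mK2pL9nR",
            },
        ]
    }
}


def summarize_splits(response: dict) -> list[str]:
    """Return one readable line per split (hypothetical helper)."""
    lines = []
    for split in response["output"]["splits"]:
        # Page ranges are 1-based and inclusive, so pages 1-2 is two pages
        # and pages 3-3 is one page.
        page_count = split["endPage"] - split["startPage"] + 1
        lines.append(
            f'{split["classificationId"]}: pages '
            f'{split["startPage"]}-{split["endPage"]} '
            f'({page_count} page(s)) -> {split["fileId"]}'
        )
    return lines


for line in summarize_splits(response):
    print(line)
```

The page count uses `endPage - startPage + 1` because both ends of the range are included, as the single-page split (3 to 3) in the example response shows.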

Use the output

Walk output.splits to read each sub-document’s type and page range, and use its fileId to process the piece — for example, sending a specific section to Extract. Branch your logic on classificationId rather than type: the classificationId is the stable id you defined, while type and description are part of the prompt that steers the split and may change as you tune accuracy.

Python


for split in result.output.splits:
    print(f"{split.type}: pages {split.start_page}-{split.end_page}")

# Each split is a standalone file you can process further
for split in result.output.splits:
    if split.classification_id == "section_2":
        client.extract(file={"id": split.file_id}, config={"schema": {...}})

For the full shape, including every field on each split, see Response Format.
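When there are several classifications to branch on, a lookup table keyed by classificationId can be easier to maintain than a chain of `if` statements. The sketch below illustrates that pattern on plain dicts. The handler functions and the `file_example` id are hypothetical placeholders; in a real pipeline each handler might call Extract or start a workflow with the split's fileId.

```python
from typing import Callable


# Hypothetical handlers: each returns a description of the action it would
# take instead of calling the API, so the sketch runs without credentials.
def handle_borrower_info(file_id: str) -> str:
    return f"extract borrower fields from {file_id}"


def handle_assets(file_id: str) -> str:
    return f"extract assets and liabilities from {file_id}"


def handle_unmatched(file_id: str) -> str:
    return f"skip {file_id}"


# Keys are the classification ids defined in the request config.
HANDLERS: dict[str, Callable[[str], str]] = {
    "section_1": handle_borrower_info,
    "section_2": handle_assets,
}


def route_splits(splits: list[dict]) -> list[str]:
    """Dispatch every split to the handler registered for its classificationId."""
    actions = []
    for split in splits:
        # Anything without a registered handler (such as "other") falls through.
        handler = HANDLERS.get(split["classificationId"], handle_unmatched)
        actions.append(handler(split["fileId"]))
    return actions


splits = [
    {"classificationId": "section_1", "fileId": "file_8sLPqRtN3vS2wF5hB2cQ"},
    {"classificationId": "section_2", "fileId": "file_R4vT7qY5wZj8mK2pL9nR"},
    {"classificationId": "other", "fileId": "file_example"},  # made-up id
]

for action in route_splits(splits):
    print(action)
```

Because the table is keyed on the classification id rather than on type, retuning a type name or description does not require touching the routing code.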

Sync vs async

The example above calls the synchronous /split endpoint. For large files and high-volume use cases, use the asynchronous /split_runs endpoint instead.

See Async Processing for the full comparison, polling options, and webhook setup.
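The polling side of an asynchronous flow can be sketched generically. This example does not call the Extend API: `fetch_run` is a hypothetical callable that you would implement with the SDK or an HTTP client, and the only status value taken from this page is "PROCESSED". The "PROCESSING" status name is an assumption made for illustration, so check Async Processing for the real values and options.

```python
import time
from typing import Callable


def wait_for_split_run(
    fetch_run: Callable[[], dict],
    interval_seconds: float = 2.0,
    timeout_seconds: float = 300.0,
) -> dict:
    """Call fetch_run() repeatedly until the run leaves the in-progress state."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        run = fetch_run()
        # "PROCESSING" is an assumed name for the in-progress status.
        if run["status"] != "PROCESSING":
            return run
        if time.monotonic() >= deadline:
            raise TimeoutError("split run did not finish before the timeout")
        time.sleep(interval_seconds)


# Stand-in for a real API call: reports PROCESSING twice, then PROCESSED.
_fake_statuses = iter(["PROCESSING", "PROCESSING", "PROCESSED"])


def fake_fetch_run() -> dict:
    return {"status": next(_fake_statuses)}


final_run = wait_for_split_run(fake_fetch_run, interval_seconds=0.01)
print(final_run["status"])
```

Passing the fetch function in as an argument keeps the loop independent of any particular client, which also makes it easy to exercise with a fake like the one above. For production use, webhooks avoid polling altogether.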

Save it as a processor

The quick start runs with an inline config, which is perfect for getting started. To reuse a configuration across runs — and to version it, measure its accuracy, and optimize it — save it as a splitter, a kind of processor. Processors are the saved entities you iterate on in the dashboard, run evaluation sets against, and improve with Composer.

Configuration

The quick start sends a file and a config containing baseProcessor and splitClassifications. To control how splitting runs, pass more options inside config. The most commonly used options are described here; for the full reference, see Configuration.

Split classifications

The splitClassifications define the document types the splitter can assign. Provide at least one classification, and make sure at least one has the type "other" to act as a catch-all. The description on each classification is your biggest lever on accuracy.

{
  "config": {
    "splitClassifications": [
      { "id": "invoice", "type": "invoice", "description": "An invoice or bill for goods or services." },
      { "id": "other", "type": "other", "description": "Any other document type." }
    ]
  }
}
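The rules above can be checked locally before a request is sent. The following is a minimal sketch, not part of the Extend SDK: `validate_split_classifications` is a hypothetical helper, and flagging a missing description is this sketch's own lint rather than a documented API requirement.

```python
def validate_split_classifications(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    classifications = config.get("splitClassifications", [])

    if not classifications:
        problems.append("splitClassifications must contain at least one entry")

    # At least one classification must act as the "other" catch-all.
    if not any(c.get("type") == "other" for c in classifications):
        problems.append('no classification has the type "other"')

    # Assumption of this sketch: treat an empty description as a problem,
    # since the description is the main lever on accuracy.
    for c in classifications:
        if not c.get("description"):
            problems.append(f'classification {c.get("id")!r} has no description')

    return problems


good_config = {
    "splitClassifications": [
        {"id": "invoice", "type": "invoice", "description": "An invoice or bill for goods or services."},
        {"id": "other", "type": "other", "description": "Any other document type."},
    ]
}
bad_config = {
    "splitClassifications": [
        {"id": "invoice", "type": "invoice", "description": "An invoice."},
    ]
}

print(validate_split_classifications(good_config))
print(validate_split_classifications(bad_config))
```

Returning a list of messages instead of raising on the first failure lets a caller report every issue at once.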

Identifier keys

Add an identifierKey to a classification to extract a unique identifier (like an invoice number or borrower name) from each sub-document of that type. The value is returned in each split’s identifier, and the splitter uses it to decide when adjacent pages belong to the same document.

{
  "config": {
    "splitClassifications": [
      { "id": "invoice", "type": "invoice", "description": "An invoice.", "identifierKey": "The invoice number from the header." },
      { "id": "other", "type": "other", "description": "Any other document type." }
    ]
  }
}
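One way to use the returned identifier downstream is to group splits that share a value, for instance when the same invoice number shows up in more than one place in a bundle. The sketch below works on plain dicts. The invoice numbers and file ids are made-up sample values, and `group_by_identifier` is a hypothetical helper, not an SDK function.

```python
from collections import defaultdict


def group_by_identifier(splits: list[dict]) -> dict[str, list[str]]:
    """Map each non-empty identifier to the fileIds of the splits carrying it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for split in splits:
        identifier = split.get("identifier", "")
        # An empty string means no identifier was extracted for this split.
        if identifier:
            groups[identifier].append(split["fileId"])
    return dict(groups)


# Made-up sample splits for illustration only.
splits = [
    {"type": "invoice", "identifier": "INV-1001", "startPage": 1, "endPage": 2, "fileId": "file_a"},
    {"type": "other", "identifier": "", "startPage": 3, "endPage": 3, "fileId": "file_b"},
    {"type": "invoice", "identifier": "INV-1001", "startPage": 4, "endPage": 4, "fileId": "file_c"},
    {"type": "invoice", "identifier": "INV-1002", "startPage": 5, "endPage": 5, "fileId": "file_d"},
]

for identifier, file_ids in group_by_identifier(splits).items():
    print(identifier, file_ids)
```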

Split rules

Steer how the document is divided with plain-language splitRules — for example, keeping multi-page contracts together.

{ "config": { "splitRules": "Keep all pages of a signed contract together in a single split." } }

Base processor

Choose the splitting model based on your accuracy and latency needs.

{ "config": { "baseProcessor": "splitting_performance" } }

Processor	When to use
`splitting_performance`	Highest accuracy (default).
`splitting_light`	Faster and cheaper.

For every option, including advanced options and parse configuration, see the Configuration reference.
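To show how the options in this section combine into one config object, here is a small sketch that assembles them in Python. The `build_split_config` helper is hypothetical. It only uses the keys shown on this page (splitClassifications, splitRules, baseProcessor) and leaves out anything not supplied, so the API defaults would apply.

```python
import json
from typing import Optional


def build_split_config(
    classifications: list[dict],
    split_rules: Optional[str] = None,
    base_processor: Optional[str] = None,
) -> dict:
    """Assemble a split config dict from the options described on this page."""
    config: dict = {"splitClassifications": classifications}
    # Optional keys are added only when provided.
    if split_rules:
        config["splitRules"] = split_rules
    if base_processor:
        config["baseProcessor"] = base_processor
    return config


classifications = [
    {
        "id": "invoice",
        "type": "invoice",
        "description": "An invoice or bill for goods or services.",
        "identifierKey": "The invoice number from the header.",
    },
    {"id": "other", "type": "other", "description": "Any other document type."},
]

config = build_split_config(
    classifications,
    split_rules="Keep all pages of a signed contract together in a single split.",
    base_processor="splitting_light",
)
print(json.dumps(config, indent=2))
```

The resulting dict has the same shape as the inline config in the quick start, so it could be passed as the `config` argument there.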

Next steps

Configuration: Split classifications, identifier keys, rules, and the base processor.

Response Format: The full shape of the split run and the splits array.

Workflows: Route each split sub-document to the right processor automatically.

API Reference: Full request and response schema for the split endpoint.