Parse File

The Parse endpoint converts documents into structured, machine-readable formats with fine-grained control over the parsing process. It is ideal for extracting cleaned document content to use as context for downstream processing, such as RAG pipelines, custom ingestion pipelines, embeddings, and classification.

For a deeper guide on how to use the output of this endpoint, jump to Using Parsed Output.

Choosing a target? See Markdown vs Spatial for guidance.

Node.js

const axios = require("axios");

const parseDocument = async () => {
  try {
    const response = await axios.post(
      "https://api.extend.ai/parse",
      {
        file: {
          fileName: "example.pdf",
          fileUrl: "https://example.com/documents/example.pdf",
        },
        config: {
          target: "markdown",
          chunkingStrategy: {
            type: "page",
          },
          blockOptions: {
            figures: {
              enabled: true,
              figureImageClippingEnabled: true,
            },
            tables: {
              tableHeaderContinuationEnabled: false,
            },
            text: {
              signatureDetectionEnabled: true,
            },
          },
        },
      },
      {
        headers: {
          Authorization: "Bearer <API_TOKEN>",
          "Content-Type": "application/json",
        },
      }
    );

    console.log("Document parsed successfully:", response.data);
  } catch (error) {
    console.error("Error:", error.response?.data || error.message);
  }
};

parseDocument();

Using Parsed Output

The Parse API returns document content in a structured format that provides both high-level formatted content and detailed block-level information. Understanding how to work with this output will help you get the most value from the parsing service.

Working with Chunks

Each chunk (a page, a section, or the whole document, depending on chunkingStrategy.type) contains two key properties:

content: A fully formatted representation of the entire chunk in the target format (e.g., markdown). This is ready to use as-is if you need the complete formatted content of a page.
blocks: An array of individual content blocks that make up the chunk, each with its own formatting, position information, and metadata.

When to use `chunk.content` vs. `chunk.blocks`

Use chunk.content when:
- You need the complete, properly formatted content of a page, with blocks already placed in logical order (e.g. markdown sections grouped, or text positioned spatially)
- You want to display or process the document content as a whole (and can just combine all chunk.content values)
- You’re integrating with systems that expect formatted text (e.g., markdown processors)
Use chunk.blocks when:
- You need to work with specific elements of the document (e.g., only tables or figures)
- You need spatial information about where content appears on the page, perhaps to build citation systems
- You’re building a UI that shows or highlights specific document elements
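
As a minimal sketch of the "combine all chunk.content values" case, the helper below joins each chunk's formatted content into one string. It assumes the response exposes a `chunks` array with a `content` string on each chunk, as in the other examples in this guide; `joinChunkContent` and the blank-line separator are illustrative choices, not part of the API.

```javascript
// Hypothetical helper: concatenates chunk.content values in document order.
// Assumes parseResult.chunks is an array of { content: string, ... }.
function joinChunkContent(parseResult, separator = "\n\n") {
  const chunks = parseResult?.chunks ?? [];
  return chunks
    .map((chunk) => chunk.content ?? "")
    .filter((content) => content.length > 0) // skip empty chunks
    .join(separator);
}

// Hand-written stand-in for a parse response, for illustration only.
const sample = {
  chunks: [
    { content: "# Page one", blocks: [] },
    { content: "Body text on page two.", blocks: [] },
  ],
};

console.log(joinChunkContent(sample));
```

If you need anything finer-grained than whole-chunk text, work from `chunk.blocks` instead.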

Example: Extracting specific content types

// Extract all tables from a document
function extractTables(parseResult) {
  const tables = [];
  
  parseResult.chunks.forEach(chunk => {
    chunk.blocks.forEach(block => {
      if (block.type === 'table') {
        tables.push({
          content: block.content,
          pageNumber: block.metadata.pageNumber,
          position: block.boundingBox
        });
      }
    });
  });
  
  return tables;
}

// Extract all figures with their images
function extractFigures(parseResult) {
  const figures = [];
  
  parseResult.chunks.forEach(chunk => {
    chunk.blocks.forEach(block => {
      // Optional chaining skips figures that carry no details object
      if (block.type === 'figure' && block.details?.imageUrl) {
        figures.push({
          caption: block.content,
          imageUrl: block.details.imageUrl,
          figureType: block.details.figureType,
          pageNumber: block.metadata.pageNumber
        });
      }
    });
  });
  
  return figures;
}

Example: Reconstructing content with custom formatting

// Extract headings and their content to create a table of contents
function createTableOfContents(parseResult) {
  const toc = [];
  
  parseResult.chunks.forEach(chunk => {
    chunk.blocks.forEach(block => {
      if (block.type === 'heading' || block.type === 'section_heading') {
        toc.push({
          title: block.content,
          pageNumber: block.metadata.pageNumber
        });
      }
    });
  });
  
  return toc;
}

Spatial Information

Each block contains spatial information in the form of a polygon (precise outline) and a simplified boundingBox. This information can be used to:

Highlight specific content in a document viewer
Create visual overlays on top of the original document
Understand the reading order and layout of the document

// Create highlight coordinates for a document viewer
function createHighlights(parseResult, searchTerm) {
  const highlights = [];
  
  parseResult.chunks.forEach(chunk => {
    chunk.blocks.forEach(block => {
      if (block.type === 'text' && block.content.includes(searchTerm)) {
        highlights.push({
          pageNumber: block.metadata.pageNumber,
          boundingBox: block.boundingBox
        });
      }
    });
  });
  
  return highlights;
}
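
A document viewer usually draws overlays one page at a time, so it can help to group highlights by page first. The sketch below assumes entries shaped like `{ pageNumber, boundingBox }`; `groupHighlightsByPage` is a hypothetical helper, and the bounding boxes are passed through untouched because their exact shape is not assumed here.

```javascript
// Hypothetical helper: groups highlight entries by page so a viewer can
// render one page's overlays at a time. boundingBox values are treated as
// opaque and simply collected per page.
function groupHighlightsByPage(highlights) {
  const byPage = new Map();
  for (const { pageNumber, boundingBox } of highlights) {
    if (!byPage.has(pageNumber)) {
      byPage.set(pageNumber, []);
    }
    byPage.get(pageNumber).push(boundingBox);
  }
  return byPage;
}

// Placeholder bounding boxes, for illustration only.
const grouped = groupHighlightsByPage([
  { pageNumber: 2, boundingBox: "box-a" },
  { pageNumber: 1, boundingBox: "box-b" },
  { pageNumber: 2, boundingBox: "box-c" },
]);

console.log(grouped.get(2).length); // prints 2
```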

By leveraging both the formatted content and the structured block information, you can build powerful document processing workflows that combine the convenience of formatted text with the precision of block-level access.

Markdown vs Spatial

markdown: Clean, logical reading order using true markdown constructs (headings, lists, tables, checkboxes). Supports section-aware chunking and works best for LLMs and RAG.
spatial: Layout/position-preserving text that uses markdown elements for block types (e.g. tables, checkboxes) but is not strictly markdown due to tabs/whitespace used to preserve placement. Chunks are page-based only.
When to choose markdown:
- Default choice for most documents
- Better for multi‑column layouts (content is linearized into readable paragraphs)
- Enables logical section chunking for improved retrieval
When to choose spatial:
- Very messy, scanned, or handwritten docs (e.g. healthcare notes, skewed scans)
- You need a near 1:1 text representation of the original layout
- You rely on spatial consistency or vector/distance‑based clustering across documents
- You are parsing bills of lading (BOLs) or similar scanned logistics documents, which often parse better with spatial

Tip: If unsure, start with markdown. Switch to spatial if you need layout fidelity or encounter scanned/skewed inputs where reading order is unreliable.
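
The guidance above can be restated as a small decision helper. This is only a sketch: `chooseTarget` and its trait names are hypothetical and not API fields, and the only values it returns are the two documented targets.

```javascript
// Hypothetical heuristic that restates the guidance above as code.
// The trait names are illustrative; they are not part of the API.
function chooseTarget(traits = {}) {
  const {
    messyScanOrHandwritten = false,
    needsLayoutFidelity = false,
    reliesOnSpatialConsistency = false,
  } = traits;

  const preferSpatial =
    messyScanOrHandwritten || needsLayoutFidelity || reliesOnSpatialConsistency;

  // markdown is the default choice unless a spatial trait applies.
  return preferSpatial ? "spatial" : "markdown";
}

console.log(chooseTarget()); // prints "markdown"
console.log(chooseTarget({ needsLayoutFidelity: true })); // prints "spatial"
```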

Configuration Options

The Parse API accepts a config object that controls how documents are parsed and processed. Configuration options are organized into several categories:

Target Format: Output format for parsed content (markdown vs spatial)
Chunking Strategy: How the document is divided into chunks
Block Options: Fine-grained control over parsing specific layout types
Advanced Options: OCR enhancements and page filtering

Target Format

`target`

Type: "markdown" | "spatial"
Default: "markdown"

Determines how content is extracted and formatted from the document. See Markdown vs Spatial above for detailed guidance on choosing between these options.

{
  "config": {
    "target": "markdown"
  }
}

Chunking Strategy

`chunkingStrategy.type`

Type: "page" | "section" | "document"
Default: "page"

Determines the granularity of document chunking.

"page" - Creates a separate chunk for each page of the document. Compatible with both markdown and spatial targets.

"section" - Chunks the document into logical sections based on markdown structure (headings, subheadings). The parser ensures logical groups of content are preserved by never breaking markdown elements across chunks. Only works with target: "markdown". This is ideal for RAG systems where each chunk should be a complete semantic unit.

"document" - Treats the entire document as a single chunk. Use this for small documents or when you have custom downstream chunking requirements.

{
  "config": {
    "chunkingStrategy": {
      "type": "section"
    }
  }
}

`chunkingStrategy.options`

Fine-grained control over chunk sizing. Only applies when using type: "section" with target: "markdown". These options are ignored for "page" and "document" chunking types.

minCharacters (number, default: 500) - The minimum number of characters per chunk. Small sections may be combined to meet this minimum.

maxCharacters (number, default: 5000) - The maximum number of characters per chunk. Long sections will be split at natural boundaries when possible.

{
  "config": {
    "chunkingStrategy": {
      "type": "section",
      "options": {
        "minCharacters": 500,
        "maxCharacters": 2000
      }
    }
  }
}
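
The chunking rules above can be checked on the client before a request is sent. The sketch below is a hypothetical pre-flight check, not the API's own validation: `checkChunkingConfig` is an illustrative name, and the rule that `minCharacters` should not exceed `maxCharacters` is an assumption rather than documented behavior.

```javascript
// Hypothetical client-side check; the API performs its own validation.
// Returns a list of human-readable problems (empty when none are found).
function checkChunkingConfig(config = {}) {
  const target = config.target ?? "markdown"; // documented default
  const strategy = config.chunkingStrategy ?? {};
  const type = strategy.type ?? "page"; // documented default
  const problems = [];

  if (!["page", "section", "document"].includes(type)) {
    problems.push(`unknown chunkingStrategy.type: ${type}`);
  }
  if (type === "section" && target !== "markdown") {
    problems.push('type "section" only works with target "markdown"');
  }
  if (strategy.options && type !== "section") {
    problems.push(`chunkingStrategy.options are ignored for type "${type}"`);
  }

  const { minCharacters, maxCharacters } = strategy.options ?? {};
  if (
    minCharacters !== undefined &&
    maxCharacters !== undefined &&
    minCharacters > maxCharacters
  ) {
    // Assumption: a minimum above the maximum is never intended.
    problems.push("minCharacters is greater than maxCharacters");
  }
  return problems;
}

console.log(
  checkChunkingConfig({ target: "spatial", chunkingStrategy: { type: "section" } })
);
```

Returning a list rather than throwing lets the caller decide whether an ignored option is a warning or an error.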

Block Options

Fine-grained control over how specific content types are detected and formatted.

Figures

blockOptions.figures.enabled (boolean, default: true) - Enables or disables figure detection and parsing. When enabled, the parser uses a vision-language model (VLM) to analyze and extract content from each figure. Note: This adds processing latency, especially for documents with many figures. Disable it for the fastest processing.

blockOptions.figures.figureImageClippingEnabled (boolean, default: true) - When enabled, extracts figure images from the document and uploads them to blob storage, providing presigned URLs in the output. Each figure is cropped from the page and saved as a PNG.

{
  "config": {
    "blockOptions": {
      "figures": {
        "enabled": true,
        "figureImageClippingEnabled": true
      }
    }
  }
}

Tables

blockOptions.tables.targetFormat ("markdown" | "html", default: "markdown") - Controls the output format for tables.

markdown: Human-readable pipe syntax, works well with LLMs
html: Preserves complex table structure (merged cells, rowspan, colspan)

blockOptions.tables.tableHeaderContinuationEnabled (boolean, default: false) - When enabled, automatically propagates table headers across page breaks. Useful for long tables spanning multiple pages where headers are only present on the first page.

{
  "config": {
    "blockOptions": {
      "tables": {
        "targetFormat": "html",
        "tableHeaderContinuationEnabled": true
      }
    }
  }
}
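
When tables are returned in markdown pipe syntax, downstream code may want them as rows. The following is a minimal sketch under assumptions: it expects a standard pipe table with a header row followed by a separator row, it does not handle escaped pipes inside cells, and `parseMarkdownTable` is a hypothetical helper rather than anything provided by the API.

```javascript
// Hypothetical helper: splits a pipe-syntax markdown table into row objects
// keyed by header. Simplification: escaped pipes (\|) in cells are not handled.
function parseMarkdownTable(markdown) {
  const splitRow = (line) =>
    line
      .trim()
      .replace(/^\|/, "") // drop the leading pipe
      .replace(/\|$/, "") // drop the trailing pipe
      .split("|")
      .map((cell) => cell.trim());

  const lines = markdown
    .split("\n")
    .filter((line) => line.trim().startsWith("|"));
  if (lines.length < 2) return [];

  const headers = splitRow(lines[0]);
  // lines[1] is the |---|---| separator row, so data starts at index 2.
  return lines.slice(2).map((line) => {
    const cells = splitRow(line);
    return Object.fromEntries(
      headers.map((header, i) => [header, cells[i] ?? ""])
    );
  });
}

const rows = parseMarkdownTable(
  ["| Item | Qty |", "| --- | --- |", "| Bolts | 40 |", "| Nuts | 12 |"].join("\n")
);
console.log(rows);
```

For tables with merged cells, the html target format is likely the better input, and an HTML parser would replace this helper.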

Text

blockOptions.text.signatureDetectionEnabled (boolean, default: true in API) - Enables advanced signature detection. When enabled, identifies handwritten signatures, initials, and signature blocks in the document.

{
  "config": {
    "blockOptions": {
      "text": {
        "signatureDetectionEnabled": true
      }
    }
  }
}

Advanced Options

`advancedOptions.agenticOcrEnabled`

Type: boolean
Default: false

Enables agentic OCR - an advanced feature that uses a vision-language model (VLM) to review and correct low-confidence OCR results. The system automatically identifies text regions with low OCR confidence and applies AI-based corrections.

When to enable:

Handwritten documents or forms
Poor quality scans
Historical documents or faded text
Mixed print and handwritten content

Note: Increases latency, especially when every page is scanned.

{
  "config": {
    "advancedOptions": {
      "agenticOcrEnabled": true
    }
  }
}

`advancedOptions.pageRotationEnabled`

Type: boolean
Default: true

Enables automatic page rotation detection and correction. The system detects if pages are rotated and automatically rotates them to the correct orientation before parsing.

{
  "config": {
    "advancedOptions": {
      "pageRotationEnabled": true
    }
  }
}

`advancedOptions.pageRanges`

Type: Array<{ start: number, end: number }>
Default: [] (all pages)

Specifies which pages of the document to process. Page numbers are 1-based and ranges are inclusive. Ranges can overlap and be in any order - the system automatically merges and sorts them.

{
  "config": {
    "advancedOptions": {
      "pageRanges": [
        { "start": 1, "end": 3 },
        { "start": 10, "end": 10 }
      ]
    }
  }
}

1-based, inclusive page numbers
Ranges can overlap or arrive out of order; the platform merges and sorts them automatically
Omit the field or pass [] to process the full document (subject to global limits)
You are only billed for pages actually processed
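
To make the merge-and-sort behavior concrete, here is a client-side sketch that normalizes ranges the same way and counts the pages they cover, for example to estimate how many pages a request will process. It illustrates the documented semantics (1-based, inclusive, overlapping ranges merged) and is not the platform's implementation; `mergePageRanges` and `countPages` are hypothetical names.

```javascript
// Illustrative only: sorts inclusive page ranges and merges overlaps.
function mergePageRanges(ranges) {
  const sorted = ranges
    .map(({ start, end }) => ({ start, end })) // copy so inputs are not mutated
    .sort((a, b) => a.start - b.start);

  const merged = [];
  for (const range of sorted) {
    const last = merged[merged.length - 1];
    if (last && range.start <= last.end) {
      // Overlaps the previous range: extend it.
      last.end = Math.max(last.end, range.end);
    } else {
      merged.push(range);
    }
  }
  return merged;
}

// Ranges are inclusive, so each covers end - start + 1 pages.
function countPages(ranges) {
  return mergePageRanges(ranges).reduce(
    (total, { start, end }) => total + (end - start + 1),
    0
  );
}

const requested = [
  { start: 10, end: 10 },
  { start: 1, end: 3 },
  { start: 2, end: 5 },
];
console.log(mergePageRanges(requested)); // pages 1-5 and page 10
console.log(countPages(requested)); // prints 6
```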

The default page limit is 300, and the maximum document size is 750 pages.

Response Type

You can specify the response type in the responseType query parameter for the Parse File and Get Parser Run endpoints.

json - Returns parsed outputs in the response body
url - Returns a presigned URL to the parsed content in the response body
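
A small sketch of setting the query parameter, assuming the same endpoint URL used in the request examples in this guide. `buildParseUrl` is a hypothetical helper; it only accepts the two documented values.

```javascript
// Assumes the endpoint URL from the request examples in this guide.
const PARSE_URL = "https://api.extend.ai/parse";

// Hypothetical helper: appends the responseType query parameter.
function buildParseUrl(responseType) {
  if (!["json", "url"].includes(responseType)) {
    throw new Error(`Unsupported responseType: ${responseType}`);
  }
  const url = new URL(PARSE_URL);
  url.searchParams.set("responseType", responseType);
  return url.toString();
}

console.log(buildParseUrl("url"));
// prints "https://api.extend.ai/parse?responseType=url"
```

The resulting string can be passed as the first argument to `axios.post` in place of the bare endpoint URL.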

Complete Configuration Examples

Example 1: Optimized for RAG Pipeline

Section-based chunking with semantic boundaries:

{
  "config": {
    "target": "markdown",
    "chunkingStrategy": {
      "type": "section",
      "options": {
        "minCharacters": 500,
        "maxCharacters": 2000
      }
    },
    "blockOptions": {
      "figures": {
        "enabled": true,
        "figureImageClippingEnabled": false
      },
      "tables": {
        "targetFormat": "html"
      }
    },
    "advancedOptions": {
      "pageRotationEnabled": true
    }
  }
}

Example 2: High-Volume Processing (Performance-Optimized)

Performance-optimized configuration for fast, high-volume processing:

{
  "config": {
    "target": "markdown",
    "chunkingStrategy": {
      "type": "page"
    },
    "blockOptions": {
      "figures": {
        "enabled": false
      },
      "tables": {
        "targetFormat": "markdown"
      },
      "text": {
        "signatureDetectionEnabled": false
      }
    },
    "advancedOptions": {
      "pageRotationEnabled": false,
      "agenticOcrEnabled": false
    }
  }
}

Example 3: Complex Legal Documents

Maximum accuracy with signature detection and header continuation:

{
  "config": {
    "target": "markdown",
    "chunkingStrategy": {
      "type": "section"
    },
    "blockOptions": {
      "figures": {
        "enabled": true,
        "figureImageClippingEnabled": true
      },
      "tables": {
        "targetFormat": "html",
        "tableHeaderContinuationEnabled": true
      },
      "text": {
        "signatureDetectionEnabled": true
      }
    },
    "advancedOptions": {
      "agenticOcrEnabled": true,
      "pageRotationEnabled": true
    }
  }
}

Example 4: Handwritten Forms

Optimized for handwritten or degraded documents:

{
  "config": {
    "target": "spatial",
    "chunkingStrategy": {
      "type": "page"
    },
    "blockOptions": {
      "figures": {
        "enabled": false
      },
      "text": {
        "signatureDetectionEnabled": true
      }
    },
    "advancedOptions": {
      "agenticOcrEnabled": true,
      "pageRotationEnabled": true
    }
  }
}

Best Practices

Performance Optimization

For fastest processing:

Disable agenticOcrEnabled (most significant speedup - avoids AI-based OCR corrections)
Set figures: { enabled: false } (avoids AI-based figure analysis)
Disable pageRotationEnabled if all pages are correctly oriented
Use chunkingStrategy.type: "document" (fastest) or "page" instead of "section" (section chunking adds CPU overhead during parsing to generate semantic sections)

For highest accuracy:

Set target: "markdown" with chunkingStrategy.type: "section"
Enable agenticOcrEnabled for handwritten/degraded documents
Enable tableHeaderContinuationEnabled for multi-page tables
Enable signatureDetectionEnabled for legal documents

Troubleshooting

Poor quality OCR results

Enable agenticOcrEnabled: true for handwritten/degraded documents
Ensure pageRotationEnabled: true for rotated pages
Try target: "spatial" for very messy or skewed documents

Chunks are too large or too small

Adjust minCharacters and maxCharacters in chunkingStrategy.options if you are using type: "section" with target: "markdown".
Try different chunking types (page vs section vs document)
Consider using pageRanges to process fewer pages per request

Tables not parsing correctly

Enable tableHeaderContinuationEnabled for multi-page tables
Try targetFormat: "html" for complex table structures
Consider target: "markdown" for better table structure
Ensure tables: { enabled: true } in block options

Processing is too slow

Try these optimizations in order of impact:

Disable agenticOcrEnabled (most significant speedup - eliminates AI-based corrections)
Set figures: { enabled: false } (eliminates AI-based figure analysis)
Disable signatureDetectionEnabled
Use chunkingStrategy.type: "document" or "page" instead of "section" (reduces CPU overhead)
Use pageRanges to process fewer pages

Error Response Format

When an error occurs, the API returns a structured error response with the following fields:

code (string) - A specific error code that identifies the type of error.

message (string) - A human-readable description of the error.

requestId (string) - A unique identifier for the request, useful for troubleshooting.

retryable (boolean) - Indicates whether retrying the request might succeed.

Custom Error Codes

We provide custom error codes so your system can tell what happened when a request fails. The response body also includes a retryable field (true or false), and the table below gives the same breakdown per code. Most errors are not retryable: they are client errors related to the file provided for parsing.

Error Code	Description	Retryable
`INVALID_CONFIG_OPTIONS`	Invalid combination of options in the incoming config.	❌
`UNABLE_TO_DOWNLOAD_FILE`	The system could not download the file from the provided URL. This usually means the presigned URL has expired or is malformed.	❌
`FILE_TYPE_NOT_SUPPORTED`	The file type is not supported for parsing.	❌
`FILE_SIZE_TOO_LARGE`	The file exceeds the maximum allowed size.	❌
`CORRUPT_FILE`	The file is corrupt and cannot be parsed.	❌
`OCR_ERROR`	An error occurred in the OCR system. This is a rare error that indicates downtime, so requests can be retried. We suggest retrying with backoff.	✅
`PASSWORD_PROTECTED_FILE`	The file is password protected and cannot be parsed.	❌
`FAILED_TO_CONVERT_TO_PDF`	The system could not convert the file to PDF format.	❌
`FAILED_TO_GENERATE_TARGET_FORMAT`	The system could not generate the requested target format.	❌
`INTERNAL_ERROR`	An unexpected internal error occurred. We suggest retrying with backoff, as this is likely the result of an outage.	✅

HTTP error codes

The HTTP status codes below correspond to the different types of failure. We generally recommend relying on the custom error codes for programmatic handling.

400 Bad Request

Returned when:

Required fields are missing (e.g., file)
fileUrl is not provided in the file object
The provided fileUrl is invalid
The config contains invalid values (e.g., unsupported target format or chunking strategy)
The file type is not supported
The file size is too large

401 Unauthorized

Returned when:

The API token is missing
The API token is invalid

403 Forbidden

Returned when:

The authenticated workspace doesn’t have permission to use the parse functionality
The API token doesn’t have sufficient permissions

422 Unprocessable Entity

Returned when:

The file is corrupt and cannot be parsed
The file is password protected
The file could not be converted to PDF
The system failed to generate the target format

500 Internal Server Error

Returned when:

An OCR error occurs
A chunking error occurs
Any other unexpected error occurs during parsing

Handling Errors

Here is an example of how to handle errors from the Parse API:

const axios = require("axios");

const parseDocument = async () => {
  try {
    const response = await axios.post(
      "https://api.extend.ai/parse",
      {
        file: {
          fileName: "example.pdf",
          fileUrl: "https://example.com/documents/example.pdf",
        },
        config: {
          target: "markdown",
        },
      },
      {
        headers: {
          Authorization: "Bearer <API_TOKEN>",
          "Content-Type": "application/json",
        },
      }
    );

    console.log("Document parsed successfully:", response.data);
    return response.data;
  } catch (error) {
    if (error.response) {
      // Fall back to an empty object if the error response has no body
      const { code, message, requestId, retryable } = error.response.data ?? {};
      
      // Handle specific error codes
      switch (code) {
        case "FILE_TYPE_NOT_SUPPORTED":
          console.error("Unsupported file type. Please use a supported format.");
          break;
        case "PASSWORD_PROTECTED_FILE":
          console.error("The file is password protected. Please provide an unprotected file.");
          break;
        case "CORRUPT_FILE":
          console.error("The file is corrupt and cannot be processed.");
          break;
        case "FILE_SIZE_TOO_LARGE":
          console.error("The file is too large. Please reduce the file size.");
          break;
        default:
          console.error(`Error (${code}): ${message}`);
      }
      
      // Log request ID for troubleshooting
      console.error(`Request ID: ${requestId}`);
      
      // Potentially retry if the error is retryable
      if (retryable) {
        console.log("This error is retryable. Consider retrying the request.");
      }
    } else {
      console.error("Network error:", error.message);
    }
    
    throw error;
  }
};
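
For the retryable errors (`OCR_ERROR`, `INTERNAL_ERROR`), a retry with backoff might look like the sketch below. It is a generic wrapper under assumptions: `withRetry`, `maxAttempts`, and `baseDelayMs` are hypothetical names, the delays are arbitrary starting points, and the error is expected to carry the response body on `error.response.data`, as axios errors do.

```javascript
// Hypothetical wrapper: retries an async call with exponential backoff, but
// only when the error response body marks the failure as retryable.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function withRetry(request, { maxAttempts = 3, baseDelayMs = 500 } = {}) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await request();
    } catch (error) {
      // axios exposes the response body on error.response.data.
      const retryable = error.response?.data?.retryable === true;
      if (!retryable || attempt >= maxAttempts) {
        throw error; // non-retryable, or out of attempts
      }
      // Delay doubles each attempt: baseDelayMs, 2x, 4x, ...
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
}
```

Because the `parseDocument` function above rethrows the error it catches, it can be wrapped directly, for example `withRetry(() => parseDocument())`.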
