Parse File

The Parse endpoint converts documents into structured, machine-readable formats and gives you fine-grained control over the parsing process. It is ideal for extracting cleaned document content to use as context for downstream processing, e.g. RAG pipelines, custom ingestion pipelines, embeddings, classification, etc.

For a deeper guide on how to use the output of this endpoint, jump to Using Parsed Output.

Choosing a target? See Markdown vs Spatial for guidance.

Node.js
const axios = require("axios");

const parseDocument = async () => {
  try {
    const response = await axios.post(
      "https://api.extend.ai/parse",
      {
        file: {
          fileName: "example.pdf",
          fileUrl: "https://example.com/documents/example.pdf",
        },
        config: {
          target: "markdown",
          chunkingStrategy: {
            type: "page",
          },
          blockOptions: {
            figures: {
              enabled: true,
              figureImageClippingEnabled: true,
            },
            tables: {
              tableHeaderContinuationEnabled: false,
            },
            text: {
              signatureDetectionEnabled: true,
            },
          },
        },
      },
      {
        headers: {
          Authorization: "Bearer <API_TOKEN>",
          "Content-Type": "application/json",
        },
      }
    );

    console.log("Document parsed successfully:", response.data);
  } catch (error) {
    console.error("Error:", error.response?.data || error.message);
  }
};

parseDocument();

Using Parsed Output

The Parse API returns document content in a structured format that provides both high-level formatted content and detailed block-level information. Understanding how to work with this output will help you get the most value from the parsing service.

Working with Chunks

Each chunk (currently only page-level chunks are supported) contains two key properties:

  1. content: A fully formatted representation of the entire chunk in the target format (e.g., markdown). This is ready to use as-is if you need the complete formatted content of a page.

  2. blocks: An array of individual content blocks that make up the chunk, each with its own formatting, position information, and metadata.

When to use chunk.content vs. chunk.blocks

  • Use chunk.content when:

    • You need the complete, properly formatted content of a page, with blocks already placed in logical order (e.g. markdown sections grouped, or content positioned spatially)
    • You want to display or process the document content as a whole (and can just combine all chunk.content values)
    • You’re integrating with systems that expect formatted text (e.g., markdown processors)
  • Use chunk.blocks when:

    • You need to work with specific elements of the document (e.g., only tables or figures)
    • You need spatial information about where content appears on the page, perhaps to build citation systems
    • You’re building a UI that shows or highlights specific document elements
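
As a minimal sketch of the "combine all chunk.content values" case above, the helper below joins every chunk's content into one string. It assumes only the parseResult.chunks shape used by the examples in this guide; the function name joinChunkContent, the separator default, and the sample data are illustrative and not part of the API.

```javascript
// Sketch: join every chunk's formatted content into one document string.
// Assumes parseResult.chunks is an array of { content, blocks } objects.
function joinChunkContent(parseResult, separator = "\n\n") {
  return parseResult.chunks
    .map((chunk) => chunk.content)
    // Skip chunks with no usable content (e.g. blank pages).
    .filter((content) => typeof content === "string" && content.length > 0)
    .join(separator);
}

// Hypothetical sample shaped like the output described above.
const sampleParseResult = {
  chunks: [
    { content: "# Title\n\nFirst page.", blocks: [] },
    { content: "", blocks: [] },
    { content: "Second page.", blocks: [] },
  ],
};

console.log(joinChunkContent(sampleParseResult));
```

The result can be passed directly to a markdown processor or an embedding step; use chunk.blocks instead when you need per-element positions.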

Example: Extracting specific content types

// Extract all tables from a document
function extractTables(parseResult) {
  const tables = [];

  parseResult.chunks.forEach(chunk => {
    chunk.blocks.forEach(block => {
      if (block.type === 'table') {
        tables.push({
          content: block.content,
          pageNumber: block.metadata.pageNumber,
          position: block.boundingBox
        });
      }
    });
  });

  return tables;
}

// Extract all figures with their images
function extractFigures(parseResult) {
  const figures = [];

  parseResult.chunks.forEach(chunk => {
    chunk.blocks.forEach(block => {
      if (block.type === 'figure' && block.details.imageUrl) {
        figures.push({
          caption: block.content,
          imageUrl: block.details.imageUrl,
          figureType: block.details.figureType,
          pageNumber: block.metadata.pageNumber
        });
      }
    });
  });

  return figures;
}

Example: Reconstructing content with custom formatting

// Extract headings and their content to create a table of contents
function createTableOfContents(parseResult) {
  const toc = [];

  parseResult.chunks.forEach(chunk => {
    chunk.blocks.forEach(block => {
      if (block.type === 'heading' || block.type === 'section_heading') {
        toc.push({
          title: block.content,
          pageNumber: block.metadata.pageNumber
        });
      }
    });
  });

  return toc;
}

Spatial Information

Each block contains spatial information in the form of a polygon (precise outline) and a simplified boundingBox. This information can be used to:

  • Highlight specific content in a document viewer
  • Create visual overlays on top of the original document
  • Understand the reading order and layout of the document

// Create highlight coordinates for a document viewer
function createHighlights(parseResult, searchTerm) {
  const highlights = [];

  parseResult.chunks.forEach(chunk => {
    chunk.blocks.forEach(block => {
      if (block.type === 'text' && block.content.includes(searchTerm)) {
        highlights.push({
          pageNumber: block.metadata.pageNumber,
          boundingBox: block.boundingBox
        });
      }
    });
  });

  return highlights;
}
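
A document viewer usually draws overlays one page at a time, so it can help to group highlights by page first. The sketch below does that for the { pageNumber, boundingBox } objects returned by createHighlights. The helper name groupHighlightsByPage is illustrative, and the bounding box fields in the sample data are hypothetical placeholders: the helper treats each boundingBox as an opaque value.

```javascript
// Sketch: group highlight bounding boxes by page number for per-page rendering.
function groupHighlightsByPage(highlights) {
  const byPage = new Map();
  for (const { pageNumber, boundingBox } of highlights) {
    if (!byPage.has(pageNumber)) {
      byPage.set(pageNumber, []);
    }
    byPage.get(pageNumber).push(boundingBox);
  }
  return byPage;
}

// Hypothetical highlights; the boundingBox field names are assumptions.
const sampleHighlights = [
  { pageNumber: 1, boundingBox: { left: 0.1, top: 0.2, right: 0.5, bottom: 0.25 } },
  { pageNumber: 3, boundingBox: { left: 0.1, top: 0.6, right: 0.4, bottom: 0.65 } },
  { pageNumber: 1, boundingBox: { left: 0.2, top: 0.7, right: 0.9, bottom: 0.75 } },
];

for (const [page, boxes] of groupHighlightsByPage(sampleHighlights)) {
  console.log(`page ${page}: ${boxes.length} highlight(s)`);
}
```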

By leveraging both the formatted content and the structured block information, you can build powerful document processing workflows that combine the convenience of formatted text with the precision of block-level access.

Markdown vs Spatial

  • markdown: Clean, logical reading order using true markdown constructs (headings, lists, tables, checkboxes). Supports section-aware chunking and works best for LLMs and RAG.

  • spatial: Layout/position-preserving text that uses markdown elements for block types (e.g. tables, checkboxes) but is not strictly markdown due to tabs/whitespace used to preserve placement. Chunks are page-based only.

  • When to choose markdown:

    • Default choice for most documents
    • Better for multi‑column layouts (content is linearized into readable paragraphs)
    • Enables logical section chunking for improved retrieval
  • When to choose spatial:

    • Very messy, scanned, or handwritten docs (e.g. healthcare notes, skewed scans)
    • You need a near 1:1 text representation of the original layout
    • You rely on spatial consistency or vector/distance‑based clustering across documents
    • BOLs (bills of lading) and similar scanned logistics documents often parse better with spatial

Tip: If unsure, start with markdown. Switch to spatial if you need layout fidelity or encounter scanned/skewed inputs where reading order is unreliable.
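
The guidance above can be sketched as a small decision helper. This is only one rough way to encode the rule of thumb: the function name chooseTarget and its three flags are hypothetical, and how you detect each document trait is up to your own pipeline.

```javascript
// Sketch: pick a parse target from coarse document traits.
// Defaults to "markdown"; switches to "spatial" when layout fidelity matters
// or the input is likely to have unreliable reading order.
function chooseTarget({
  scannedOrSkewed = false,
  handwritten = false,
  needsLayoutFidelity = false,
} = {}) {
  if (scannedOrSkewed || handwritten || needsLayoutFidelity) {
    return "spatial";
  }
  return "markdown";
}

console.log(chooseTarget()); // clean digital document
console.log(chooseTarget({ handwritten: true })); // handwritten notes
```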

Configuration Options

The Parse API accepts a config object that controls how documents are parsed and processed. Configuration options are organized into several categories:

  • Target Format: Output format for parsed content (markdown vs spatial)
  • Chunking Strategy: How the document is divided into chunks
  • Block Options: Fine-grained control over parsing specific layout types
  • Advanced Options: OCR enhancements and page filtering

Target Format

target

Type: "markdown" | "spatial"
Default: "markdown"

Determines how content is extracted and formatted from the document. See Markdown vs Spatial above for detailed guidance on choosing between these options.

{
  "config": {
    "target": "markdown"
  }
}

Chunking Strategy

chunkingStrategy.type

Type: "page" | "section" | "document"
Default: "page"

Determines the granularity of document chunking.

"page" - Creates a separate chunk for each page of the document. Compatible with both markdown and spatial targets.

"section" - Chunks the document into logical sections based on markdown structure (headings, subheadings). The parser ensures logical groups of content are preserved by never breaking markdown elements across chunks. Only works with target: "markdown". This is ideal for RAG systems where each chunk should be a complete semantic unit.

"document" - Treats the entire document as a single chunk. Use this for small documents or when you have custom downstream chunking requirements.

{
  "config": {
    "chunkingStrategy": {
      "type": "section"
    }
  }
}

chunkingStrategy.options

Fine-grained control over chunk sizing. Only applies when using type: "section" with target: "markdown". These options are ignored for "page" and "document" chunking types.

minCharacters (number, default: 500) - The minimum number of characters per chunk. Small sections may be combined to meet this minimum.

maxCharacters (number, default: 5000) - The maximum number of characters per chunk. Long sections will be split at natural boundaries when possible.

{
  "config": {
    "chunkingStrategy": {
      "type": "section",
      "options": {
        "minCharacters": 500,
        "maxCharacters": 2000
      }
    }
  }
}
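
The compatibility rules above are easy to check on the client before sending a request. The sketch below is one possible pre-flight check, not the API's own validation: the function name checkChunkingConfig is hypothetical, and the minCharacters/maxCharacters comparison is an extra sanity check of ours rather than a documented rule.

```javascript
// Sketch: collect chunking-related problems in a config object.
function checkChunkingConfig(config = {}) {
  const problems = [];
  const target = config.target ?? "markdown"; // documented default
  const strategy = config.chunkingStrategy ?? {};
  const type = strategy.type ?? "page"; // documented default

  if (type === "section" && target !== "markdown") {
    problems.push('"section" chunking only works with target: "markdown"');
  }
  if (strategy.options && type !== "section") {
    problems.push(`chunkingStrategy.options are ignored for "${type}" chunking`);
  }

  // Documented defaults: minCharacters 500, maxCharacters 5000.
  const { minCharacters = 500, maxCharacters = 5000 } = strategy.options ?? {};
  if (minCharacters > maxCharacters) {
    // Our own sanity check, not a documented API rule.
    problems.push("minCharacters is greater than maxCharacters");
  }
  return problems;
}

console.log(
  checkChunkingConfig({ target: "spatial", chunkingStrategy: { type: "section" } })
);
```

An empty array means none of these checks found an issue; the server may still reject a config for other reasons (see INVALID_CONFIG_OPTIONS).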

Block Options

Fine-grained control over how specific content types are detected and formatted.

Figures

blockOptions.figures.enabled (boolean, default: true) - Enables or disables figure detection and parsing. When enabled, the parser uses a VLM to analyze and extract content from each figure. Note: This adds processing latency, especially for documents with many figures. Disable for fastest processing.

blockOptions.figures.figureImageClippingEnabled (boolean, default: true) - When enabled, extracts figure images from the document and uploads them to blob storage, providing presigned URLs in the output. Each figure is cropped from the page and saved as a PNG.

{
  "config": {
    "blockOptions": {
      "figures": {
        "enabled": true,
        "figureImageClippingEnabled": true
      }
    }
  }
}

Tables

blockOptions.tables.targetFormat ("markdown" | "html", default: "markdown") - Controls the output format for tables.

  • markdown: Human-readable pipe syntax, works well with LLMs
  • html: Preserves complex table structure (merged cells, rowspan, colspan)

blockOptions.tables.tableHeaderContinuationEnabled (boolean, default: false) - When enabled, automatically propagates table headers across page breaks. Useful for long tables spanning multiple pages where headers are only present on the first page.

{
  "config": {
    "blockOptions": {
      "tables": {
        "targetFormat": "html",
        "tableHeaderContinuationEnabled": true
      }
    }
  }
}

Text

blockOptions.text.signatureDetectionEnabled (boolean, default: true in API) - Enables advanced signature detection. When enabled, identifies handwritten signatures, initials, and signature blocks in the document.

{
  "config": {
    "blockOptions": {
      "text": {
        "signatureDetectionEnabled": true
      }
    }
  }
}

Advanced Options

advancedOptions.agenticOcrEnabled

Type: boolean
Default: false

Enables agentic OCR - an advanced feature that uses a vision-language model (VLM) to review and correct low-confidence OCR results. The system automatically identifies text regions with low OCR confidence and applies AI-based corrections.

When to enable:

  • Handwritten documents or forms
  • Poor quality scans
  • Historical documents or faded text
  • Mixed print and handwritten content

Note: Increases latency, especially when every page is scanned.

{
  "config": {
    "advancedOptions": {
      "agenticOcrEnabled": true
    }
  }
}

advancedOptions.pageRotationEnabled

Type: boolean
Default: true

Enables automatic page rotation detection and correction. The system detects if pages are rotated and automatically rotates them to the correct orientation before parsing.

{
  "config": {
    "advancedOptions": {
      "pageRotationEnabled": true
    }
  }
}

advancedOptions.pageRanges

Type: Array<{ start: number, end: number }>
Default: [] (all pages)

Specifies which pages of the document to process. Page numbers are 1-based and ranges are inclusive. Ranges can overlap and be in any order - the system automatically merges and sorts them.

{
  "config": {
    "advancedOptions": {
      "pageRanges": [
        { "start": 1, "end": 3 },
        { "start": 10, "end": 10 }
      ]
    }
  }
}

  • 1-based, inclusive page numbers
  • Ranges can overlap or arrive out of order; the platform merges and sorts them automatically
  • Omit the field or pass [] to process the full document (subject to global limits)
  • You are only billed for pages actually processed
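
To estimate how many pages a set of ranges will cover before sending a request, the merge-and-sort behavior described above can be mirrored on the client. This is an illustrative sketch, not the platform's implementation: mergePageRanges and countPages are hypothetical helper names, and only overlapping ranges are merged here (adjacent ranges are kept separate, which does not change the page count).

```javascript
// Sketch: merge overlapping 1-based, inclusive page ranges.
function mergePageRanges(ranges) {
  // Sort a copy by start page (then end page) so the input is not mutated.
  const sorted = [...ranges].sort((a, b) => a.start - b.start || a.end - b.end);
  const merged = [];
  for (const { start, end } of sorted) {
    const last = merged[merged.length - 1];
    if (last && start <= last.end) {
      // Overlaps the previous range: extend it if needed.
      last.end = Math.max(last.end, end);
    } else {
      merged.push({ start, end });
    }
  }
  return merged;
}

// Count distinct pages covered by the ranges.
function countPages(ranges) {
  return mergePageRanges(ranges).reduce(
    (total, range) => total + (range.end - range.start + 1),
    0
  );
}

const requestedRanges = [
  { start: 10, end: 10 },
  { start: 2, end: 5 },
  { start: 1, end: 3 },
];
console.log(mergePageRanges(requestedRanges)); // [ { start: 1, end: 5 }, { start: 10, end: 10 } ]
console.log(countPages(requestedRanges)); // 6
```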

The default page limit is 300, and the maximum document size is 750 pages.

Response Type

You can specify the response type in the responseType query parameter for the Parse File and Get Parser Run endpoints.

  • json - Returns parsed outputs in the response body
  • url - Returns a presigned URL to the parsed content in the response body
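
As a small sketch of setting this query parameter, the helper below builds the request URL with Node's built-in URL class. The endpoint is the one used in the examples in this guide; the helper name buildParseUrl and the client-side check of allowed values are illustrative additions.

```javascript
// Sketch: append the responseType query parameter to the Parse endpoint URL.
function buildParseUrl(responseType) {
  const allowed = ["json", "url"]; // values listed above
  if (!allowed.includes(responseType)) {
    throw new Error(`Unsupported responseType: ${responseType}`);
  }
  const url = new URL("https://api.extend.ai/parse");
  url.searchParams.set("responseType", responseType);
  return url.toString();
}

console.log(buildParseUrl("url")); // https://api.extend.ai/parse?responseType=url
```

The returned string can be passed as the first argument to axios.post in place of the bare endpoint URL.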

Complete Configuration Examples

Example 1: Optimized for RAG Pipeline

Section-based chunking with semantic boundaries:

{
  "config": {
    "target": "markdown",
    "chunkingStrategy": {
      "type": "section",
      "options": {
        "minCharacters": 500,
        "maxCharacters": 2000
      }
    },
    "blockOptions": {
      "figures": {
        "enabled": true,
        "figureImageClippingEnabled": false
      },
      "tables": {
        "targetFormat": "html"
      }
    },
    "advancedOptions": {
      "pageRotationEnabled": true
    }
  }
}

Example 2: High-Volume Processing (Performance-Optimized)

Performance-optimized configuration for fast, high-volume processing:

{
  "config": {
    "target": "markdown",
    "chunkingStrategy": {
      "type": "page"
    },
    "blockOptions": {
      "figures": {
        "enabled": false
      },
      "tables": {
        "targetFormat": "markdown"
      },
      "text": {
        "signatureDetectionEnabled": false
      }
    },
    "advancedOptions": {
      "pageRotationEnabled": false,
      "agenticOcrEnabled": false
    }
  }
}

Example 3: Maximum Accuracy

Maximum accuracy with signature detection and header continuation:

{
  "config": {
    "target": "markdown",
    "chunkingStrategy": {
      "type": "section"
    },
    "blockOptions": {
      "figures": {
        "enabled": true,
        "figureImageClippingEnabled": true
      },
      "tables": {
        "targetFormat": "html",
        "tableHeaderContinuationEnabled": true
      },
      "text": {
        "signatureDetectionEnabled": true
      }
    },
    "advancedOptions": {
      "agenticOcrEnabled": true,
      "pageRotationEnabled": true
    }
  }
}

Example 4: Handwritten Forms

Optimized for handwritten or degraded documents:

{
  "config": {
    "target": "spatial",
    "chunkingStrategy": {
      "type": "page"
    },
    "blockOptions": {
      "figures": {
        "enabled": false
      },
      "text": {
        "signatureDetectionEnabled": true
      }
    },
    "advancedOptions": {
      "agenticOcrEnabled": true,
      "pageRotationEnabled": true
    }
  }
}

Best Practices

Performance Optimization

For fastest processing:

  • Disable agenticOcrEnabled (most significant speedup - avoids AI-based OCR corrections)
  • Set figures: { enabled: false } (avoids AI-based figure analysis)
  • Disable pageRotationEnabled if all pages are correctly oriented
  • Use chunkingStrategy.type: "document" (fastest) or "page" instead of "section" (section chunking adds CPU overhead during parsing to generate semantic sections)

For highest accuracy:

  • Set target: "markdown" with chunkingStrategy.type: "section"
  • Enable agenticOcrEnabled for handwritten/degraded documents
  • Enable tableHeaderContinuationEnabled for multi-page tables
  • Enable signatureDetectionEnabled for legal documents

Troubleshooting

Poor quality OCR results

  1. Enable agenticOcrEnabled: true for handwritten/degraded documents
  2. Ensure pageRotationEnabled: true for rotated pages
  3. Try target: "spatial" for very messy or skewed documents

Chunks are too large or too small

  1. Adjust minCharacters and maxCharacters in chunkingStrategy.options if you are using type: "section" with target: "markdown".
  2. Try different chunking types (page vs section vs document)
  3. Consider using pageRanges to process fewer pages per request

Tables not parsing correctly

  1. Enable tableHeaderContinuationEnabled for multi-page tables
  2. Try targetFormat: "html" for complex table structures
  3. Consider target: "markdown" for better table structure
  4. Ensure tables: { enabled: true } in block options

Processing is too slow

Try these optimizations in order of impact:

  1. Disable agenticOcrEnabled (most significant speedup - eliminates AI-based corrections)
  2. Set figures: { enabled: false } (eliminates AI-based figure analysis)
  3. Disable signatureDetectionEnabled
  4. Use chunkingStrategy.type: "document" or "page" instead of "section" (reduces CPU overhead)
  5. Use pageRanges to process fewer pages
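
The first four optimizations above can be applied to an existing config in one step. The sketch below is one way to do that without mutating the original object; the helper name applySpeedOverrides is hypothetical, and it only touches the options named in the list (pageRanges is left to the caller, since it depends on the document).

```javascript
// Sketch: return a copy of a config with the speed-oriented settings applied.
function applySpeedOverrides(config = {}) {
  const blockOptions = config.blockOptions ?? {};
  const fast = {
    ...config,
    blockOptions: {
      ...blockOptions,
      // Skip AI-based figure analysis.
      figures: { ...blockOptions.figures, enabled: false },
      // Skip signature detection.
      text: { ...blockOptions.text, signatureDetectionEnabled: false },
    },
    // Skip AI-based OCR corrections.
    advancedOptions: { ...config.advancedOptions, agenticOcrEnabled: false },
  };
  // Section chunking adds overhead; fall back to page chunking.
  if (config.chunkingStrategy?.type === "section") {
    fast.chunkingStrategy = { type: "page" };
  }
  return fast;
}

const accurateConfig = {
  target: "markdown",
  chunkingStrategy: { type: "section", options: { minCharacters: 500, maxCharacters: 2000 } },
  blockOptions: { tables: { targetFormat: "html" } },
  advancedOptions: { agenticOcrEnabled: true, pageRotationEnabled: true },
};

console.log(JSON.stringify(applySpeedOverrides(accurateConfig), null, 2));
```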

Error Response Format

When an error occurs, the API returns a structured error response with the following fields:

code (string) - A specific error code that identifies the type of error.

message (string) - A human-readable description of the error.

requestId (string) - A unique identifier for the request, useful for troubleshooting.

retryable (boolean) - Indicates whether retrying the request might succeed.

Custom Error Codes

We provide custom error codes so your system can tell what went wrong when a request fails. The response body also includes a retryable field (true or false), and a per-code breakdown is given below. Most errors are not retryable: they are client errors related to the file provided for parsing.

Error Code | Description | Retryable
INVALID_CONFIG_OPTIONS | Invalid combination of options in the incoming config. |
UNABLE_TO_DOWNLOAD_FILE | The system could not download the file from the provided URL. This likely means your presigned URL is expired or malformed. |
FILE_TYPE_NOT_SUPPORTED | The file type is not supported for parsing. |
FILE_SIZE_TOO_LARGE | The file exceeds the maximum allowed size. |
CORRUPT_FILE | The file is corrupt and cannot be parsed. |
OCR_ERROR | An error occurred in the OCR system. This is a rare error code and would indicate downtime, so requests can be retried. We’d suggest applying a retry with backoff for this error. | Yes
PASSWORD_PROTECTED_FILE | The file is password protected and cannot be parsed. |
FAILED_TO_CONVERT_TO_PDF | The system could not convert the file to PDF format. |
FAILED_TO_GENERATE_TARGET_FORMAT | The system could not generate the requested target format. |
INTERNAL_ERROR | An unexpected internal error occurred. We’d suggest applying a retry with backoff for this error, as it is likely the result of an outage. | Yes
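
The "retry with backoff" suggestion for OCR_ERROR and INTERNAL_ERROR can be sketched as follows. This is a minimal illustration under a few assumptions: errors are shaped like axios errors (error.response.data carries the fields described above), the retry decision relies on the retryable field, and the helper names, retry count, and delays are hypothetical defaults rather than recommended values.

```javascript
// Sketch: decide whether an error is worth retrying, based on the response body.
function isRetryable(error) {
  return error?.response?.data?.retryable === true;
}

// Exponential backoff schedule: base, 2x base, 4x base, ...
function backoffDelays(maxRetries, baseDelayMs = 500) {
  return Array.from({ length: maxRetries }, (_, i) => baseDelayMs * 2 ** i);
}

// Run an async request function, retrying retryable failures with backoff.
async function withRetry(request, { maxRetries = 3, baseDelayMs = 500 } = {}) {
  const delays = backoffDelays(maxRetries, baseDelayMs);
  for (let attempt = 0; ; attempt++) {
    try {
      return await request();
    } catch (error) {
      // Give up on non-retryable errors or once the schedule is exhausted.
      if (!isRetryable(error) || attempt >= delays.length) {
        throw error;
      }
      await new Promise((resolve) => setTimeout(resolve, delays[attempt]));
    }
  }
}

console.log(backoffDelays(3)); // [ 500, 1000, 2000 ]
```

A request would be wrapped as withRetry(() => axios.post(...)), so that each attempt issues a fresh call.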

HTTP error codes

HTTP status codes corresponding to different types of failures. We generally recommend relying on our custom error codes for programmatic handling.

400 Bad Request

Returned when:

  • Required fields are missing (e.g., file)
  • fileUrl is not provided in the file object
  • The provided fileUrl is invalid
  • The config contains invalid values (e.g., unsupported target format or chunking strategy)
  • The file type is not supported
  • The file size is too large

401 Unauthorized

Returned when:

  • The API token is missing
  • The API token is invalid

403 Forbidden

Returned when:

  • The authenticated workspace doesn’t have permission to use the parse functionality
  • The API token doesn’t have sufficient permissions

422 Unprocessable Entity

Returned when:

  • The file is corrupt and cannot be parsed
  • The file is password protected
  • The file could not be converted to PDF
  • The system failed to generate the target format

500 Internal Server Error

Returned when:

  • An OCR error occurs
  • A chunking error occurs
  • Any other unexpected error occurs during parsing

Handling Errors

Here is an example of how to handle errors from the Parse API:

const axios = require("axios");

const parseDocument = async () => {
  try {
    const response = await axios.post(
      "https://api.extend.ai/parse",
      {
        file: {
          fileName: "example.pdf",
          fileUrl: "https://example.com/documents/example.pdf",
        },
        config: {
          target: "markdown",
        },
      },
      {
        headers: {
          Authorization: "Bearer <API_TOKEN>",
          "Content-Type": "application/json",
        },
      }
    );

    console.log("Document parsed successfully:", response.data);
    return response.data;
  } catch (error) {
    if (error.response) {
      const { code, message, requestId, retryable } = error.response.data;

      // Handle specific error codes
      switch (code) {
        case "FILE_TYPE_NOT_SUPPORTED":
          console.error("Unsupported file type. Please use a supported format.");
          break;
        case "PASSWORD_PROTECTED_FILE":
          console.error("The file is password protected. Please provide an unprotected file.");
          break;
        case "CORRUPT_FILE":
          console.error("The file is corrupt and cannot be processed.");
          break;
        case "FILE_SIZE_TOO_LARGE":
          console.error("The file is too large. Please reduce the file size.");
          break;
        default:
          console.error(`Error (${code}): ${message}`);
      }

      // Log request ID for troubleshooting
      console.error(`Request ID: ${requestId}`);

      // Potentially retry if the error is retryable
      if (retryable) {
        console.log("This error is retryable. Consider retrying the request.");
      }
    } else {
      console.error("Network error:", error.message);
    }

    throw error;
  }
};