Parse File

The Parse endpoint converts documents into structured, machine-readable formats and gives you fine-grained control over the parsing process. It is ideal for extracting cleaned document content to use as context for downstream processing, e.g. RAG pipelines, custom ingestion pipelines, embeddings, classification, etc.

Unlike processor and workflow runs, parsing is a synchronous endpoint and returns the parsed content in the response. Expected latency depends primarily on file size. This makes it suitable for workflows where you need immediate access to document content without waiting for asynchronous processing.

For a deeper guide on how to use the output of this endpoint, jump to Using Parsed Output.

Node.js
const axios = require("axios");

const parseDocument = async () => {
  try {
    const response = await axios.post(
      "https://api-prod.extend.app/parse",
      {
        file: {
          fileName: "example.pdf",
          fileUrl: "https://example.com/documents/example.pdf",
        },
        config: {
          target: "markdown",
          chunkingStrategy: {
            type: "page",
          },
          blockOptions: {
            figures: {
              enabled: true,
              figureImageClippingEnabled: true,
            },
            tables: {
              enabled: true,
            },
            text: {
              enabled: true,
              styleFormattingEnabled: true,
            },
          },
        },
      },
      {
        headers: {
          Authorization: "Bearer <API_TOKEN>",
          "Content-Type": "application/json",
        },
      }
    );

    console.log("Document parsed successfully:", response.data);
  } catch (error) {
    console.error("Error:", error.response?.data || error.message);
  }
};

parseDocument();

Using Parsed Output

The Parse API returns document content in a structured format that provides both high-level formatted content and detailed block-level information. Understanding how to work with this output will help you get the most value from the parsing service.

Working with Chunks

Each chunk (currently only page-level chunks are supported) contains two key properties:

  1. content: A fully formatted representation of the entire chunk in the target format (e.g., markdown). This is ready to use as-is if you need the complete formatted content of a page.

  2. blocks: An array of individual content blocks that make up the chunk, each with its own formatting, position information, and metadata.

When to use chunk.content vs. chunk.blocks

  • Use chunk.content when:

    • You need the complete, properly formatted content of a page, with the logical placement of blocks already done for you (e.g., markdown sections grouped and blocks ordered spatially)
    • You want to display or process the document content as a whole (and can just combine all chunk.content values)
    • You’re integrating with systems that expect formatted text (e.g., markdown processors)
  • Use chunk.blocks when:

    • You need to work with specific elements of the document (e.g., only tables or figures)
    • You need spatial information about where content appears on the page, perhaps to build citation systems
    • You’re building a UI that shows or highlights specific document elements
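
As a minimal sketch of the "combine all chunk.content values" case, the helper below joins the formatted content of every chunk into one string. The name joinChunkContent, the separator, and the sample object are illustrative only; the sketch assumes the response exposes a chunks array shaped like the one used in the examples in this guide.

```javascript
// Hypothetical helper: concatenates the formatted content of every chunk.
// Assumes parseResult.chunks is an array of { content, blocks } objects.
function joinChunkContent(parseResult, separator = "\n\n") {
  const chunks = Array.isArray(parseResult?.chunks) ? parseResult.chunks : [];
  return chunks
    .map((chunk) => chunk.content ?? "")
    .filter((content) => content.length > 0) // skip empty pages
    .join(separator);
}

// Hand-written stand-in for a parse response, for illustration only.
const sampleResult = {
  chunks: [
    { content: "# Page 1\n\nIntro text.", blocks: [] },
    { content: "", blocks: [] },
    { content: "# Page 2\n\nMore text.", blocks: [] },
  ],
};

console.log(joinChunkContent(sampleResult));
```

Skipping empty chunks avoids runs of blank separators when a page has no extractable content.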

Example: Extracting specific content types

// Extract all tables from a document
function extractTables(parseResult) {
  const tables = [];

  parseResult.chunks.forEach(chunk => {
    chunk.blocks.forEach(block => {
      if (block.type === 'table') {
        tables.push({
          content: block.content,
          pageNumber: block.metadata.pageNumber,
          position: block.boundingBox
        });
      }
    });
  });

  return tables;
}

// Extract all figures with their images
function extractFigures(parseResult) {
  const figures = [];

  parseResult.chunks.forEach(chunk => {
    chunk.blocks.forEach(block => {
      if (block.type === 'figure' && block.details.imageUrl) {
        figures.push({
          caption: block.content,
          imageUrl: block.details.imageUrl,
          figureType: block.details.figureType,
          pageNumber: block.metadata.pageNumber
        });
      }
    });
  });

  return figures;
}

Example: Reconstructing content with custom formatting

// Collect headings and their page numbers to create a table of contents
function createTableOfContents(parseResult) {
  const toc = [];

  parseResult.chunks.forEach(chunk => {
    chunk.blocks.forEach(block => {
      if (block.type === 'heading' || block.type === 'section_heading') {
        toc.push({
          title: block.content,
          pageNumber: block.metadata.pageNumber
        });
      }
    });
  });

  return toc;
}

Spatial Information

Each block contains spatial information in the form of a polygon (precise outline) and a simplified boundingBox. This information can be used to:

  • Highlight specific content in a document viewer
  • Create visual overlays on top of the original document
  • Understand the reading order and layout of the document

// Create highlight coordinates for a document viewer
function createHighlights(parseResult, searchTerm) {
  const highlights = [];

  parseResult.chunks.forEach(chunk => {
    chunk.blocks.forEach(block => {
      if (block.type === 'text' && block.content.includes(searchTerm)) {
        highlights.push({
          pageNumber: block.metadata.pageNumber,
          boundingBox: block.boundingBox
        });
      }
    });
  });

  return highlights;
}
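
To turn a bounding box into a visual overlay, a viewer typically has to scale page coordinates to its own rendered size. The sketch below shows one way to do that. It rests on an assumption that is not confirmed by this guide: that boundingBox carries numeric left, top, right, and bottom fields in the same units as the page dimensions. The function name toOverlayStyle and the sample numbers are hypothetical, so check the API reference for the actual field names before relying on it.

```javascript
// ASSUMPTION: boundingBox has numeric left/top/right/bottom fields expressed
// in the same units as pageSize. Adjust to the real response shape.
function toOverlayStyle(boundingBox, pageSize, viewerSize) {
  // Scale factors from page units to rendered viewer pixels.
  const scaleX = viewerSize.width / pageSize.width;
  const scaleY = viewerSize.height / pageSize.height;

  return {
    left: boundingBox.left * scaleX,
    top: boundingBox.top * scaleY,
    width: (boundingBox.right - boundingBox.left) * scaleX,
    height: (boundingBox.bottom - boundingBox.top) * scaleY,
  };
}

// Illustrative values: a 612x792 page rendered at twice its size.
const overlayStyle = toOverlayStyle(
  { left: 72, top: 100, right: 300, bottom: 150 },
  { width: 612, height: 792 },
  { width: 1224, height: 1584 }
);

console.log(overlayStyle);
```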

By leveraging both the formatted content and the structured block information, you can build powerful document processing workflows that combine the convenience of formatted text with the precision of block-level access.

Error Response Format

When an error occurs, the API returns a structured error response with the following fields:

code (string): A specific error code that identifies the type of error.

message (string): A human-readable description of the error.

requestId (string): A unique identifier for the request, useful for troubleshooting.

retryable (boolean): Indicates whether retrying the request might succeed.
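
Before reading these fields it can help to confirm that a response body actually has this shape, since proxies or network failures may return something else. The following is a minimal sketch of such a check; isParseErrorBody and describeError are hypothetical helper names, and the requestId value is a placeholder.

```javascript
// Hypothetical guard: true only if the body has the four documented fields
// with the documented types.
function isParseErrorBody(body) {
  return (
    body !== null &&
    typeof body === "object" &&
    typeof body.code === "string" &&
    typeof body.message === "string" &&
    typeof body.requestId === "string" &&
    typeof body.retryable === "boolean"
  );
}

// Formats an error body into a single log line.
function describeError(body) {
  if (!isParseErrorBody(body)) {
    return "Unrecognized error response";
  }
  const retryHint = body.retryable ? "retryable" : "not retryable";
  return `${body.code}: ${body.message} (request ${body.requestId}, ${retryHint})`;
}

console.log(
  describeError({
    code: "CORRUPT_FILE",
    message: "The file is corrupt and cannot be parsed.",
    requestId: "req_placeholder", // placeholder, not a real request ID
    retryable: false,
  })
);
```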

Custom Error Codes

We provide custom error codes so that your system can tell what happened when a request fails. The response body also includes a retryable field (true or false), and the table below gives a breakdown by code. Most errors are client errors related to the file provided for parsing and are not retryable.

| Error Code | Description | Retryable |
| --- | --- | --- |
| INVALID_CONFIG_OPTIONS | Invalid combination of options in the incoming config. | |
| UNABLE_TO_DOWNLOAD_FILE | The system could not download the file from the provided URL. This usually means the presigned URL has expired or is malformed. | |
| FILE_TYPE_NOT_SUPPORTED | The file type is not supported for parsing. | |
| FILE_SIZE_TOO_LARGE | The file exceeds the maximum allowed size. | |
| CORRUPT_FILE | The file is corrupt and cannot be parsed. | |
| OCR_ERROR | An error occurred in the OCR system. This is a rare error code and would indicate downtime, so requests can be retried. We suggest applying a retry with backoff for this error. | Yes |
| PASSWORD_PROTECTED_FILE | The file is password protected and cannot be parsed. | |
| FAILED_TO_CONVERT_TO_PDF | The system could not convert the file to PDF format. | |
| FAILED_TO_GENERATE_TARGET_FORMAT | The system could not generate the requested target format. | |
| INTERNAL_ERROR | An unexpected internal error occurred. We suggest applying a retry with backoff for this error, as it is likely the result of an outage. | Yes |
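
The retry-with-backoff suggestion for OCR_ERROR and INTERNAL_ERROR could be sketched as follows. This is one possible approach, not a prescribed implementation: the helper names (shouldRetry, backoffDelayMs, withRetry), the attempt limit, and the delay values are all assumptions chosen for illustration. The wrapper expects requestFn to throw an axios-style error whose response.data holds the error body.

```javascript
// Codes whose descriptions suggest retrying.
const RETRYABLE_CODES = new Set(["OCR_ERROR", "INTERNAL_ERROR"]);

// Prefer the retryable flag from the response; fall back to the code list.
function shouldRetry(errorBody) {
  if (typeof errorBody?.retryable === "boolean") {
    return errorBody.retryable;
  }
  return RETRYABLE_CODES.has(errorBody?.code);
}

// Exponential backoff: baseMs, 2 * baseMs, 4 * baseMs, ... capped at maxMs.
function backoffDelayMs(attempt, baseMs = 500, maxMs = 8000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Calls requestFn until it succeeds, the error is not retryable,
// or maxAttempts calls have been made.
async function withRetry(requestFn, { maxAttempts = 3, baseMs = 500, wait = sleep } = {}) {
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await requestFn();
    } catch (error) {
      const isLastAttempt = attempt >= maxAttempts - 1;
      if (isLastAttempt || !shouldRetry(error.response?.data)) {
        throw error;
      }
      await wait(backoffDelayMs(attempt, baseMs));
    }
  }
}
```

A call might look like withRetry(() => axios.post(url, body, options)). Adding random jitter to the delay is a common refinement when many clients retry at once.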

HTTP Error Codes

The HTTP status codes below correspond to the different types of failures. We generally recommend relying on our custom error codes for programmatic handling.

400 Bad Request

Returned when:

  • Required fields are missing (e.g., file)
  • Neither fileUrl nor fileBase64 is provided in the file object
  • The provided fileUrl is invalid
  • The provided fileBase64 is invalid
  • The config contains invalid values (e.g., unsupported target format or chunking strategy)
  • The file type is not supported
  • The file size is too large

401 Unauthorized

Returned when:

  • The API token is missing
  • The API token is invalid

403 Forbidden

Returned when:

  • The authenticated workspace doesn’t have permission to use the parse functionality
  • The API token doesn’t have sufficient permissions

422 Unprocessable Entity

Returned when:

  • The file is corrupt and cannot be parsed
  • The file is password protected
  • The file could not be converted to PDF
  • The system failed to generate the target format

500 Internal Server Error

Returned when:

  • An OCR error occurs
  • A chunking error occurs
  • Any other unexpected error occurs during parsing
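
One way to combine the two signals is to branch on the custom error code when the body provides one and fall back to the HTTP status only when it does not. The sketch below illustrates that ordering for an axios-style error object; classifyParseError and its category labels (AUTH, CLIENT, SERVER, UNKNOWN) are hypothetical names, not part of the API.

```javascript
// Hypothetical classifier: prefers the custom error code, then the status.
function classifyParseError(error) {
  const status = error.response?.status;
  const code = error.response?.data?.code;

  if (code) {
    return { source: "code", value: code };
  }
  if (status === 401 || status === 403) {
    return { source: "status", value: "AUTH" };
  }
  if (status === 400 || status === 422) {
    return { source: "status", value: "CLIENT" };
  }
  if (status >= 500) {
    return { source: "status", value: "SERVER" };
  }
  // No response at all (e.g. a network failure) or an unexpected status.
  return { source: "none", value: "UNKNOWN" };
}

console.log(
  classifyParseError({ response: { status: 422, data: { code: "CORRUPT_FILE" } } })
);
console.log(classifyParseError({ response: { status: 401, data: {} } }));
```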

Handling Errors

Here is an example of how to handle errors from the Parse API:

const axios = require("axios");

const parseDocument = async () => {
  try {
    const response = await axios.post(
      "https://api-prod.extend.app/parse",
      {
        file: {
          fileName: "example.pdf",
          fileUrl: "https://example.com/documents/example.pdf",
        },
        config: {
          target: "markdown",
        },
      },
      {
        headers: {
          Authorization: "Bearer <API_TOKEN>",
          "Content-Type": "application/json",
        },
      }
    );

    console.log("Document parsed successfully:", response.data);
    return response.data;
  } catch (error) {
    if (error.response) {
      // Fall back to an empty object in case the response has no parsed body
      const { code, message, requestId, retryable } = error.response.data ?? {};

      // Handle specific error codes
      switch (code) {
        case "FILE_TYPE_NOT_SUPPORTED":
          console.error("Unsupported file type. Please use a supported format.");
          break;
        case "PASSWORD_PROTECTED_FILE":
          console.error("The file is password protected. Please provide an unprotected file.");
          break;
        case "CORRUPT_FILE":
          console.error("The file is corrupt and cannot be processed.");
          break;
        case "FILE_SIZE_TOO_LARGE":
          console.error("The file is too large. Please reduce the file size.");
          break;
        default:
          console.error(`Error (${code}): ${message}`);
      }

      // Log request ID for troubleshooting
      console.error(`Request ID: ${requestId}`);

      // Potentially retry if the error is retryable
      if (retryable) {
        console.log("This error is retryable. Consider retrying the request.");
      }
    } else {
      // No response was received (e.g. DNS failure or timeout)
      console.error("Network error:", error.message);
    }

    throw error;
  }
};