Parse File
The Parse endpoint allows you to convert documents into structured, machine-readable formats with fine-grained control over the parsing process. This endpoint is ideal for extracting cleaned document content to be used as context for downstream processing, e.g. RAG pipelines, custom ingestion pipelines, embeddings classification, etc.
For a deeper guide on how to use the output of this endpoint, jump to Using Parsed Output.
Using Parsed Output
The Parse API returns document content in a structured format that provides both high-level formatted content and detailed block-level information. Understanding how to work with this output will help you get the most value from the parsing service.
Working with Chunks
Each chunk (currently only page-level chunks are supported) contains two key properties:
- content: A fully formatted representation of the entire chunk in the target format (e.g., markdown). This is ready to use as-is if you need the complete formatted content of a page.
- blocks: An array of individual content blocks that make up the chunk, each with its own formatting, position information, and metadata.
When to use chunk.content vs. chunk.blocks
- Use chunk.content when:
  - You need the complete, properly formatted content of a page, with blocks already placed in logical order (e.g., markdown sections grouped and content arranged spatially)
  - You want to display or process the document content as a whole (and can just combine all chunk.content values)
  - You’re integrating with systems that expect formatted text (e.g., markdown processors)
- Use chunk.blocks when:
  - You need to work with specific elements of the document (e.g., only tables or figures)
  - You need spatial information about where content appears on the page, perhaps to build citation systems
  - You’re building a UI that shows or highlights specific document elements
Example: Extracting specific content types
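A minimal sketch of filtering blocks by type. It assumes the parsed output is a list of chunks, each with a blocks array, and that every block carries a type field (such as "table" or "figure") and a content field. Those field names and values, the extract_blocks helper, and the sample data are illustrative assumptions, so check them against the actual response schema.

```python
from typing import Any, Dict, Iterable, List


def extract_blocks(
    chunks: Iterable[Dict[str, Any]], wanted_types: Iterable[str]
) -> List[Dict[str, Any]]:
    """Collect blocks whose (assumed) `type` field is in `wanted_types`."""
    wanted = set(wanted_types)
    selected = []
    for chunk_index, chunk in enumerate(chunks):
        for block in chunk.get("blocks", []):
            if block.get("type") in wanted:
                # Keep the chunk index so each block can be traced back to its page.
                selected.append({"chunkIndex": chunk_index, **block})
    return selected


# Hypothetical sample shaped like the chunk/block structure described above.
sample_chunks = [
    {
        "content": "# Report\n\n| a | b |",
        "blocks": [
            {"type": "heading", "content": "# Report"},
            {"type": "table", "content": "| a | b |"},
        ],
    },
    {"content": "Figure 1", "blocks": [{"type": "figure", "content": "Figure 1"}]},
]

tables_and_figures = extract_blocks(sample_chunks, ["table", "figure"])
print([block["type"] for block in tables_and_figures])
```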
Example: Reconstructing content with custom formatting
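A minimal sketch of rebuilding a page from its blocks with your own formatting rules. It assumes blocks arrive in reading order and expose type and content fields; the render_chunk helper, the renderer mapping, and the sample chunk are hypothetical and only illustrate the idea.

```python
from typing import Any, Callable, Dict

Block = Dict[str, Any]


def render_chunk(chunk: Dict[str, Any], renderers: Dict[str, Callable[[Block], str]]) -> str:
    """Join block contents, applying a per-type renderer where one is defined."""
    parts = []
    for block in chunk.get("blocks", []):
        # Fall back to the block's own content when no custom renderer exists.
        render = renderers.get(block.get("type"), lambda b: b.get("content", ""))
        text = render(block)
        if text:
            parts.append(text)
    return "\n\n".join(parts)


# Hypothetical formatting rules: plain upper-case headings, labelled tables.
renderers = {
    "heading": lambda b: b["content"].lstrip("# ").upper(),
    "table": lambda b: "[TABLE]\n" + b["content"],
}

sample_chunk = {
    "blocks": [
        {"type": "heading", "content": "# Report"},
        {"type": "text", "content": "Revenue grew."},
        {"type": "table", "content": "| a | b |"},
    ]
}

print(render_chunk(sample_chunk, renderers))
```

Keeping the renderers in a dictionary keyed by block type means unknown block types pass through unchanged instead of being dropped.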
Spatial Information
Each block contains spatial information in the form of a polygon (precise outline) and a simplified boundingBox. This information can be used to:
- Highlight specific content in a document viewer
- Create visual overlays on top of the original document
- Understand the reading order and layout of the document
By leveraging both the formatted content and the structured block information, you can build powerful document processing workflows that combine the convenience of formatted text with the precision of block-level access.
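As an illustration of the spatial information described above, the sketch below turns a polygon into a rectangle suitable for a highlight overlay. It assumes a polygon is a list of points with x and y keys, that a bounding box uses left, top, right, and bottom keys, and that the origin is the top-left corner of the page. None of these shapes are confirmed here, and both helper functions are hypothetical.

```python
from typing import Dict, List

Point = Dict[str, float]
Box = Dict[str, float]


def bounding_box_of(polygon: List[Point]) -> Box:
    """Smallest axis-aligned box containing every polygon point."""
    xs = [point["x"] for point in polygon]
    ys = [point["y"] for point in polygon]
    # Assumes a top-left origin, so the smallest y is the top edge.
    return {"left": min(xs), "top": min(ys), "right": max(xs), "bottom": max(ys)}


def to_overlay_rect(box: Box, scale: float) -> Dict[str, float]:
    """Convert a box into x/y/width/height in viewer pixels with a uniform scale."""
    return {
        "x": box["left"] * scale,
        "y": box["top"] * scale,
        "width": (box["right"] - box["left"]) * scale,
        "height": (box["bottom"] - box["top"]) * scale,
    }


# Hypothetical, slightly skewed outline of a block on the page.
polygon = [
    {"x": 10, "y": 20},
    {"x": 110, "y": 22},
    {"x": 108, "y": 60},
    {"x": 9, "y": 58},
]

box = bounding_box_of(polygon)
print(box)
print(to_overlay_rect(box, scale=2.0))
```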
Configuration Options
Page Selection
You can specify a range of pages to process by passing the pageRanges field in the advancedOptions object.
- 1-based, inclusive page numbers
- Ranges can overlap or arrive out of order; the platform merges and sorts them automatically
- Omit the field or pass [] to process the full document (subject to global limits)
- To clear any existing page ranges and process the full document (up to the default page limit), pass []
The default page limit is 300, and the maximum document size is 750 pages.
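The platform merges and sorts ranges for you, so the following is not required. It is a client-side sketch of those semantics that can help you predict which pages a request selects. It assumes each range is an object with start and end keys (an assumption about the field shape), and it also merges directly adjacent ranges, which the platform may or may not do.

```python
from typing import Dict, List

PageRange = Dict[str, int]


def normalize_page_ranges(ranges: List[PageRange]) -> List[PageRange]:
    """Sort 1-based inclusive ranges and merge overlapping or adjacent ones."""
    for page_range in ranges:
        if page_range["start"] < 1 or page_range["end"] < page_range["start"]:
            raise ValueError(f"invalid page range: {page_range}")
    merged: List[List[int]] = []
    for start, end in sorted((r["start"], r["end"]) for r in ranges):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [{"start": start, "end": end} for start, end in merged]


def count_pages(ranges: List[PageRange]) -> int:
    """Number of distinct pages selected by the ranges."""
    return sum(r["end"] - r["start"] + 1 for r in normalize_page_ranges(ranges))


requested = [{"start": 5, "end": 8}, {"start": 1, "end": 3}, {"start": 2, "end": 6}]
advanced_options = {"pageRanges": normalize_page_ranges(requested)}
print(advanced_options)
print(count_pages(requested))
```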
Response Type
You can specify the response type in the responseType query parameter for the Parse File and Get Parser Run endpoints.
- json: Returns parsed outputs in the response body
- url: Returns a presigned URL to the parsed content in the response body
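A small sketch of attaching the responseType query parameter to a request URL. The endpoint URL shown is a placeholder rather than the real one, and with_response_type is a hypothetical helper.

```python
from urllib.parse import urlencode

ALLOWED_RESPONSE_TYPES = ("json", "url")


def with_response_type(endpoint_url: str, response_type: str) -> str:
    """Append the responseType query parameter after validating its value."""
    if response_type not in ALLOWED_RESPONSE_TYPES:
        raise ValueError(f"responseType must be one of {ALLOWED_RESPONSE_TYPES}")
    # Respect any query string already present on the URL.
    separator = "&" if "?" in endpoint_url else "?"
    return f"{endpoint_url}{separator}{urlencode({'responseType': response_type})}"


# Placeholder URL for illustration only.
print(with_response_type("https://api.example.com/parse", "url"))
```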
Error Response Format
When an error occurs, the API returns a structured error response with the following fields:
- code (string): A specific error code that identifies the type of error.
- message (string): A human-readable description of the error.
- requestId (string): A unique identifier for the request, useful for troubleshooting.
- retryable (boolean): Indicates whether retrying the request might succeed.
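A minimal sketch of reading these fields into a typed object. It assumes the four fields are top-level keys of a JSON response body; the ParseApiError class, the parse_error_body helper, and the sample error code are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ParseApiError:
    code: str
    message: str
    request_id: str
    retryable: bool


def parse_error_body(body: str) -> ParseApiError:
    """Build a ParseApiError from a JSON error body, tolerating missing fields."""
    data = json.loads(body)
    return ParseApiError(
        code=str(data.get("code", "")),
        message=str(data.get("message", "")),
        request_id=str(data.get("requestId", "")),
        # Treat a missing flag as non-retryable, the conservative default.
        retryable=bool(data.get("retryable", False)),
    )


# Hypothetical error body; the code value is a placeholder, not a real error code.
sample_body = json.dumps(
    {
        "code": "EXAMPLE_ERROR_CODE",
        "message": "The file could not be processed.",
        "requestId": "req_123",
        "retryable": False,
    }
)

error = parse_error_body(sample_body)
print(error.code, error.request_id, error.retryable)
```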
Custom Error Codes
We provide custom error codes to make it easier for your system to know what happened in case of a failure. There will also be a retryable=true|false field in the response body, but you can also find a breakdown below. Most errors are not retryable and are client errors related to the file provided for parsing.
HTTP error codes
The HTTP status codes below correspond to different types of failures. We generally recommend relying on our custom error codes for programmatic handling.
400 Bad Request
Returned when:
- Required fields are missing (e.g., file)
- Neither fileUrl nor fileBase64 is provided in the file object
- The provided fileUrl is invalid
- The provided fileBase64 is invalid
- The config contains invalid values (e.g., unsupported target format or chunking strategy)
- The file type is not supported
- The file size is too large
401 Unauthorized
Returned when:
- The API token is missing
- The API token is invalid
403 Forbidden
Returned when:
- The authenticated workspace doesn’t have permission to use the parse functionality
- The API token doesn’t have sufficient permissions
422 Unprocessable Entity
Returned when:
- The file is corrupt and cannot be parsed
- The file is password protected
- The file could not be converted to PDF
- The system failed to generate the target format
500 Internal Server Error
Returned when:
- An OCR error occurs
- A chunking error occurs
- Any other unexpected error occurs during parsing
Handling Errors
Here are examples of how to handle errors from the Parse API:
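The sketch below shows one possible approach: retry only when the response body marks the error as retryable, and surface everything else immediately. The send_request callable stands in for whatever HTTP client you use and is assumed to return a status code and a decoded JSON body. ParseRequestError, parse_with_retries, and the sample error code are hypothetical names, not part of the API.

```python
import time
from typing import Any, Callable, Dict, Tuple


class ParseRequestError(Exception):
    """Raised when the Parse API returns an error that will not be retried."""

    def __init__(self, status: int, body: Dict[str, Any]) -> None:
        super().__init__(f"{status} {body.get('code')}: {body.get('message')}")
        self.status = status
        self.body = body
        self.retryable = bool(body.get("retryable", False))


def parse_with_retries(
    send_request: Callable[[], Tuple[int, Dict[str, Any]]],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call send_request, retrying with exponential backoff on retryable errors."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, body = send_request()
        if status < 400:
            return body
        error = ParseRequestError(status, body)
        # Give up on client errors and once the attempt budget is spent.
        if not error.retryable or attempt == max_attempts:
            raise error
        sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


# Simulated responses: one retryable failure followed by a success.
responses = iter(
    [
        (500, {"code": "EXAMPLE_RETRYABLE", "message": "temporary failure",
               "requestId": "req_1", "retryable": True}),
        (200, {"chunks": []}),
    ]
)
result = parse_with_retries(lambda: next(responses), sleep=lambda seconds: None)
print(result)
```

Logging the requestId from the error body alongside the code makes failed requests easier to troubleshoot.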