Configuring an Extractor
Configuring Extraction
From the Extend home page:
- Navigate to the Studio by clicking on the "Studio" tab in the left sidebar.
- Click the "Create new" button.

- Select "+ Extractor" to create a new Extractor processor.
- You will be prompted to give your new Extractor a name. Enter a descriptive name and click "Create".

- After naming, you will be redirected to the "Build" tab for your new Extractor, ready to define its schema.
Note that you can also create a processor by importing existing configurations:
- Import Processor: Directly import the configuration for a processor from a configuration file.
- Import JSON Schema (for Extractors): You can import settings from a JSON Schema file. This is useful if you have a pre-defined schema.
Builder
Once you have created an Extraction processor, navigate to the "Build" tab.

Configuring Properties
Defining your extraction schema involves adding and configuring properties. A "property" represents a piece of data you want the AI to find and extract (e.g., "invoice_number", "customer_name", "total_amount").
To add and configure a property:
- Add Property: Click the "+" button in the schema builder section.
- Name: Assign a meaningful name for the property. This name is critical as it's what the AI model uses to understand what to look for. Choose names that are semantically descriptive of the data.
- Description: Write a clear and concise description. This tells the AI how to identify and extract the information from a typical document. Good descriptions are vital for accurate extraction.
- Property Type: Select the appropriate Property Type that matches the data you expect (e.g., String, Number, Date). See "Property Types" below for details.
- (Optional) Property Key for Model: By default, the "Name" you provide is sent to the AI model. If you need to use a different internal identifier for your property key but want to send a more descriptive name to the model, you can specify this using the "Property Name" field in the advanced settings for a property.
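As a rough illustration of how a name, description, and type come together, the sketch below builds a small schema as a Python dictionary using generic JSON Schema keywords ("type", "properties", "description"). The helper `make_property` and the example descriptions are hypothetical, and the JSON that the Schema Builder actually generates may differ in its details.

```python
import json


def make_property(description: str, json_type: str) -> dict:
    """Build one property entry (hypothetical helper, generic JSON Schema keys)."""
    return {"type": json_type, "description": description}


# The property names double as the keys the model sees, so they are descriptive.
schema = {
    "type": "object",
    "properties": {
        "invoice_number": make_property(
            "The unique identifier printed on the invoice, "
            "often labeled 'Invoice #' or 'Invoice No.'.",
            "string",
        ),
        "customer_name": make_property(
            "The full name of the person or company being billed.",
            "string",
        ),
    },
}

print(json.dumps(schema, indent=2))
```

A vague name such as "field_1" with an empty description would give the model far less to work with than the entries above, which is why both fields deserve attention.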

Property Types
The following property types are supported for the JSON Schema configuration:
Basic Types
These are the fundamental data types for your properties:
- String: Used for any sequence of text. Example: extracting a person's name or a product description.
- Number: Used for numerical values, including decimals. Example: extracting an item quantity or a subtotal.
- Boolean: Used for true/false values. Example: indicating if a checkbox is marked or if an item is in stock.
- Integer: Used for whole number values (no decimals). Example: extracting the number of pages in a document.
- Enum: Used when a field must have one of a predefined set of specific string values. Example: a "status" field that can only be "Pending", "Approved", or "Rejected". You will define the allowed values when configuring this type.
- Object: Used to group several related properties together into a nested structure. Example: an "address" object containing "street", "city", and "zip_code" properties.
- Array: Used for a list of items, where each item can be of a specified type (including Objects). Example: a list of "line_items" in an invoice, where each line_item is an Object containing "description", "quantity", and "price".
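To make the basic types concrete, here is a minimal sketch of a schema that uses each of them, together with a small checker for that subset of JSON Schema. The function `matches` and the sample values are illustrative assumptions, not part of the product, and a real validator would cover far more of the JSON Schema specification.

```python
BASIC_TYPE_SCHEMA = {
    "type": "object",
    "properties": {
        "customer_name": {"type": "string"},
        "subtotal": {"type": "number"},
        "in_stock": {"type": "boolean"},
        "page_count": {"type": "integer"},
        "status": {"type": "string", "enum": ["Pending", "Approved", "Rejected"]},
        "address": {
            "type": "object",
            "properties": {
                "street": {"type": "string"},
                "city": {"type": "string"},
                "zip_code": {"type": "string"},
            },
        },
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "number"},
                    "price": {"type": "number"},
                },
            },
        },
    },
}


def matches(value, schema) -> bool:
    """Check a value against the small JSON Schema subset used above."""
    if "enum" in schema and value not in schema["enum"]:
        return False
    kind = schema["type"]
    if kind == "string":
        return isinstance(value, str)
    if kind == "boolean":
        return isinstance(value, bool)
    if kind == "integer":  # bool is a subclass of int in Python, so exclude it
        return isinstance(value, int) and not isinstance(value, bool)
    if kind == "number":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if kind == "array":
        return isinstance(value, list) and all(matches(v, schema["items"]) for v in value)
    if kind == "object":
        props = schema.get("properties", {})
        return isinstance(value, dict) and all(
            key in props and matches(val, props[key]) for key, val in value.items()
        )
    return False


sample = {
    "customer_name": "Ada Lovelace",
    "subtotal": 120.5,
    "in_stock": True,
    "page_count": 3,
    "status": "Approved",
    "address": {"street": "1 Main St", "city": "Springfield", "zip_code": "12345"},
    "line_items": [{"description": "Widget", "quantity": 2, "price": 60.25}],
}
print(matches(sample, BASIC_TYPE_SCHEMA))
```

Note the distinction the checker draws between Integer and Number: a value such as 2.5 is a valid Number but not a valid Integer.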
Custom Types
Custom types are extensions of the basic types, often Objects, with added validation, specific formatting expectations, and specialized processing logic tailored for common structured data.
- Date:
  - Type: String
  - Description: Represents a date. The AI will attempt to identify and extract dates, formatting them into the ISO 8601 standard (YYYY-MM-DD).
  - Example: extracting a "document_date" or "date_of_birth".
- Currency:
  - Type: Object
  - Description: Represents a monetary value along with its currency code.
  - Structure:
    - amount (Number): The numerical value of the currency.
    - iso_4217_currency_code (String): The three-letter ISO 4217 currency code (e.g., "USD", "EUR").
  - Example: extracting a "total_amount" from an invoice.
- Signature:
  - Type: Object
  - Description: Captures details related to a signature found on a document.
  - Structure:
    - is_signed (Boolean): Indicates whether a signature is present.
    - printed_name (String, optional): The printed name associated with the signature.
    - signature_date (Date, optional): The date accompanying the signature, formatted as YYYY-MM-DD.
    - title_or_role (String, optional): The job title or role of the signatory.
  - Example: extracting details from a signature block on a contract.
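If you consume extracted values downstream, it can help to check that they have the shapes described above. The following is a minimal sketch under stated assumptions: the field names come from the structures listed here, but the function names are hypothetical, and treating an optional field as either absent or null is an assumption about the output rather than documented behavior.

```python
import re
from datetime import date


def is_iso_date(value) -> bool:
    """True if value is a real calendar date written as YYYY-MM-DD."""
    if not isinstance(value, str) or not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return False
    try:
        date.fromisoformat(value)
        return True
    except ValueError:  # e.g. month 13 or February 30
        return False


def is_currency(value) -> bool:
    """True if value has a numeric amount and a three-letter uppercase code."""
    if not isinstance(value, dict):
        return False
    amount = value.get("amount")
    code = value.get("iso_4217_currency_code")
    return (
        isinstance(amount, (int, float))
        and not isinstance(amount, bool)
        and isinstance(code, str)
        and re.fullmatch(r"[A-Z]{3}", code) is not None
    )


def is_signature(value) -> bool:
    """True if is_signed is a boolean and optional fields are well-formed."""
    if not isinstance(value, dict) or not isinstance(value.get("is_signed"), bool):
        return False
    for key in ("printed_name", "title_or_role"):
        # Assumption: an optional field may be missing or null.
        if value.get(key) is not None and not isinstance(value[key], str):
            return False
    signed_on = value.get("signature_date")
    return signed_on is None or is_iso_date(signed_on)


print(is_currency({"amount": 1250.5, "iso_4217_currency_code": "USD"}))
print(is_signature({"is_signed": True, "signature_date": "2024-03-15"}))
```

The currency check only verifies the format of the code (three uppercase letters); it does not confirm that the code is actually assigned in ISO 4217.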
Once you've configured your schema using the Schema Builder, you can view the complete JSON Schema representation by clicking the "JSON" toggle. This can be useful for understanding the underlying structure or for sharing the schema.

Using the Run tab
While the "Build" tab is excellent for initial setup and iterative changes to your processor's configuration, the "Run" tab is designed for testing your processor more extensively. Effective testing is key to refining your property names, descriptions, and types.
From this tab you can:
- Upload and run multiple files in a batch to see how your configuration performs across a diverse set of documents. Supported file types are listed separately in the documentation.
- Select a specific published version of your processor to test, or use the current saved draft.
- Run an existing Evaluation Set if you have one, to get structured feedback on performance.

After running a batch, the results page will provide insights:

Key actions on the results page include:
- Assessing overall extraction coverage and average confidence scores.
- Examining individual files to see specific extracted values and their confidence, helping you pinpoint areas in your configuration that may need adjustment.
- Optionally, correcting or editing results and then saving the batch as a new Evaluation Set. This is particularly useful once your schema (the set of properties and their types) has stabilized.
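For intuition about what coverage and average confidence summarize, the sketch below computes both over a batch. Everything about the data layout is an assumption made for illustration: each file's result is modeled as a dictionary mapping a property name to a "value" and a "confidence", coverage is taken to mean the share of expected fields that came back non-empty, and the function name `summarize` is hypothetical. The results page may define these metrics differently.

```python
def summarize(results, fields):
    """Return (coverage, average confidence) for a batch of per-file results."""
    total = len(results) * len(fields)
    filled = 0
    confidences = []
    for file_result in results:
        for name in fields:
            entry = file_result.get(name)
            # Count a field as covered only when a non-empty value was extracted.
            if entry is not None and entry.get("value") not in (None, ""):
                filled += 1
                confidences.append(entry["confidence"])
    coverage = filled / total if total else 0.0
    average = sum(confidences) / len(confidences) if confidences else 0.0
    return coverage, average


batch = [
    {
        "invoice_number": {"value": "INV-1", "confidence": 0.9},
        "total_amount": {"value": None, "confidence": 0.0},
    },
    {
        "invoice_number": {"value": "INV-2", "confidence": 0.7},
        "total_amount": {"value": 10.0, "confidence": 0.8},
    },
]
coverage, average = summarize(batch, ["invoice_number", "total_amount"])
print(f"coverage={coverage:.2f} average_confidence={average:.2f}")
```

In this toy batch, three of the four expected fields were extracted, so a low coverage number would point you to the property whose name or description needs work.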
Note on Evaluation Sets and Configuration Iteration: It's generally best to finalize the set of properties you are extracting before heavily investing in creating detailed Evaluation Sets. If you add or remove properties (a schema change), your existing Evaluation Sets might show misleading accuracy or coverage metrics until they are updated to match the new schema. Iterating on names and descriptions with draft versions and smaller test batches is often more efficient in the early stages.
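One way to picture the schema-change problem is to compare the property names in the current schema with those an Evaluation Set was built against. This is a hypothetical sketch: the function `schema_drift` and the idea of listing property names by hand are illustrative assumptions, not a feature of the product.

```python
def schema_drift(schema_fields, eval_set_fields):
    """Report properties that exist on only one side of the comparison."""
    schema_names = set(schema_fields)
    eval_names = set(eval_set_fields)
    return {
        # In the schema now, but the Evaluation Set has no expected value for them.
        "missing_from_eval_set": sorted(schema_names - eval_names),
        # Still in the Evaluation Set, but no longer extracted by the schema.
        "stale_in_eval_set": sorted(eval_names - schema_names),
    }


drift = schema_drift(
    schema_fields=["invoice_number", "total_amount", "due_date"],
    eval_set_fields=["invoice_number", "total_amount", "customer_name"],
)
print(drift)
```

Either non-empty list is a sign that metrics from that Evaluation Set should be read with caution until it is updated.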
Next Steps
Once you have iteratively configured and tested your processor and are satisfied with its performance, you'll want to publish it. Publishing makes your processor version available for use in live Workflows.
See the Publishing Processors page for detailed information on how to publish and manage processor versions.

