Extraction Block

The Extraction block serves as the eyes of your system, pulling specific information from documents using natural language instructions.

Understanding the Interface

The Extraction block consists of three simple components:

Prompt field: Where you write natural language instructions telling the system what to extract
Run Test button: Click this to test your extraction on the current document
Result field: Shows exactly what the system extracted based on your prompt

The process is straightforward: Write what you want → Test it → See what you get. If the result isn't quite right, refine your prompt and test again.

When Should You Write Custom Prompts?

You don't need an extraction prompt for every field in your documents; in most cases the field name alone is enough for accurate extraction. Custom prompts, which are instructions written in plain English, are worth adding only for:

Critical fields that drive your workflow (invoice totals, vendor names, payment terms)
Problem fields where automatic extraction isn't accurate enough
Complex fields that appear in varying formats across documents

Most dates, simple numbers, and clearly labeled values work well without custom prompts. Start with essentials, test your workflow, then add more only where needed.

Example Scenario 1: Extracting Invoice Total

Consider a typical accounts payable scenario. Your company receives invoices from hundreds of vendors in various formats. Some list the total as "Amount Due," others as "Total Payable," and some bury it in dense paragraphs of terms and conditions. The Extraction block handles this variation intelligently.

To extract an invoice total, you would enter a prompt like:
"Extract the total amount due from this invoice."

Test the prompt on several invoices with different layouts to confirm that it extracts the right value consistently.
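One way to keep track of testing across samples is to note what the Result field showed for each document and compare it with the amount you read off the invoice yourself. The Python sketch below is illustrative only: the file names and amounts are made up, and the results are assumed to have been copied by hand from the Result field, since this guide does not describe any programmatic interface.

```python
from decimal import Decimal, InvalidOperation
import re


def parse_amount(text):
    """Convert a string such as '$1,234.50' to Decimal; return None if no number is found."""
    cleaned = re.sub(r"[^0-9.\-]", "", text)
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def find_mismatches(samples):
    """Return the names of samples whose extracted amount is missing or differs from the expected one."""
    mismatches = []
    for name, (extracted, expected) in samples.items():
        got = parse_amount(extracted)
        if got is None or got != parse_amount(expected):
            mismatches.append(name)
    return mismatches


# Hypothetical results copied from the Result field, paired with hand-checked values.
samples = {
    "vendor_a.pdf": ("$1,234.50", "1234.50"),
    "vendor_b.pdf": ("Total Payable: 980.00", "980.00"),
    "vendor_c.pdf": ("45.00", "450.00"),
}

print(find_mismatches(samples))
```

Any sample that appears in the list is a candidate for prompt refinement; an empty list suggests the prompt holds up across the layouts you tried.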

Example Scenario 2: Extracting the Description of Services Column from a Table

The same extraction block that pulls individual fields can extract entire columns from tables in your documents. This is essential when dealing with invoices containing multiple line items, each with descriptions, quantities, prices, etc.

When to Use Column Extraction
Use column extraction when you need to capture all entries from a specific table column - like extracting all product descriptions, all quantities, or all prices from an invoice.

How Column Extraction Works
Click on the Extraction block. Column extraction uses the same interface as field extraction. In many cases, the system automatically extracts the entire column based on the column name alone - no custom prompt needed.

However, just like with fields, you can write custom prompts when you need more control. This becomes useful when documents contain multiple tables or when you need to exclude certain entries.

📝 Best Practices for Writing High-Accuracy Prompts

Be specific with your instructions:
❌ Vague: "Extract total"
✅ Precise: "Extract the subtotal before tax from the summary section"

1. Start Simple, Then Add Specificity

Most simple fields like totals or dates work well with basic prompts. However, some fields require refinement due to legitimate complexity.

Let's walk through a realistic example:

Extracting Tax ID Numbers (a field that varies significantly across vendors)

Initial prompt: "Extract tax ID number"
- Testing reveals: Invoices label and format tax IDs differently (EIN: 12-3456789, Tax ID: 123456789, Federal ID Number: 12-3456789), so the system sometimes misses the field.
First refinement: "Extract tax identification number, which may be labeled as Tax ID, EIN, Federal ID, or TIN"
- Testing reveals: Now it finds the right field, but includes the label prefix like "EIN:" or "Tax ID:"
Final refinement: "Extract only the numeric tax ID value after labels like 'Tax ID:', 'EIN:', 'Federal ID:', or 'TIN:'. Return numbers and hyphens only, no label text"

Result: Clean, consistent tax ID values (for example, 12-3456789) ready for database matching

This example shows how refinement handles real document variations, not system limitations. Simple fields like "Extract invoice total" or "Extract invoice date" typically work on the first try. Complex fields benefit from iteration to handle the natural variety in business documents.
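If you want to spot-check results from the final prompt outside the tool, a small script can confirm that each value contains only digits and hyphens. This is a minimal sketch, not part of the Extraction block: the nine-digit rule is an assumption based on the EIN-style examples above, and the function name is hypothetical.

```python
import re

# Assumption: a clean value contains only digits and hyphens and has exactly
# nine digits in total, matching the examples 12-3456789 and 123456789.
DIGITS_AND_HYPHENS = re.compile(r"[0-9-]+")


def is_clean_tax_id(value):
    """Return True if the value has no label text and exactly nine digits."""
    if not DIGITS_AND_HYPHENS.fullmatch(value):
        return False
    return sum(ch.isdigit() for ch in value) == 9


# Values as they might appear in the Result field at each stage of refinement.
for result in ["12-3456789", "123456789", "EIN: 12-3456789", "Tax ID"]:
    print(result, "->", is_clean_tax_id(result))
```

Values that fail the check, such as one that still carries the "EIN:" prefix, indicate that the prompt needs another round of refinement.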

2. Specify Location When Possible

Documents often repeat similar labels. Guide the extraction by mentioning where to look:

"Extract vendor name from the top left section"
"Find delivery date in the middle section near shipping details"
"Get total amount from the bottom summary area"

3. Handle Multiple Occurrences

When a field appears multiple times, provide distinguishing context:

Problem: Document has three different "Date" fields
Solution: Write a prompt such as "Extract the date labeled 'INVOICE DATE' near the document header, not the shipping date or due date"

4. Use Exclusion Criteria

When different types of data could be confused with what you want, explicitly exclude them:

Problem: Extracting "contact phone" might grab fax numbers, mobile numbers, and office phones
Solution: "Extract the main contact phone number, excluding any fax numbers or alternate phone lines"

5. Use Previous Mistakes to Refine the Prompt

Example of iterative correction:

First attempt: "Extract customer reference number"

Result: System extracts "PO-2024-1234" (but this is actually the internal PO number)

Refined prompt: "Extract customer reference number - this is NOT the PO number 'PO-2024-1234' but rather the customer's own reference like 'CR-5678' or 'CLIENT-REF-001'"

Result: Correctly extracts "CR-5678"

6. Define Output Format for Identification

Most formatting should be handled by Cleaning blocks after extraction. However, mentioning the format in your extraction prompt can help when several similar values exist and format is the distinguishing feature.
Examples where format genuinely helps disambiguate complex cases:

"Extract the internal batch code that follows format YYMMDD-XXX-A1, not the supplier batch codes that use DDMMYY-XXXX" (both are date-based codes in the quality control section, format is the only differentiator)
"Extract the primary SWIFT code in format XXXXUS33XXX (11 characters)" (distinguishes from 8-character SWIFT codes of intermediary banks listed in the same wire transfer section)
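The formats named in these prompts can also be written as patterns, which is a convenient way to double-check extracted values afterwards. The sketch below is an illustration under stated assumptions: the guide does not define what the X, A, and 1 placeholders stand for, so the readings in the comments are guesses, and the sample values are invented. The SWIFT pattern follows the standard BIC layout (4-letter bank code, 2-letter country code, 2-character location code, optional 3-character branch code).

```python
import re

# Assumed readings of the placeholders in the example prompts:
#   YYMMDD-XXX-A1 -> six digits, three letters or digits, then one letter and one digit
#   DDMMYY-XXXX   -> six digits, then four letters or digits
INTERNAL_BATCH = re.compile(r"[0-9]{6}-[A-Z0-9]{3}-[A-Z][0-9]")
SUPPLIER_BATCH = re.compile(r"[0-9]{6}-[A-Z0-9]{4}")

# 11-character SWIFT/BIC code: bank (4 letters), country (2 letters),
# location (2 letters or digits), branch (3 letters or digits).
SWIFT_11 = re.compile(r"[A-Z]{4}[A-Z]{2}[A-Z0-9]{2}[A-Z0-9]{3}")


def classify_batch_code(value):
    """Label a code as 'internal', 'supplier', or 'unknown' by its shape."""
    if INTERNAL_BATCH.fullmatch(value):
        return "internal"
    if SUPPLIER_BATCH.fullmatch(value):
        return "supplier"
    return "unknown"


def is_primary_swift(value):
    """Return True only for 11-character codes; 8-character codes do not match."""
    return bool(SWIFT_11.fullmatch(value))


# Invented sample values for illustration.
print(classify_batch_code("240315-QCX-B2"))
print(classify_batch_code("150324-7731"))
print(is_primary_swift("ABCDUS33XXX"))
print(is_primary_swift("ABCDUS33"))
```

Note that the six leading digits alone cannot tell YYMMDD from DDMMYY; in this sketch it is the shape of the remainder of the code that separates the two formats, which is also why naming the full format in the prompt is useful.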