Start a new structured extraction job.
Extracts structured data from a PDF document using AI models, guided by a JSON Schema. The document_id and schema_id must be valid UUIDs of a previously uploaded document and a previously created schema.
How It Works:
Supported Models:
- gemini-2.5-flash: Fast, cost-effective model for most use cases (default)
- gemini-2.5-pro: More powerful model for complex extractions
Configuration Options:
Basic Settings:
- model: AI model to use for extraction
- max_num_concurrent_requests: Number of parallel processing requests (1-20)
- validate_response_schema: Whether to validate extracted data against the schema
Advanced Settings:
- per_key_attribution: Enable detailed reasoning and confidence scores for each field
- temperature: Control randomness in AI responses (0.0-2.0)
- seed: Set random seed for reproducible results
- enable_thinking: Show AI reasoning process
- n_max_retries: Maximum retry attempts for failed requests
Streaming Mode (stream=true):
When stream=true, the endpoint returns an SSE stream instead of a job ID. Results are streamed as subtrees complete, enabling real-time progress visibility in the frontend.
The stream is delivered as Server-Sent Events (Content-Type: text/event-stream) with the events start, subtree_result, subtree_error, and agentic_array_progress (the job can also be polled via GET /jobs/{job_id}).
SSE Event Types:
- start: Emitted once with job_id, total_subtrees, and start_time
- subtree_result: Emitted per successful subtree with extracted_result, attributions, and usage_metadata
- subtree_error: Emitted per failed subtree with error_message and error_type
See STREAMING-DESIGN.md for the full protocol specification.
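A client consuming the stream can demultiplex events by type. A minimal parser sketch, assuming a standard SSE body with `event:` and `data:` fields per event; the event names come from the list above, while the helper itself is illustrative:

```python
import json

def parse_sse_events(raw: str) -> list:
    """Parse a text/event-stream body into (event_type, data_dict) pairs.

    SSE events are separated by blank lines; each carries an `event:` line
    naming the type and a `data:` line holding a JSON payload.
    """
    events = []
    for block in raw.strip().split("\n\n"):
        event_type, data = None, None
        for line in block.splitlines():
            if line.startswith("event:"):
                event_type = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data = json.loads(line[len("data:"):].strip())
        if event_type is not None:
            events.append((event_type, data))
    return events
```

A `subtree_result` event would then surface as `("subtree_result", {"extracted_result": ..., "attributions": ..., "usage_metadata": ...})`.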
Job Lifecycle:
- pending: Job is queued and waiting to start
- processing: AI is actively extracting data from the document
- completed: Extraction finished successfully with results
- failed: Extraction failed due to an error
- cancelled: Job was cancelled before completion
Monitoring Progress:
Use the GET /jobs/{job_id} endpoint to check job status and progress:
- completion_percentage: Overall progress (0-100)
- current_step: Current processing stage
- fields_processed: Number of schema fields completed
- total_fields: Total number of schema fields to process
Getting Results:
Once a job is completed, use GET /jobs/{job_id}/results to retrieve:
- results: Extracted data conforming to your schema
- metadata: Processing information, costs, and performance metrics
- attributions: (if enabled) Reasoning and confidence scores for each field
Example Request:
{
"document_id": "123e4567-e89b-12d3-a456-426614174000",
"schema_id": "987fcdeb-51a2-43d1-b456-426614174000",
"config": {
"model": "gemini-2.5-flash",
"per_key_attribution": true,
"temperature": 0.1,
"enable_thinking": true
"additional_instructions": "Extract information only about company X. Do not include information about company Y."
}
}
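A request body like the one above can be sanity-checked client-side before being sent. A sketch, assuming the constraints stated in the Configuration Options section (supported models, concurrency 1-20, temperature 0.0-2.0); the helper itself is hypothetical:

```python
import uuid

# Supported model names as documented; extend if the API adds more.
SUPPORTED_MODELS = {"gemini-2.5-flash", "gemini-2.5-pro"}

def validate_extraction_request(body: dict) -> None:
    """Raise ValueError if the request body violates the documented constraints."""
    for key in ("document_id", "schema_id"):
        uuid.UUID(body[key])  # raises ValueError on a malformed UUID
    config = body.get("config", {})
    model = config.get("model", "gemini-2.5-flash")
    if model not in SUPPORTED_MODELS:
        raise ValueError(f"unsupported model: {model}")
    concurrency = config.get("max_num_concurrent_requests", 1)
    if not 1 <= concurrency <= 20:
        raise ValueError("max_num_concurrent_requests must be in 1-20")
    temperature = config.get("temperature", 0.0)
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be in 0.0-2.0")
```

Catching these errors locally avoids a round trip that would end in a 400 or 422 response.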
Example Response:
{
"job_id": "456e7890-e89b-12d3-a456-426614174000",
"status": "pending",
"created_at": "2025-01-11T18:35:00Z",
"estimated_completion": "2025-01-11T18:40:00Z"
}
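With the job_id from a response like the one above, a client typically polls GET /jobs/{job_id} until a terminal status is reached. A sketch, where `fetch_job` stands in for your HTTP client and the status and progress field names follow the Job Lifecycle and Monitoring Progress sections:

```python
import time

# Statuses after which the job will no longer change.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}

def describe_progress(job: dict):
    """Return (is_done, progress_line) for a GET /jobs/{job_id} payload."""
    status = job["status"]
    line = (
        f"{status}: {job.get('completion_percentage', 0)}% "
        f"({job.get('fields_processed', 0)}/{job.get('total_fields', '?')} fields, "
        f"step: {job.get('current_step', 'n/a')})"
    )
    return status in TERMINAL_STATUSES, line

def wait_for_job(fetch_job, job_id: str, poll_seconds: float = 2.0) -> dict:
    """Poll until the job reaches a terminal status, then return the payload."""
    while True:
        job = fetch_job(job_id)
        done, line = describe_progress(job)
        print(line)
        if done:
            return job
        time.sleep(poll_seconds)
```

Once `wait_for_job` returns a payload with status "completed", GET /jobs/{job_id}/results yields the extracted data.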
Error Handling:
Document and schema validation occurs before job creation for both streaming and non-streaming paths:
- 400: Bad Request (invalid document_id or schema_id format)
- 404: Not Found (document or schema not found)
- 422: Unprocessable Entity (invalid schema definition)
- 429: Too Many Requests (user has exceeded the concurrency limit of 5 active jobs)
- 500: Internal Server Error (processing error)
Authentication: Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
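The documented status codes can be mapped to client-side exceptions, with 429 and 500 flagged as retryable. A sketch; the exception class and hint strings are illustrative, only the codes and their meanings come from the list above:

```python
class ExtractionAPIError(Exception):
    def __init__(self, status_code: int, detail: str):
        super().__init__(f"{status_code}: {detail}")
        self.status_code = status_code
        # 429 (concurrency limit) and 500 are transient and worth retrying
        # after a backoff; the 4xx validation errors are not.
        self.retryable = status_code in (429, 500)

ERROR_HINTS = {
    400: "invalid document_id or schema_id format (must be UUIDs)",
    404: "document or schema not found",
    422: "invalid schema definition",
    429: "concurrency limit of 5 active jobs exceeded",
    500: "internal processing error",
}

def raise_for_extraction_status(status_code: int) -> None:
    """Raise ExtractionAPIError for any documented error status."""
    if status_code >= 400:
        raise ExtractionAPIError(status_code, ERROR_HINTS.get(status_code, "unexpected error"))
```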
Request model for starting a structured extraction job.
ID of the document to extract from
ID of the schema to use for extraction
Configuration options for the extraction process
Whether to stream the results as they become available. If true, the response will be a stream of JSON objects.
Successful Response
Response model for extraction job creation.
Unique ID of the extraction job
Current status of the job
Allowed values: pending, processing, retrying, completed, failed, cancelled
Timestamp when the job was created
Estimated completion time
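The response model above can be mirrored client-side as a typed structure. A sketch, assuming string (ISO 8601) timestamps as in the example response; field names and allowed statuses come from this page, the dataclass itself is illustrative:

```python
from dataclasses import dataclass

# Status values documented for the job's `status` field.
ALLOWED_STATUSES = {"pending", "processing", "retrying", "completed", "failed", "cancelled"}

@dataclass
class ExtractionJobResponse:
    job_id: str
    status: str
    created_at: str
    estimated_completion: str

    def __post_init__(self):
        if self.status not in ALLOWED_STATUSES:
            raise ValueError(f"unknown status: {self.status}")
```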