Start Extraction Job
Start a new structured extraction job.
Extracts structured data from a PDF document using AI models, guided by a JSON Schema. Both document_id and schema_id must be valid UUIDs that refer to a previously uploaded document and a previously created schema, respectively.
How It Works:
- Document Processing: The system analyzes the PDF document and splits it into manageable sections
- Schema Analysis: Your JSON schema is parsed and validated for supported features
- Extraction: The pipeline processes each section and extracts data according to your schema
- Validation: Extracted data is validated against your schema to ensure constraints and requirements are met
- Results: Structured data is returned in the format defined by your schema
Supported Models:
- gemini-2.5-flash: Fast, cost-effective model for most use cases (default)
- gemini-2.5-pro: More powerful model for complex extractions
Configuration Options:
Basic Settings:
- model: AI model to use for extraction
- max_num_concurrent_requests: Number of parallel processing requests (1-20)
- validate_response_schema: Whether to validate extracted data against the schema
Advanced Settings:
- per_key_attribution: Enable detailed reasoning and confidence scores for each field
- temperature: Control randomness in AI responses (0.0-2.0)
- seed: Set random seed for reproducible results
- enable_thinking: Show AI reasoning process
- n_max_retries: Maximum retry attempts for failed requests
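The documented ranges can be checked on the client before a job is submitted. The following is a minimal sketch, not part of the API: the `ExtractionConfig` class and its `validate` and `to_payload` methods are hypothetical helpers, and only the option names, the default model, and the numeric ranges come from the lists above.

```python
from dataclasses import asdict, dataclass
from typing import Optional

# Model names as listed under "Supported Models".
SUPPORTED_MODELS = ("gemini-2.5-flash", "gemini-2.5-pro")


@dataclass
class ExtractionConfig:
    """Hypothetical client-side mirror of the documented config options."""

    model: str = "gemini-2.5-flash"  # documented default
    max_num_concurrent_requests: Optional[int] = None  # documented range: 1-20
    validate_response_schema: Optional[bool] = None
    per_key_attribution: Optional[bool] = None
    temperature: Optional[float] = None  # documented range: 0.0-2.0
    seed: Optional[int] = None
    enable_thinking: Optional[bool] = None
    n_max_retries: Optional[int] = None

    def validate(self) -> None:
        """Raise ValueError if a value falls outside the documented limits."""
        if self.model not in SUPPORTED_MODELS:
            raise ValueError(f"unsupported model: {self.model!r}")
        n = self.max_num_concurrent_requests
        if n is not None and not 1 <= n <= 20:
            raise ValueError("max_num_concurrent_requests must be in 1-20")
        t = self.temperature
        if t is not None and not 0.0 <= t <= 2.0:
            raise ValueError("temperature must be in 0.0-2.0")

    def to_payload(self) -> dict:
        """Return the "config" object, omitting options that were left unset."""
        self.validate()
        return {k: v for k, v in asdict(self).items() if v is not None}
```

Leaving unset options out of the payload lets the server apply its own defaults, which this page does not list for most options.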
Streaming Mode (stream=true):
When stream=true, the endpoint returns an SSE stream instead of a job ID. Results are streamed as subtrees complete, enabling real-time progress visibility in the frontend.
- Response: SSE stream (text/event-stream) with events: start, subtree_result, subtree_error, agentic_array_progress
- Job records are still created and results persisted (retrievable via GET /jobs/{job_id})
- Always durable: Uses a Temporal workflow, so the job continues running even if the client disconnects
- Automatic retries: Temporal handles activity failures and retries automatically
- Use case: Interactive frontend UI where users want real-time extraction progress
SSE Event Types:
- start: Emitted once with job_id, total_subtrees, and start_time
- subtree_result: Emitted per successful subtree with extracted_result, attributions, usage_metadata
- subtree_error: Emitted per failed subtree with error_message and error_type
See STREAMING-DESIGN.md for full protocol specification.
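To illustrate how a client might consume these events, here is a minimal sketch of an SSE reader. It assumes the stream uses the standard `event:` and `data:` fields with one JSON object per event, which this page does not confirm; STREAMING-DESIGN.md remains the authoritative protocol description. The function names `iter_sse_events` and `summarize` are hypothetical, and `agentic_array_progress` events are ignored because their payload is not described here.

```python
import json
from typing import Iterable, Iterator, List, Optional, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, dict]]:
    """Yield (event_name, payload) pairs from the text lines of an SSE stream.

    Assumes each event carries JSON in its data field (an assumption).
    """
    event, data = "message", []  # "message" is the SSE default event name
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment / keep-alive line
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].lstrip(" "))


def summarize(
    events: Iterable[Tuple[str, dict]],
) -> Tuple[Optional[str], List[dict], List[Tuple[str, str]]]:
    """Collect the job ID, successful subtree results, and subtree errors."""
    job_id, results, errors = None, [], []
    for name, payload in events:
        if name == "start":
            job_id = payload["job_id"]
        elif name == "subtree_result":
            results.append(payload["extracted_result"])
        elif name == "subtree_error":
            errors.append((payload["error_type"], payload["error_message"]))
    return job_id, results, errors
```

Because job records are persisted even in streaming mode, a client that loses the connection can keep the `job_id` from the `start` event and fall back to GET /jobs/{job_id}.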
Job Lifecycle:
- pending: Job is queued and waiting to start
- processing: AI is actively extracting data from the document
- completed: Extraction finished successfully with results
- failed: Extraction failed due to an error
- cancelled: Job was cancelled before completion
Monitoring Progress:
Use the GET /jobs/{job_id} endpoint to check job status and progress:
- completion_percentage: Overall progress (0-100)
- current_step: Current processing stage
- fields_processed: Number of schema fields completed
- total_fields: Total number of schema fields to process
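A polling loop built on this endpoint might look like the sketch below. It is illustrative only: `wait_for_job` and its parameters are hypothetical, `fetch_status` stands for any callable that performs GET /jobs/{job_id} and returns the parsed JSON, and the code assumes that response includes a `status` field holding one of the lifecycle values listed under "Job Lifecycle".

```python
import time
from typing import Callable, Optional

# Lifecycle states after which the job will not change again.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_job(
    fetch_status: Callable[[], dict],
    poll_interval: float = 2.0,
    timeout: float = 600.0,
    on_progress: Optional[Callable[[dict], None]] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reaches a terminal status, then return its record.

    fetch_status is a hypothetical callable wrapping GET /jobs/{job_id}.
    Raises TimeoutError if no terminal status is seen within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_status()
        if on_progress is not None:
            on_progress(job)  # e.g. report completion_percentage / current_step
        if job["status"] in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish before the timeout")
        sleep(poll_interval)
```

The default interval and timeout are arbitrary choices for the sketch; suitable values depend on document size and on any rate limits that apply to the status endpoint.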
Getting Results:
Once a job is completed, use GET /jobs/{job_id}/results to retrieve:
- results: Extracted data conforming to your schema
- metadata: Processing information, costs, and performance metrics
- attributions: (if enabled) Reasoning and confidence scores for each field
Example Request:
{
  "document_id": "123e4567-e89b-12d3-a456-426614174000",
  "schema_id": "987fcdeb-51a2-43d1-b456-426614174000",
  "config": {
    "model": "gemini-2.5-flash",
    "per_key_attribution": true,
    "temperature": 0.1,
    "enable_thinking": true,
    "additional_instructions": "Extract information only about company X. Do not include information about company Y."
  }
}
Example Response:
{
"job_id": "456e7890-e89b-12d3-a456-426614174000",
"status": "pending",
"created_at": "2025-01-11T18:35:00Z",
"estimated_completion": "2025-01-11T18:40:00Z"
}
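The request and response above could be handled from Python roughly as follows. This is a sketch using only the standard library: `build_start_job_request` is a hypothetical helper, and the endpoint URL is passed in by the caller because this page does not state the base URL or path. The UUID check mirrors the documented requirement that both IDs be valid UUIDs.

```python
import json
import urllib.request
import uuid
from typing import Optional


def build_start_job_request(
    url: str,
    token: str,
    document_id: str,
    schema_id: str,
    config: Optional[dict] = None,
    stream: bool = False,
) -> urllib.request.Request:
    """Build (but do not send) the POST request that starts an extraction job.

    `url` is the full endpoint URL; it is not given on this page, so the
    caller must supply it.
    """
    # Fail early on malformed IDs; uuid.UUID raises ValueError for bad input.
    uuid.UUID(document_id)
    uuid.UUID(schema_id)

    body = {"document_id": document_id, "schema_id": schema_id, "stream": stream}
    if config:
        body["config"] = config

    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # bearer auth, per Authorizations
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

With `stream` left as false, sending the request with `urllib.request.urlopen(req)` and decoding the body with `json.load` should give an object shaped like the example response, from which `job_id` can be read for later status and results calls.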
Error Handling:
Document and schema validation occurs before job creation for both streaming and non-streaming paths:
- 400: Bad Request (invalid document_id or schema_id format)
- 404: Not Found (document or schema not found)
- 422: Unprocessable Entity (invalid schema definition)
- 429: Too Many Requests (user has exceeded concurrency limit of 5 active jobs)
- 500: Internal Server Error (processing error)
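One way a client might act on these status codes is sketched below. The descriptions come from the list above; the retry flags and the exponential backoff schedule are assumptions about sensible client behavior, not something this page specifies, and `describe_error` and `backoff_delays` are hypothetical names.

```python
from typing import Dict, List, Tuple

# status code -> (documented meaning, whether a retry may help).
# The retry flags are an assumption, not documented API behavior.
ERRORS: Dict[int, Tuple[str, bool]] = {
    400: ("invalid document_id or schema_id format", False),
    404: ("document or schema not found", False),
    422: ("invalid schema definition", False),
    429: ("concurrency limit of 5 active jobs exceeded", True),
    500: ("processing error", True),
}


def describe_error(status_code: int) -> Tuple[str, bool]:
    """Return (meaning, retry_may_help) for a documented status code."""
    return ERRORS.get(status_code, ("unexpected status", False))


def backoff_delays(max_attempts: int, base: float = 1.0, cap: float = 30.0) -> List[float]:
    """Exponential backoff schedule in seconds: base, 2*base, 4*base, ... up to cap."""
    return [min(cap, base * 2 ** i) for i in range(max_attempts)]
```

For a 429, waiting for one of the active jobs to reach a terminal status is likely to be more effective than retrying on a timer alone, since the limit is on concurrent jobs.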
Authorizations
Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
Body
Request model for starting a structured extraction job.
document_id: ID of the document to extract from
schema_id: ID of the schema to use for extraction
config: Configuration options for the extraction process
stream: Whether to stream the results as they become available. If true, the response is an SSE stream (text/event-stream) of JSON events instead of a job ID.
Response
Successful Response
Response model for extraction job creation.
job_id: Unique ID of the extraction job
status: Current status of the job. One of: pending, processing, retrying, completed, failed, cancelled
created_at: Timestamp when the job was created
estimated_completion: Estimated completion time