Start a new structured extraction job.
Extracts structured data from a PDF document using AI models, guided by a JSON Schema. The document_id and schema_id must be valid UUIDs of a previously uploaded document and a previously created schema.
How It Works:
Supported Models:
- gemini-2.5-flash: Fast, cost-effective model for most use cases (default)
- gemini-2.5-pro: More powerful model for complex extractions
Configuration Options:
Basic Settings:
- model: AI model to use for extraction
- max_num_concurrent_requests: Number of parallel processing requests (1-20)
- validate_response_schema: Whether to validate extracted data against the schema
Advanced Settings:
- per_key_attribution: Enable detailed reasoning and confidence scores for each field
- temperature: Control randomness in AI responses (0.0-2.0)
- seed: Set random seed for reproducible results
- enable_thinking: Show AI reasoning process
- n_max_retries: Maximum retry attempts for failed requests
Streaming Mode (stream=true):
When stream=true, the endpoint returns an SSE stream instead of a job ID. Results are streamed as subtrees complete, enabling real-time progress visibility in the frontend.
The stream is delivered as Server-Sent Events (Content-Type: text/event-stream) with the events start, subtree_result, subtree_error, and agentic_array_progress (the job can also be polled via GET /jobs/{job_id}).
SSE Event Types:
- start: Emitted once with job_id, total_subtrees, and start_time
- subtree_result: Emitted per successful subtree with extracted_result, attributions, and usage_metadata
- subtree_error: Emitted per failed subtree with error_message and error_type
See STREAMING-DESIGN.md for the full protocol specification.
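A client consuming the stream can demultiplex events by type. A minimal parser sketch, assuming a standard SSE body with `event:` and `data:` fields per event; the event names come from the list above, while the helper itself is illustrative:

```python
import json

def parse_sse_events(raw: str) -> list:
    """Parse a text/event-stream body into (event_type, data_dict) pairs.

    SSE events are separated by blank lines; each carries an `event:` line
    naming the type and a `data:` line holding a JSON payload.
    """
    events = []
    for block in raw.strip().split("\n\n"):
        event_type, data = None, None
        for line in block.splitlines():
            if line.startswith("event:"):
                event_type = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data = json.loads(line[len("data:"):].strip())
        if event_type is not None:
            events.append((event_type, data))
    return events
```

A `subtree_result` event would then surface as `("subtree_result", {"extracted_result": ..., "attributions": ..., "usage_metadata": ...})`.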
Job Lifecycle:
- pending: Job is queued and waiting to start
- processing: AI is actively extracting data from the document
- completed: Extraction finished successfully with results
- failed: Extraction failed due to an error
- cancelled: Job was cancelled before completion
Monitoring Progress:
Use the GET /jobs/{job_id} endpoint to check job status and progress:
- completion_percentage: Overall progress (0-100)
- current_step: Current processing stage
- fields_processed: Number of schema fields completed
- total_fields: Total number of schema fields to process
Getting Results:
Once a job is completed, use GET /jobs/{job_id}/results to retrieve:
- results: Extracted data conforming to your schema
- metadata: Processing information, costs, and performance metrics
- attributions: (if enabled) Reasoning and confidence scores for each field
Example Request:
{
"document_id": "123e4567-e89b-12d3-a456-426614174000",
"schema_id": "987fcdeb-51a2-43d1-b456-426614174000",
"config": {
"model": "gemini-2.5-flash",
"per_key_attribution": true,
"temperature": 0.1,
"enable_thinking": true
"additional_instructions": "Extract information only about company X. Do not include information about company Y."
}
}
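A request body like the one above can be sanity-checked client-side before being sent. A sketch, assuming the constraints stated in the Configuration Options section (supported models, concurrency 1-20, temperature 0.0-2.0); the helper itself is hypothetical:

```python
import uuid

# Supported model names as documented; extend if the API adds more.
SUPPORTED_MODELS = {"gemini-2.5-flash", "gemini-2.5-pro"}

def validate_extraction_request(body: dict) -> None:
    """Raise ValueError if the request body violates the documented constraints."""
    for key in ("document_id", "schema_id"):
        uuid.UUID(body[key])  # raises ValueError on a malformed UUID
    config = body.get("config", {})
    model = config.get("model", "gemini-2.5-flash")
    if model not in SUPPORTED_MODELS:
        raise ValueError(f"unsupported model: {model}")
    concurrency = config.get("max_num_concurrent_requests", 1)
    if not 1 <= concurrency <= 20:
        raise ValueError("max_num_concurrent_requests must be in 1-20")
    temperature = config.get("temperature", 0.0)
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be in 0.0-2.0")
```

Catching these errors locally avoids a round trip that would end in a 400 or 422 response.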
Example Response:
{
"job_id": "456e7890-e89b-12d3-a456-426614174000",
"status": "pending",
"created_at": "2025-01-11T18:35:00Z",
"estimated_completion": "2025-01-11T18:40:00Z"
}
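With the job_id from a response like the one above, a client typically polls GET /jobs/{job_id} until a terminal status is reached. A sketch, where `fetch_job` stands in for your HTTP client and the status and progress field names follow the Job Lifecycle and Monitoring Progress sections:

```python
import time

# Statuses after which the job will no longer change.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}

def describe_progress(job: dict):
    """Return (is_done, progress_line) for a GET /jobs/{job_id} payload."""
    status = job["status"]
    line = (
        f"{status}: {job.get('completion_percentage', 0)}% "
        f"({job.get('fields_processed', 0)}/{job.get('total_fields', '?')} fields, "
        f"step: {job.get('current_step', 'n/a')})"
    )
    return status in TERMINAL_STATUSES, line

def wait_for_job(fetch_job, job_id: str, poll_seconds: float = 2.0) -> dict:
    """Poll until the job reaches a terminal status, then return the payload."""
    while True:
        job = fetch_job(job_id)
        done, line = describe_progress(job)
        print(line)
        if done:
            return job
        time.sleep(poll_seconds)
```

Once `wait_for_job` returns a payload with status "completed", GET /jobs/{job_id}/results yields the extracted data.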
Error Handling:
Document and schema validation occurs before job creation for both streaming and non-streaming paths:
- 400: Bad Request (invalid document_id or schema_id format)
- 404: Not Found (document or schema not found)
- 422: Unprocessable Entity (invalid schema definition)
- 429: Too Many Requests (user has exceeded the concurrency limit of 5 active jobs)
- 500: Internal Server Error (processing error)
Authentication: Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
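The documented status codes can be mapped to client-side exceptions, with 429 and 500 flagged as retryable. A sketch; the exception class and hint strings are illustrative, only the codes and their meanings come from the list above:

```python
class ExtractionAPIError(Exception):
    def __init__(self, status_code: int, detail: str):
        super().__init__(f"{status_code}: {detail}")
        self.status_code = status_code
        # 429 (concurrency limit) and 500 are transient and worth retrying
        # after a backoff; the 4xx validation errors are not.
        self.retryable = status_code in (429, 500)

ERROR_HINTS = {
    400: "invalid document_id or schema_id format (must be UUIDs)",
    404: "document or schema not found",
    422: "invalid schema definition",
    429: "concurrency limit of 5 active jobs exceeded",
    500: "internal processing error",
}

def raise_for_extraction_status(status_code: int) -> None:
    """Raise ExtractionAPIError for any documented error status."""
    if status_code >= 400:
        raise ExtractionAPIError(status_code, ERROR_HINTS.get(status_code, "unexpected error"))
```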
Request model for starting a structured extraction job.
ID of the document to extract from
ID of the schema to use for extraction
Configuration options for the extraction process
Whether to stream the results as they become available. If true, the response will be a stream of JSON objects.
Successful Response
Response model for extraction job creation.
Unique ID of the extraction job
Current status of the job
Allowed values: pending, processing, retrying, completed, failed, cancelled
Timestamp when the job was created
Estimated completion time
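The response model above can be mirrored client-side as a typed structure. A sketch, assuming string (ISO 8601) timestamps as in the example response; field names and allowed statuses come from this page, the dataclass itself is illustrative:

```python
from dataclasses import dataclass

# Status values documented for the job's `status` field.
ALLOWED_STATUSES = {"pending", "processing", "retrying", "completed", "failed", "cancelled"}

@dataclass
class ExtractionJobResponse:
    job_id: str
    status: str
    created_at: str
    estimated_completion: str

    def __post_init__(self):
        if self.status not in ALLOWED_STATUSES:
            raise ValueError(f"unknown status: {self.status}")
```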