# Start Extraction Job

> Start a new structured extraction job.

Extracts structured data from a PDF document according to a JSON Schema, using an AI model. `document_id` must be the UUID of a previously uploaded document, and `schema_id` must be the UUID of a previously created schema.

How It Works:

1. Document Processing: The system analyzes the PDF document and splits it into manageable sections
2. Schema Analysis: Your JSON schema is parsed and validated for supported features
3. Extraction: The pipeline processes each section and extracts data according to your schema
4. Validation: Extracted data is validated against your schema to ensure constraints and requirements are met
5. Results: Structured data is returned in the format defined by your schema

Supported Models:

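1. `gemini-2.5-flash`: Fast, cost-effective model for most use cases (default)
2. `gemini-2.5-pro`: More accurate but more expensive model for complex extractions
3. `gemini-3-pro-preview` and `gemini-3-flash-preview`: Preview models also accepted by the `model` field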

Configuration Options:

Basic Settings:

1. `model`: AI model to use for extraction
2. `max_num_concurrent_requests`: Number of parallel processing requests (1-20)
3. `validate_response_schema`: Whether to validate extracted data against schema

Advanced Settings:

1. `per_key_attribution`: Include reasoning, confidence scores, and page numbers for each extracted field; increases processing time and cost (default: `false`)
2. `temperature`: Control randomness in AI responses (0.0-2.0)
3. `seed`: Set random seed for reproducible results
4. `enable_thinking`: Enable model reasoning, which can improve accuracy but increases processing time and cost (default: `true`)
5. `n_max_retries`: Maximum retry attempts for failed requests
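
The documented ranges can be checked on the client before a request is sent. The following is a minimal sketch, not an official client: `build_extract_config` is a hypothetical helper, and it covers only the fields that appear in the `ExtractConfig` schema on this page.

```python
from typing import Any, Dict, Optional

# Model names taken from the `model` enum in the ExtractConfig schema.
SUPPORTED_MODELS = {
    "gemini-2.5-flash",
    "gemini-2.5-pro",
    "gemini-3-pro-preview",
    "gemini-3-flash-preview",
}


def build_extract_config(
    model: str = "gemini-2.5-flash",
    per_key_attribution: bool = False,
    temperature: Optional[float] = None,
    enable_thinking: bool = True,
    additional_instructions: Optional[str] = None,
) -> Dict[str, Any]:
    """Return a config dict, raising ValueError on out-of-range values.

    Hypothetical helper for illustration; defaults mirror the schema.
    """
    if model not in SUPPORTED_MODELS:
        raise ValueError(f"unsupported model: {model!r}")
    if temperature is not None and not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0.0 and 2.0")

    config: Dict[str, Any] = {
        "model": model,
        "per_key_attribution": per_key_attribution,
        "enable_thinking": enable_thinking,
    }
    # Optional fields are left out when unset so the server defaults apply.
    if temperature is not None:
        config["temperature"] = temperature
    if additional_instructions is not None:
        config["additional_instructions"] = additional_instructions
    return config
```

Calling `build_extract_config(temperature=0.1, per_key_attribution=True)` yields a dict that can be passed as the `config` field of the request body.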

Streaming Mode (`stream=true`):

When `stream=true`, the endpoint returns an SSE stream instead of a job ID. Results are streamed as subtrees complete, enabling real-time progress visibility in the frontend.

- Response: SSE stream (`text/event-stream`) with events: `start`, `subtree_result`, `subtree_error`, `agentic_array_progress`
- Job records are still created and results persisted (retrievable via `GET /jobs/{job_id}`)
- Always durable: Runs as a Temporal workflow, so the job continues running even if the client disconnects
- Automatic retries: Temporal handles activity failures and retries automatically
- Use case: Interactive frontend UI where users want real-time extraction progress

SSE Event Types:

1. `start`: Emitted once with `job_id`, `total_subtrees`, and `start_time`
2. `subtree_result`: Emitted per successful subtree with `extracted_result`, `attributions`, `usage_metadata`
3. `subtree_error`: Emitted per failed subtree with `error_message` and `error_type`

See `STREAMING-DESIGN.md` for full protocol specification.
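
A client consuming the stream has to split it into events and dispatch on the event name. The sketch below assumes the stream follows the standard SSE wire format (`event:` and `data:` lines, with a blank line ending each event) and that each `data` payload is a JSON object carrying the fields listed above; neither detail is confirmed on this page. `iter_sse_events` and `summarize_stream` are hypothetical names.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield (event_name, payload) pairs from decoded SSE lines."""
    event_name = "message"  # SSE default when no "event:" field is sent
    data_parts: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_parts:
                yield event_name, json.loads("\n".join(data_parts))
            event_name, data_parts = "message", []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        elif line.startswith("event:"):
            event_name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # SSE strips a single leading space after the colon.
            data_parts.append(value[1:] if value.startswith(" ") else value)


def summarize_stream(lines: Iterable[str]) -> Dict[str, Any]:
    """Collect the job ID, subtree results, and subtree errors from a stream."""
    summary: Dict[str, Any] = {"job_id": None, "results": [], "errors": []}
    for name, payload in iter_sse_events(lines):
        if name == "start":
            summary["job_id"] = payload.get("job_id")
        elif name == "subtree_result":
            summary["results"].append(payload.get("extracted_result"))
        elif name == "subtree_error":
            summary["errors"].append(payload.get("error_message"))
        # Other events (for example `agentic_array_progress`) are ignored here.
    return summary
```

Because the job is persisted server-side, a client that loses the connection mid-stream can fall back to `GET /jobs/{job_id}` with the `job_id` from the `start` event.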

Job Lifecycle:

1. `pending`: Job is queued and waiting to start
2. `processing`: AI is actively extracting data from the document
3. `retrying`: Job is being retried (listed in the status enum)
4. `completed`: Extraction finished successfully with results
5. `failed`: Extraction failed due to an error
6. `cancelled`: Job was cancelled before completion

Monitoring Progress:

Use the `GET /jobs/{job_id}` endpoint to check job status and progress:

1. `completion_percentage`: Overall progress (0-100)
2. `current_step`: Current processing stage
3. `fields_processed`: Number of schema fields completed
4. `total_fields`: Total number of schema fields to process
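
Polling typically continues until the job reaches a status from which it no longer changes. The sketch below treats `completed`, `failed`, and `cancelled` as terminal, following the lifecycle above. It assumes the status response is a JSON object with a `status` field alongside the progress fields, which this page does not show. `wait_for_job` is a hypothetical helper; the HTTP call is passed in as `fetch_status`, so the loop does not depend on any particular client.

```python
import time
from typing import Any, Callable, Dict

# Statuses after which a job no longer changes, per the lifecycle above.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_job(
    fetch_status: Callable[[], Dict[str, Any]],
    poll_interval: float = 2.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll until the job is terminal and return the last status payload.

    `fetch_status` should perform GET /jobs/{job_id} and return parsed JSON.
    Raises TimeoutError if the job is still running after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_status()
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not reach a terminal status in time")
        sleep(poll_interval)
```

The default interval and timeout are arbitrary illustration values; tune them to the size of your documents.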

Getting Results:

Once a job is completed, use `GET /jobs/{job_id}/results` to retrieve:

1. `results`: Extracted data conforming to your schema
2. `metadata`: Processing information, costs, and performance metrics
3. `attributions`: (if enabled) Reasoning and confidence scores for each field

Example Request:
```json
{
  "document_id": "123e4567-e89b-12d3-a456-426614174000",
  "schema_id": "987fcdeb-51a2-43d1-b456-426614174000",
  "config": {
    "model": "gemini-2.5-flash",
    "per_key_attribution": true,
    "temperature": 0.1,
    "enable_thinking": true,
    "additional_instructions": "Extract information only about company X. Do not include information about company Y."
  }
}
```
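
The same request can be assembled in Python with only the standard library. This is a sketch: the base URL, the `/extract/jobs` path, and bearer authentication come from the OpenAPI definition on this page, while `build_start_job_request` is a hypothetical helper. It builds the request without sending it.

```python
import json
import urllib.request
from typing import Any, Dict, Optional

API_BASE = "https://api.contextual.ai/v1"  # server URL from the OpenAPI spec


def build_start_job_request(
    api_key: str,
    document_id: str,
    schema_id: str,
    config: Optional[Dict[str, Any]] = None,
    stream: bool = False,
) -> urllib.request.Request:
    """Build (but do not send) a POST /extract/jobs request."""
    body: Dict[str, Any] = {
        "document_id": document_id,
        "schema_id": schema_id,
        "stream": stream,
    }
    if config is not None:
        body["config"] = config
    return urllib.request.Request(
        url=f"{API_BASE}/extract/jobs",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

Pass the returned object to `urllib.request.urlopen` to send it. With `stream` left as `False`, the response body is a JSON object.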

Example Response:
```json
{
  "job_id": "456e7890-e89b-12d3-a456-426614174000",
  "status": "pending",
  "created_at": "2025-01-11T18:35:00Z",
  "estimated_completion": "2025-01-11T18:40:00Z"
}
```

Error Handling:

Document and schema validation occurs before job creation for both streaming and non-streaming paths:

- `400`: Bad Request (invalid `document_id` or `schema_id` format)
- `404`: Not Found (document or schema not found)
- `422`: Unprocessable Entity (invalid schema definition)
- `429`: Too Many Requests (user has exceeded concurrency limit of 5 active jobs)
- `500`: Internal Server Error (processing error)
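
A client can map these codes to a message and a retry decision. In the sketch below the meanings are copied from the list above. The retry policy is an assumption, not documented behavior: `429` and `500` are treated as retryable without changing the request. `classify_error` is a hypothetical name.

```python
from typing import Dict, Tuple

# Status codes and meanings as given in the error list above.
ERROR_MEANINGS: Dict[int, str] = {
    400: "Bad Request: invalid document_id or schema_id format",
    404: "Not Found: document or schema not found",
    422: "Unprocessable Entity: invalid schema definition",
    429: "Too Many Requests: concurrency limit of 5 active jobs exceeded",
    500: "Internal Server Error: processing error",
}

# Assumption: only these may succeed on a later attempt with the same request.
RETRYABLE_CODES = {429, 500}


def classify_error(status_code: int) -> Tuple[str, bool]:
    """Return (human-readable meaning, whether a retry is reasonable)."""
    meaning = ERROR_MEANINGS.get(status_code, f"Unexpected status {status_code}")
    return meaning, status_code in RETRYABLE_CODES
```

For `429`, waiting for one of the active jobs to finish before retrying is likely more effective than retrying immediately.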



## OpenAPI

````yaml api-reference/openapi.json post /extract/jobs
openapi: 3.1.0
info:
  title: Endpoints
  version: '1.0'
servers:
  - url: https://api.contextual.ai/v1
security:
  - BearerAuth: []
paths:
  /extract/jobs:
    post:
      tags:
        - /extract
      summary: Start Extraction Job
      description: >-
        Start a new structured extraction job.


        Extracts structured data from a PDF document using AI models based on a
        JSON Schema. The `document_id` and `schema_id` must be valid UUIDs of
        previously uploaded documents and created schemas.


        How It Works:


        1. Document Processing: The system analyzes the PDF document and splits
        it into manageable sections

        2. Schema Analysis: Your JSON schema is parsed and validated for
        supported features

        3. Extraction: The pipeline processes each section and extracts data
        according to your schema

        4. Validation: Extracted data is validated against your schema to ensure
        constraints and requirements are met

        5. Results: Structured data is returned in the format defined by your
        schema


        Supported Models:


        1. `gemini-2.5-flash`: Fast, cost-effective model for most use cases
        (default)

        2. `gemini-2.5-pro`: More powerful model for complex extractions


        Configuration Options:


        Basic Settings:


        1. `model`: AI model to use for extraction

        2. `max_num_concurrent_requests`: Number of parallel processing requests
        (1-20)

        3. `validate_response_schema`: Whether to validate extracted data
        against schema


        Advanced Settings:


        1. `per_key_attribution`: Enable detailed reasoning and confidence
        scores for each field

        2. `temperature`: Control randomness in AI responses (0.0-2.0)

        3. `seed`: Set random seed for reproducible results

        4. `enable_thinking`: Enable model reasoning, which can improve accuracy
        but increases processing time and cost (default: `true`)

        5. `n_max_retries`: Maximum retry attempts for failed requests


        Streaming Mode (`stream=true`):


        When `stream=true`, the endpoint returns an SSE stream instead of a job
        ID. Results are streamed as subtrees complete, enabling real-time
        progress visibility in the frontend.


        - Response: SSE stream (`text/event-stream`) with events: `start`,
        `subtree_result`, `subtree_error`, `agentic_array_progress`

        - Job records are still created and results persisted (retrievable via
        `GET /jobs/{job_id}`)

        - Always durable: Runs as a Temporal workflow, so the job continues
        running even if the client disconnects

        - Automatic retries: Temporal handles activity failures and retries
        automatically

        - Use case: Interactive frontend UI where users want real-time
        extraction progress


        SSE Event Types:


        1. `start`: Emitted once with `job_id`, `total_subtrees`, and
        `start_time`

        2. `subtree_result`: Emitted per successful subtree with
        `extracted_result`, `attributions`, `usage_metadata`

        3. `subtree_error`: Emitted per failed subtree with `error_message` and
        `error_type`


        See `STREAMING-DESIGN.md` for full protocol specification.


        Job Lifecycle:


        1. `pending`: Job is queued and waiting to start

        2. `processing`: AI is actively extracting data from the document

        3. `completed`: Extraction finished successfully with results

        4. `failed`: Extraction failed due to an error

        5. `cancelled`: Job was cancelled before completion


        Monitoring Progress:


        Use the `GET /jobs/{job_id}` endpoint to check job status and progress:


        1. `completion_percentage`: Overall progress (0-100)

        2. `current_step`: Current processing stage

        3. `fields_processed`: Number of schema fields completed

        4. `total_fields`: Total number of schema fields to process


        Getting Results:


        Once a job is completed, use `GET /jobs/{job_id}/results` to retrieve:


        1. `results`: Extracted data conforming to your schema

        2. `metadata`: Processing information, costs, and performance metrics

        3. `attributions`: (if enabled) Reasoning and confidence scores for each
        field


        Example Request:

        ```json

        {
          "document_id": "123e4567-e89b-12d3-a456-426614174000",
          "schema_id": "987fcdeb-51a2-43d1-b456-426614174000",
          "config": {
            "model": "gemini-2.5-flash",
            "per_key_attribution": true,
            "temperature": 0.1,
            "enable_thinking": true,
            "additional_instructions": "Extract information only about company X. Do not include information about company Y."
          }
        }

        ```


        Example Response:

        ```json

        {
          "job_id": "456e7890-e89b-12d3-a456-426614174000",
          "status": "pending",
          "created_at": "2025-01-11T18:35:00Z",
          "estimated_completion": "2025-01-11T18:40:00Z"
        }

        ```


        Error Handling:


        Document and schema validation occurs before job creation for both
        streaming and non-streaming paths:


        - `400`: Bad Request (invalid `document_id` or `schema_id` format)

        - `404`: Not Found (document or schema not found)

        - `422`: Unprocessable Entity (invalid schema definition)

        - `429`: Too Many Requests (user has exceeded concurrency limit of 5
        active jobs)

        - `500`: Internal Server Error (processing error)
      operationId: start_extraction_job_extract_jobs_post
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/ExtractJobRequest'
      responses:
        '200':
          description: Successful Response
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ExtractJobResponse'
        '422':
          description: Validation Error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/HTTPValidationError'
components:
  schemas:
    ExtractJobRequest:
      properties:
        document_id:
          type: string
          title: Document Id
          description: ID of the document to extract from
        schema_id:
          type: string
          title: Schema Id
          description: ID of the schema to use for extraction
        config:
          anyOf:
            - $ref: '#/components/schemas/ExtractConfig'
            - type: 'null'
          description: Configuration options for the extraction process
        stream:
          type: boolean
          title: Stream
          description: >-
            Whether to stream results as they become available. If true, the
            response is a server-sent events (`text/event-stream`) stream
            instead of a single JSON object.
          default: false
      additionalProperties: false
      type: object
      required:
        - document_id
        - schema_id
      title: ExtractJobRequest
      description: Request model for starting a structured extraction job.
    ExtractJobResponse:
      properties:
        job_id:
          type: string
          title: Job Id
          description: Unique ID of the extraction job
        status:
          $ref: '#/components/schemas/DocumentStatusEnum'
          description: Current status of the job
        created_at:
          type: string
          title: Created At
          description: Timestamp when the job was created
        estimated_completion:
          type: string
          title: Estimated Completion
          description: Estimated completion time
      type: object
      required:
        - job_id
        - status
        - created_at
        - estimated_completion
      title: ExtractJobResponse
      description: Response model for extraction job creation.
    HTTPValidationError:
      properties:
        detail:
          items:
            $ref: '#/components/schemas/ValidationError'
          type: array
          title: Detail
      type: object
      title: HTTPValidationError
    ExtractConfig:
      properties:
        model:
          type: string
          enum:
            - gemini-2.5-flash
            - gemini-2.5-pro
            - gemini-3-pro-preview
            - gemini-3-flash-preview
          title: Model
          description: >-
            Base model for extraction. 'gemini-2.5-flash' is faster and cheaper,
            while 'gemini-2.5-pro' is more accurate and expensive.
          default: gemini-2.5-flash
          examples:
            - gemini-2.5-flash
            - gemini-2.5-pro
            - gemini-3-pro-preview
            - gemini-3-flash-preview
        per_key_attribution:
          type: boolean
          title: Per Key Attribution
          description: >-
            Set to include attributions for each extracted field including
            reasoning, confidence scores, and page numbers. This option
            increases processing time and cost.
          default: false
          examples:
            - false
            - true
        temperature:
          anyOf:
            - type: number
              maximum: 2
              minimum: 0
            - type: 'null'
          title: Temperature
          description: >-
            Sampling temperature (0.0-2.0). Lower values make output more
            deterministic and consistent, higher values vary more. Leave as None
            for model defaults.
          examples:
            - 0
            - 0.1
            - 0.5
            - 1
            - 2
        enable_thinking:
          type: boolean
          title: Enable Thinking
          description: >-
            Enable reasoning which can improve accuracy but increases processing
            time and cost.
          default: true
          examples:
            - true
            - false
        additional_instructions:
          anyOf:
            - type: string
            - type: 'null'
          title: Additional Instructions
          description: >-
            Additional instructions to the model. This is a string that will be
            appended to the prompt. It is useful for providing additional
            context or instructions to the model.
        enable_agentic_array_extraction:
          type: boolean
          title: Enable Agentic Array Extraction
          description: >-
            If True, will use agentic array extraction to extract the array of
            objects when possible. Requires isolate_arrays=True in
            splitter_config. One can also enable agentic array extraction for
            specific arrays by setting agenticArrayExtractionMaxItemsPerTurn > 0
            for the array in the JSON schema.
          default: false
      additionalProperties: false
      type: object
      title: ExtractConfig
      description: >-
        Configuration for structured extraction. Most settings have sensible
        defaults, but you can customize them for your specific use case.
    DocumentStatusEnum:
      type: string
      enum:
        - pending
        - processing
        - retrying
        - completed
        - failed
        - cancelled
      title: DocumentStatusEnum
    ValidationError:
      properties:
        loc:
          items:
            anyOf:
              - type: string
              - type: integer
          type: array
          title: Location
        msg:
          type: string
          title: Message
        type:
          type: string
          title: Error Type
      type: object
      required:
        - loc
        - msg
        - type
      title: ValidationError
  securitySchemes:
    BearerAuth:
      type: http
      scheme: bearer
      bearerFormat: API Key

````