This article describes the key parameters that are available when configuring RAG agents on the Contextual AI platform. Sensible defaults are applied for all settings, but they can be modified based on your preferences or to optimize performance against your data and query patterns.

Standard Parameters

  • Datastores (datastore_ids): Datastores are the knowledge bases that your agent can access when answering queries. Files uploaded into a datastore are processed using Contextual’s multi-modal document understanding pipeline, which prepares documents in ways optimized for end-to-end RAG performance. You must link at least one datastore to your agent, but you can specify more.
  • Generator Model (llm_model_id): Determines which generator model powers your agent. You can use either our default Grounded Language Model or a version that has been specifically tuned to your use case. Tuned models can only be used with the agents on which they were tuned.
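
The two standard parameters above can be supplied together when creating an agent. The following is a minimal sketch of a JSON-style configuration payload: the field names come from this article, while the agent name, datastore IDs, and model ID are placeholders, and the surrounding request shape may differ from the actual Contextual AI API.

```python
# Illustrative agent configuration; IDs below are placeholders.
agent_config = {
    "name": "support-agent",                # hypothetical agent name
    "datastore_ids": ["ds_abc", "ds_def"],  # at least one datastore is required
    "llm_model_id": "default",              # or the ID of a model tuned for this agent
}

# An agent must be linked to at least one datastore.
assert len(agent_config["datastore_ids"]) >= 1
```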

System Prompts

Note on Adherence: Contextual has built agents to faithfully follow instructions. However, in some cases complete adherence cannot be guaranteed, especially where instructions are unclear, under-specified, or in conflict with other instructions you have given or our guardrails.

The System Prompts instruct the agent on how to respond to users’ queries given the retrieved knowledge. The appropriate prompt is passed, along with the user query and relevant retrievals, to the Generator Model at generation-time.

  • Core System Prompt (system_prompt): Defines how the agent interprets queries and generates responses. You can provide instructions about the agent’s persona and style, and the desired content and structure of the responses.
  • No Retrieval System Prompt (no_retrieval_system_prompt): Defines the agent’s behavior if, after the retrieval, reranking, and filter steps, no relevant knowledge has been identified. You can use this prompt to define boilerplate refusals, offer help and guidance, provide information about the document store, and specify other contextually-appropriate ways that the agent should respond.
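Both prompts can be set on the same agent, so that behavior is defined for the retrieval and no-retrieval cases alike. The sketch below shows one plausible pairing; the prompt wording is purely illustrative, and only the two field names are taken from this article.

```python
# Illustrative system prompts; the wording is an example, not a recommendation.
agent_config = {
    # Used when relevant chunks survive retrieval, reranking, and filtering.
    "system_prompt": (
        "You are a concise product-support assistant. Answer using only the "
        "retrieved documentation, and cite the source document for each claim."
    ),
    # Used when no relevant knowledge is identified after those steps.
    "no_retrieval_system_prompt": (
        "No relevant documentation was found. Briefly say so, and suggest "
        "that the user rephrase the question or ask about the product manuals."
    ),
}
```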

Query Understanding

These settings affect whether and how user queries are modified to improve retrieval performance and response generation.

  • Enable Multi-turn (enable_multi_turn): Allows the agent to remember and reference previous parts of the conversation, making interactions feel more natural and continuous. When enabled, the user’s query will automatically be reformulated based on prior turns to resolve ambiguities. The conversation history is prepended to the query at generation-time.
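
The reformulation happens automatically on the platform when enable_multi_turn is set; the sketch below is a hand-written example of the kind of disambiguation involved, using a hypothetical product name.

```python
# Illustrative multi-turn reformulation (performed by the platform, not by you).
history = [
    {"role": "user", "content": "What is the X100 router?"},   # hypothetical product
    {"role": "assistant", "content": "The X100 is a small-office router..."},
]

raw_query = "How do I reset it?"
# The ambiguous "it" is resolved from the conversation history before retrieval:
reformulated_query = "How do I reset the X100 router?"
```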

Retrieval

These settings determine how the agent performs the initial retrieval from linked unstructured datastores.

  • Number of Retrieved Chunks (top_k_retrieved_chunks): The maximum number of chunks that should be retrieved from the linked datastore(s). For example, if set to 10, the agent will retrieve at most 10 relevant chunks to generate its response. The value ranges from 1 to 200, and the default value is 100.
  • Lexical Search Weight (lexical_alpha): When chunks are scored during retrieval (based on their relevance to the user query), this controls how much weight the scoring gives to exact keyword matches. A higher value means the agent will prioritize finding exact word matches, while a lower value allows for more flexible, meaning-based matching. The value ranges from 0 to 1, and we default to 0.1. You can increase this weight if exact terminology, specific entities, or specialized vocabulary is important for your use case.
  • Semantic Search Weight (semantic_alpha): When chunks are scored during retrieval (based on their relevance to the user query), this controls how much weight the scoring gives to semantic similarity. Semantic searching looks for text that conveys similar meaning and concepts, rather than just matching exact keywords or phrases. The value ranges from 0 to 1, and we default to 0.9. The value of the semantic search weight and lexical search weight must sum to 1.
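
The platform's internal scoring is not specified in this article, but conceptually the two weights blend a keyword-match score with a semantic-similarity score. The sketch below is an assumed linear combination for illustration only, including the constraint that the two weights must sum to 1.

```python
def hybrid_score(lexical_score: float, semantic_score: float,
                 lexical_alpha: float = 0.1, semantic_alpha: float = 0.9) -> float:
    """Illustrative hybrid retrieval score.

    Assumes a simple weighted sum; the platform's actual scoring may differ.
    Defaults mirror the documented defaults (lexical 0.1, semantic 0.9).
    """
    if abs(lexical_alpha + semantic_alpha - 1.0) > 1e-9:
        raise ValueError("lexical_alpha and semantic_alpha must sum to 1")
    return lexical_alpha * lexical_score + semantic_alpha * semantic_score
```

With the defaults, a chunk scoring 1.0 on semantic similarity but 0.0 on keyword overlap still receives a hybrid score of 0.9, which is why raising lexical_alpha matters when exact terminology is critical.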

Rerank and Filter

These settings affect how the agent reranks and filters chunks before passing them to the generator model.

  • Enable Reranking (enable_rerank): Allows the agent to take the initially retrieved document chunks and rerank them based on the provided instructions and user query. The top reranked chunks are passed on for filtering and final response generation.
    • Rerank instructions (rerank_instructions): Natural language instructions that describe your preferences for the ranking of chunks, such as prioritizing chunks from particular sources or time periods. Chunks will be rescored based on these instructions and the user query. If no instruction is given, the rerank scores are based only on relevance of chunks to the query.
    • Number of Reranked Chunks (top_k_reranked_chunks): The number of top reranked chunks that are passed on for generation.
    • Reranker score filter (reranker_score_filter_threshold): If set to a value greater than 0, chunks with relevance scores below that threshold are not passed on. The value must be between 0 and 1.
  • Enable Filtering (enable_filter): Allows the agent to perform a final filtering step prior to generation. When enabled, chunks are checked against the filter prompt and irrelevant chunks are filtered out. This acts like a final quality control checkpoint, helping to ensure that only relevant chunks are passed to the generator. This filter can improve response accuracy and relevance, but also increase the false refusal rate if the configuration is too strict.
    • Filter Prompt (filter_prompt): Natural language instructions that describe the criteria for relevant and irrelevant chunks. It can be used in tandem with, or as an alternative to, the reranker score-based filtering.
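
A combined rerank-and-filter configuration might look like the following sketch. Field names are taken from this article; the instruction text, threshold, and chunk count are illustrative choices, and the surrounding request shape is assumed.

```python
# Illustrative rerank-and-filter settings; values are examples, not recommendations.
rerank_filter_config = {
    "enable_rerank": True,
    "rerank_instructions": "Prioritize chunks from documents updated in the last year.",
    "top_k_reranked_chunks": 20,             # chunks passed on after reranking
    "reranker_score_filter_threshold": 0.3,  # drop chunks scoring below 0.3
    "enable_filter": True,
    "filter_prompt": "Keep only chunks that directly discuss product configuration.",
}

# The reranker score threshold must lie between 0 and 1.
assert 0 <= rerank_filter_config["reranker_score_filter_threshold"] <= 1
```

A stricter filter_prompt or higher threshold removes more marginal chunks, but, as noted above, can raise the false refusal rate.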

Generate

These settings affect how the generator model produces responses.  

  • Max New Tokens (max_new_tokens): Controls the maximum length of the agent’s response. Defaults to 2,048 tokens.
  • Temperature (temperature): Controls how creative the agent’s responses are. A higher temperature means more creative and varied responses, while a lower temperature results in more consistent, predictable answers. It ranges from 0 to 1. Defaults to 0.
  • Top P (top_p): Similar to temperature, this parameter also controls response variety. It determines how many different word choices the agent considers when generating its response. Defaults to 0.9.
  • Frequency Penalty (frequency_penalty): Helps prevent repetition in responses by making the agent less likely to use words it has already used frequently. This helps ensure more natural, varied language. Defaults to 0.
  • Random Seed (seed): Controls randomness in how the agent selects the next tokens during text generation. Allows for reproducible results, which can be useful for testing.
  • Enable Groundedness Scores (calculate_groundedness): Enables the agent to provide groundedness scores as part of its response. When enabled, the agent identifies distinct claims in the response and assesses whether each one is grounded in the retrieved document chunks. Claims that are not grounded are shown in yellow in the UI. Defaults to off.
  • Disable Commentary (avoid_commentary): Flag that indicates whether the agent should output only strictly factual information grounded in the retrieved knowledge, rather than a complete response (which can include commentary, analysis, etc.).
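
Pulling the generation settings together, a configuration using the documented defaults plus a fixed seed might look like this sketch. Only the field names and default values come from this article; the seed value and overall payload shape are assumptions.

```python
# Illustrative generation settings, using the documented defaults where stated.
generation_config = {
    "max_new_tokens": 2048,        # default maximum response length
    "temperature": 0,              # default: deterministic, consistent answers
    "top_p": 0.9,                  # default nucleus-sampling cutoff
    "frequency_penalty": 0,        # default: no repetition penalty
    "seed": 42,                    # arbitrary fixed seed for reproducible testing
    "calculate_groundedness": True,
    "avoid_commentary": False,
}
```

Fixing the seed while keeping temperature at 0 is a common pattern for regression-testing agent responses, since both settings reduce run-to-run variation.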

Miscellaneous

  • Suggested queries (suggested_queries): Example queries that appear in the user interface when first interacting with the agent. You can provide both simple and complex examples to help users understand the full range of questions your agent can handle. This helps set user expectations and guides them toward effective interactions with the agent.