Tuning and Evaluation
Learn how to tune and evaluate your agent
This guide covers the next steps of tuning and evaluating your Agent. Make sure you’ve gone through the Beginner’s Guide first.
For a hands-on experience, go through our Colab Notebook:
Open Notebook in Colab
Tune
Fine-tuning an agent on domain-specific data helps the agent learn where to focus its attention and how to better interpret domain-specific information. When combined, the approaches are mutually reinforcing: RAG provides the up-to-date knowledge, while fine-tuning optimizes how that knowledge is processed and applied.
Fine-tuning RAG Agents has proven very effective across our customer deployments and research benchmarks. We provide a powerful set of APIs that enable you to fine-tune Contextual RAG Agents on your data.
Step 1: Create a tune job
To create a tune job, you need to provide a training file and an (optional) test file. The file(s) must be in `JSONL` or `CSV` format with four required fields (see below). If no test file is provided, the API will automatically perform a train-test split on the training file. Your data files must have the following fields:
- The question you are asking the Agent.
- The gold-standard response.
- The list of retrieved chunks used to generate the gold-standard response.
- Guidelines for the Agent's output. You can write a few lines on what you expect the Agent's response to look like, which will help guide its learning process. If you do not have any special guidelines, you can just use the System Prompt in your agent configuration.
You can then run the following command:
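(A rough sketch using Python's `requests` library; the base URL, the `/agents/{agent_id}/tune` path, the `training_file`/`test_file` multipart field names, and the `id` response field are assumptions here, so check the API reference for the exact request shape.)

```python
import os
import requests

API_KEY = os.environ["API_KEY"]             # your Contextual API key
BASE_URL = "https://api.contextual.ai/v1"   # assumed base URL
agent_id = "your-agent-id"

# Create a tune job by uploading a local training file.
# Add a "test_file" entry to `files` if you have a separate test set.
with open("training_data.jsonl", "rb") as train_f:
    resp = requests.post(
        f"{BASE_URL}/agents/{agent_id}/tune",            # assumed endpoint path
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"training_file": train_f},
    )
resp.raise_for_status()
job_id = resp.json()["id"]                  # assumed field holding the tune job_id
print("Tune job created:", job_id)
```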
If the command is successful, you'll receive a `job_id` for the tune job. Keep in mind that tuning will take some time to complete.
After creating a tune job with locally uploaded files, the train and test files will automatically be saved as Datasets for future use. When creating a future tune job, you can simply use the Dataset names as your input parameters. You can manage these Datasets with the `/datasets/tune` and `/datasets/evaluation` endpoints.
Step 2: Check the status of the tune job
You can check the status of the tune job by passing in the `agent_id` and `job_id`.
When the job is completed, the status will change from `processing` to `completed`. The response payload will contain the tuned `model_id`, which you will use to activate your tuned model, as well as evaluation data from the test split of your tuning dataset. Here is what the response payload looks like:
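(The status check below is a sketch with the same assumptions as the earlier example, and the commented fields are illustrative guesses based on the description in this section rather than the exact schema.)

```python
import os
import requests

API_KEY = os.environ["API_KEY"]
BASE_URL = "https://api.contextual.ai/v1"   # assumed base URL
agent_id = "your-agent-id"
job_id = "your-tune-job-id"

resp = requests.get(
    f"{BASE_URL}/agents/{agent_id}/tune/jobs/{job_id}/metadata",  # assumed endpoint path
    headers={"Authorization": f"Bearer {API_KEY}"},
)
resp.raise_for_status()
job = resp.json()

# Illustrative contents (exact field names may differ):
#   job status           -> "processing" or "completed"
#   tuned model_id       -> used to activate the model in Step 3
#   evaluation results   -> metrics from the test split, plus the Dataset name
#                           holding row-by-row results (e.g. eval-results-101)
print(job)
```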
We store your row-by-row evaluation results in a Dataset that you can find in the response payload (in this example, it is `eval-results-101`). You can use the `/datasets/evaluate` API to inspect your row-by-row results.
Step 3: Activate the tuned model
Before you can use the tuned model, you need to activate it for your Agent. You can do so by editing the configuration of your Agent and passing in the tuned `model_id`. Currently, we allow a maximum of 3 tuned models to be activated per tenant. The activation might take a moment to complete.
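(A minimal sketch of the activation step, assuming the Agent is edited via a PUT to `/agents/{agent_id}` with an `llm_model_id` field; the exact edit endpoint and body are in the API reference.)

```python
import os
import requests

API_KEY = os.environ["API_KEY"]
BASE_URL = "https://api.contextual.ai/v1"   # assumed base URL
agent_id = "your-agent-id"
tuned_model_id = "your-tuned-model-id"      # the model_id returned by the tune job

# Point the Agent's configuration at the tuned model to activate it.
resp = requests.put(
    f"{BASE_URL}/agents/{agent_id}",        # assumed endpoint path for editing an Agent
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"llm_model_id": tuned_model_id},
)
resp.raise_for_status()
print("Activation requested for", tuned_model_id)
```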
Step 4: Query your tuned model!
After you have activated the tuned model, you can query it with the usual command. The tuned model is automatically set as the default for your Agent, so you don't have to fill in the `llm_model_id` field.
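(A sketch of the usual query command; the `/agents/{agent_id}/query` path and the `messages` body field are assumptions.)

```python
import os
import requests

API_KEY = os.environ["API_KEY"]
BASE_URL = "https://api.contextual.ai/v1"   # assumed base URL
agent_id = "your-agent-id"

# Query the Agent as usual; the activated tuned model is used by default,
# so no llm_model_id needs to be passed in the request.
resp = requests.post(
    f"{BASE_URL}/agents/{agent_id}/query",  # assumed endpoint path
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"messages": [{"role": "user", "content": "What does the latest filing say about revenue?"}]},
)
resp.raise_for_status()
print(resp.json())
```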
Step 5: Deactivate your tuned model (if needed)
We support a maximum of 3 activated models per tenant. As such, you might want to deactivate an existing tuned model to make way for another. To do so, edit the config of the Agent where that existing model is activated and change the `llm_model_id` to `"default"`.
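(This reuses the same assumed edit endpoint as Step 3, just with `"default"` as the model.)

```python
import os
import requests

API_KEY = os.environ["API_KEY"]
BASE_URL = "https://api.contextual.ai/v1"   # assumed base URL
agent_id = "agent-with-the-model-activated"

# Switch the Agent back to the default model, freeing up an activation slot.
resp = requests.put(
    f"{BASE_URL}/agents/{agent_id}",        # assumed endpoint path
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"llm_model_id": "default"},
)
resp.raise_for_status()
```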
Evaluation
Evaluation assesses your model's performance and identifies areas of weakness. It can create a positive feedback loop with Tuning for continuous improvement.
Our Evaluation endpoints allow you to evaluate your Agent using a set of `prompts` (questions) and `reference` (gold) answers. We currently support two metrics: `equivalence` and `groundedness`.
- Equivalence: Uses a Language Model as a judge to evaluate if the Agent's response is equivalent to the Reference (or gold-standard) answer.
- Groundedness: Decomposes the Agent's response into claims, and then uses a Language Model to evaluate if the claims are grounded in the retrieved documents.
Step 1: Create an evaluation job
To create an evaluation job, you need to provide an evaluation set. The file must be in `CSV` format with two required fields:
- The question you are asking the Agent.
- The gold-standard response.
Use the following command to create your evaluation job. You will need to pass in your `agent_id` and the `file_path` to your evaluation set.
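(A rough sketch with the same assumptions as the tune example: the `/agents/{agent_id}/evaluate` path, the `evalset_file` multipart field name, and the `id` response field are guesses to be checked against the API reference.)

```python
import os
import requests

API_KEY = os.environ["API_KEY"]
BASE_URL = "https://api.contextual.ai/v1"   # assumed base URL
agent_id = "your-agent-id"
file_path = "eval_set.csv"                  # CSV with the two fields described above

# Create an evaluation job by uploading the evaluation set.
with open(file_path, "rb") as eval_f:
    resp = requests.post(
        f"{BASE_URL}/agents/{agent_id}/evaluate",        # assumed endpoint path
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"evalset_file": eval_f},                  # assumed multipart field name
    )
resp.raise_for_status()
job_id = resp.json()["id"]                  # assumed field holding the evaluation job_id
print("Evaluation job created:", job_id)
```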
If the command is successful, you'll receive a `job_id` for the evaluation job. Keep in mind that evaluation will take some time to complete.
Step 2: Check the status of your evaluation job
You can check the status of your evaluation job by passing in the `agent_id` and `job_id`.
When the job is completed, the status will change from `processing` to `completed`. The response payload will contain evaluation metrics and metadata. Here is what the response payload looks like:
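(As with the tune job, the status check below is a sketch, and the commented fields are illustrative rather than the exact schema.)

```python
import os
import requests

API_KEY = os.environ["API_KEY"]
BASE_URL = "https://api.contextual.ai/v1"   # assumed base URL
agent_id = "your-agent-id"
job_id = "your-evaluation-job-id"

resp = requests.get(
    f"{BASE_URL}/agents/{agent_id}/evaluate/jobs/{job_id}/metadata",  # assumed endpoint path
    headers={"Authorization": f"Bearer {API_KEY}"},
)
resp.raise_for_status()
job = resp.json()

# Illustrative contents (exact field names may differ):
#   status        -> "processing" or "completed"
#   metrics       -> equivalence / groundedness scores
#   job_metadata  -> successful/failed prediction counts and the Dataset name
#                    holding row-by-row results
print(job)
```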
The `metrics` field contains the evaluation scores obtained by the model. The `job_metadata` field contains information on the evaluation performed, including the number of successful/failed predictions and the Dataset where your row-by-row evaluation results are stored.
Step 3: View your evaluation results
In Step 2, you will get a `dataset_name` when your evaluation job has completed. In the example above, this is `evalresults-101`. You can then view your raw evaluation results (equivalence and/or groundedness scores for each question-response pair) with the `/datasets/evaluate` endpoint.
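(A sketch of fetching the raw results; the `/agents/{agent_id}/datasets/evaluate/{dataset_name}` path is an assumption, and the same kind of call applies to the tune results Dataset mentioned earlier.)

```python
import os
import requests

API_KEY = os.environ["API_KEY"]
BASE_URL = "https://api.contextual.ai/v1"   # assumed base URL
agent_id = "your-agent-id"
dataset_name = "evalresults-101"            # dataset_name from the completed evaluation job

resp = requests.get(
    f"{BASE_URL}/agents/{agent_id}/datasets/evaluate/{dataset_name}",  # assumed endpoint path
    headers={"Authorization": f"Bearer {API_KEY}"},
)
resp.raise_for_status()
print(resp.text)   # row-by-row results, e.g. per-question equivalence/groundedness scores
```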
For a guided walkthrough of the evaluation process, see the Colab Notebook linked at the top of the page.
LMUnit
In addition to the standard evaluation methods described above, the Contextual Platform also provides another method of evaluation via natural language unit tests using the `/lmunit` endpoint. To understand the use cases of `/lmunit`, please read our blog post.
LMUnit enables targeted evaluation of specific criteria that you care about, allowing you to evolve and optimize your Agents.
Step 1: Define your evaluation criteria
Come up with criteria for what constitutes a good response in the context of your agent. The criteria can be about the content, form, or style of a response:
- Is there a style or tone you want the agent to maintain?
- Do good answers exhibit a specific reasoning pattern?
- Should responses cover specific content areas?
Step 2: Create testable statements
Translate your criteria into specific, clear, and testable statements or questions. For example:
- Does the response maintain a professional style?
- Does the response impartially cover different opinions or perspectives that exist on a question?
- Does the response mention US Federal Reserve policy if the question is about the rate of inflation?
Step 3: Use the LMUnit endpoint
Use the `/lmunit` endpoint to evaluate a given query-response pair. Scores are reported on a 1-5 scale, with higher scores indicating better satisfaction of the test criteria.
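(The `query`, `response`, and `unit_test` fields below come from this guide; the base URL and the exact response shape are assumptions, and the example values are placeholders to replace with your own data.)

```python
import os
import requests

API_KEY = os.environ["API_KEY"]             # your Contextual API key ($API_KEY)
BASE_URL = "https://api.contextual.ai/v1"   # assumed base URL

resp = requests.post(
    f"{BASE_URL}/lmunit",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "query": "What is the current rate of inflation?",              # your query
        "response": "Inflation is currently around 3 percent...",       # your app's response
        "unit_test": "Does the response mention US Federal Reserve policy "
                     "if the question is about the rate of inflation?", # your test criterion
    },
)
resp.raise_for_status()
print(resp.json())   # contains a 1-5 score for how well the response meets the criterion
```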
Note: Remember to:
- Replace `$API_KEY` with your API key.
- Replace the `query`, `response`, and `unit_test` fields with a query-response pair generated from your app and your test criteria.
- You may need to truncate lengthy responses to keep within the current character limitations for this API.
If your request is successful, you’ll receive a score that indicates how well your response meets the specified criteria.
🎉 That was a quick spin through our tune and eval endpoints! To learn more about our APIs and their capabilities, visit docs.contextual.ai. We look forward to seeing what you build with our platform 🏗️