This guide covers the next steps of tuning and evaluating your Agent. Make sure you’ve gone through the Beginner’s Guide first.
For a hands-on experience, work through our Colab Notebook.
Tune
Fine-tuning an Agent on domain-specific data helps it learn where to focus its attention and how to better interpret domain-specific information. The two approaches are mutually reinforcing: RAG provides up-to-date knowledge, while fine-tuning optimizes how that knowledge is processed and applied.
Fine-tuning of RAG Agents has proven very effective across our customer deployments and research benchmarks. We provide a powerful set of APIs that enable you to fine-tune Contextual RAG Agents on your own data.
Step 1: Create a tune job
To create a tune job, you need to provide a training file and an (optional) test file. The file(s) must be in JSONL or CSV format with four required fields (see below). If no test file is provided, the API will automatically perform a train-test split on the training file. Your data files must have the following fields:
- prompt: The question you are asking the Agent.
- reference: The gold-standard response.
- knowledge: The list of retrieved chunks used to generate the gold-standard response.
- guideline: Guidelines for the Agent's output. You can write a few lines on what you expect the Agent's response to look like, which will help guide its learning process. If you do not have any special guidelines, you can just use the System Prompt from your agent configuration.
{"prompt": "What are the key impacts of climate change?", "reference": "Climate change has several major impacts: rising global temperatures, sea level rise, extreme weather events, and disruption of ecosystems. These changes affect agriculture, human health, and infrastructure worldwide.", "guideline": "Provide a clear, concise summary of major climate change impacts with specific examples.", "knowledge": ["IPCC AR6 Summary: Global surface temperature was 1.09°C higher in 2011-2020 than 1850-1900", "NASA data shows sea levels rising at 3.3mm per year", "WHO report on climate change health impacts 2023"]}
{"prompt": "How do I prepare for a job interview?", "reference": "To prepare for a job interview: research the company, practice common questions, prepare relevant examples of your experience, dress professionally, and bring extra copies of your resume.", "guideline": "Give actionable interview preparation steps.", "knowledge": ["Harvard Business Review: Top Interview Preparation Strategies", "Career counseling best practices guide"]}
You can then run the following command:
#%pip install contextual-client
from contextual import ContextualAI

CONTEXTUAL_API_KEY = "INSERT_KEY_HERE"
REQUEST_URL = "https://api.contextual.ai/v1"

client = ContextualAI(
    api_key=CONTEXTUAL_API_KEY,
    base_url=REQUEST_URL
)

# Replace trainset_path, evalset_path, and agent_id with your own values
with open(trainset_path, 'rb') as training_file, open(evalset_path, 'rb') as test_file:
    tune_response = client.agents.tune.create(
        agent_id=agent_id,
        training_file=training_file,
        test_file=test_file
    )

print(f"Tune job created. ID: {tune_response.id}")
curl --request 'POST' \
--url 'https://api.contextual.ai/v1/agents/{agent_id}/tune' \
--header 'accept: application/json' \
--header 'Content-Type: multipart/form-data' \
--header 'Authorization: Bearer $API_KEY' \
--form 'training_file=@{$file_path}' \
--form 'test_file=@{$file_path}'
# Replace {agent_id}, $API_KEY, {$file_path}
If the command is successful, you’ll be returned a job_id for the tune job. Keep in mind that tuning will take some time to complete.
After creating a tune job with locally uploaded files, the train and test files will automatically be saved as Datasets for future use. When creating a future tune job, you can simply use the Dataset names as your input parameters. You can manage these Datasets with the /datasets/tune and /datasets/evaluation endpoints.
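Once the files have been saved as Datasets, a later tune job can reference them by name instead of re-uploading the files. The sketch below illustrates the idea; note that the train_dataset_name and test_dataset_name parameter names are shown for illustration and may differ, so check the tune API reference for the exact fields supported.

# Hypothetical reuse of previously saved Datasets by name.
# NOTE: the parameter names train_dataset_name / test_dataset_name are
# assumptions; consult the tune API reference before relying on them.
tune_response = client.agents.tune.create(
    agent_id=agent_id,                      # replace
    train_dataset_name="my-train-dataset",  # name of a saved tune Dataset
    test_dataset_name="my-test-dataset",    # name of a saved evaluation Dataset
)
print(f"Tune job created. ID: {tune_response.id}")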
Step 2: Check the status of the tune job
You can check the status of the tune job by passing in the agent_id and job_id.
response = client.agents.tune.jobs.metadata(
    agent_id=agent_id,        # replace
    job_id=tune_response.id,  # replace
)
response
curl --request 'GET' \
--url 'https://api.contextual.ai/v1/agents/{agent_id}/tune/jobs/{job_id}/metadata' \
--header 'accept: application/json' \
--header 'Authorization: Bearer $API_KEY'
# Replace {agent_id}, {job_id}, $API_KEY
When the job is completed, the status will change from processing to completed. The response payload will contain the tuned model_id, which you will use to activate your tuned model, as well as evaluation data from the test split of your tuning dataset. Here is what the response payload looks like:
TuneJobMetadata(job_status='completed',
evaluation_results=None,
model_id='registry/tuned-model-101',
id='e44661f0-bagb-4919-b0df-bada36a31',
evaluation_metadata={'status': 'completed',
'metrics': {'equivalence_score': {'score': 0.873}},
'job_metadata': {'num_predictions': 200,
'num_failed_predictions': 0,
'num_successful_predictions': 200,
'num_processed_predictions': 0},
'dataset_name': 'eval-results-101',
'model_name': 'registry/tuned-model-101',
'tune_job_id': 'e44661f0-bagb-4919-b0df-bada36a31'})
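Because tuning takes a while, it is often convenient to poll the metadata endpoint until the job finishes. Here is a minimal polling sketch, reusing the client, agent_id, and tune_response from the earlier steps (the 30-second interval and retry cap are arbitrary choices):

import time

# Poll the tune job until it reports 'completed' (or give up after ~1 hour).
for _ in range(120):
    job = client.agents.tune.jobs.metadata(
        agent_id=agent_id,
        job_id=tune_response.id,
    )
    print(f"Job status: {job.job_status}")
    if job.job_status == "completed":
        print(f"Tuned model ID: {job.model_id}")
        break
    time.sleep(30)  # wait before checking again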
We store your row-by-row evaluation results in a Dataset whose name you can find in the response payload (in this example, it is eval-results-101). You can use the /datasets/evaluate API to inspect your row-by-row results.
Step 3: Activate the tuned model
Before you can use the tuned model, you need to activate it for your Agent. You can do so by editing the configuration of your Agent and passing in the tuned model_id. Currently, we allow a maximum of 3 tuned models to be activated per tenant.
client.agents.update(
    agent_id=agent_id,     # replace
    llm_model_id=model_id  # replace (e.g. 'registry/tuned-model-101')
)
print("Agent updated with tuned model")
curl --request 'PUT' \
--url 'https://api.contextual.ai/v1/agents/{agent_id}' \
--header 'accept: application/json' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer $API_KEY' \
--data '{
"llm_model_id": "$model_id"
}'
# Replace {agent_id}, $API_KEY, and $model_id with the ID of your tuned model (example: registry/tuned-model-101)
The activation might take a moment to complete.
Step 4: Query your tuned model!
After you have activated the tuned model, you can query it with the usual command. The tuned model is automatically set as the default for your Agent, so you don’t need to fill in the llm_model_id field.
query_result = client.agents.query.create(
    agent_id=agent_id,  # replace
    messages=[{
        "content": "what was the sales for Apple in 2022",  # replace
        "role": "user"
    }]
)
print(query_result.message.content)
curl --request POST \
--url https://api.contextual.ai/v1/agents/{agent_id}/query \
--header 'accept: application/json' \
--header 'authorization: Bearer $API_KEY' \
--header 'content-type: application/json' \
--data '{
"stream": false,
"messages": [
{
"role": "user",
"content": "What is the revenue of Apple?"
}
]
}
'
# Replace {agent_id} and $API_KEY
Step 5: Deactivate your tuned model (if needed)
We support a maximum of 3 activated models per tenant. As such, you might want to deactivate an existing tuned model to make way for another. To do so, edit the config of the Agent where that existing model is activated and change the llm_model_id to "default".
client.agents.update(
    agent_id=agent_id,  # replace
    llm_model_id="default"
)
print("Tuned model deactivated.")
curl --request 'PUT' \
--url 'https://api.contextual.ai/v1/agents/{agent_id}' \
--header 'accept: application/json' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer $API_KEY' \
--data '{
"llm_model_id": "default"
}'
# Replace {agent_id} and $API_KEY
Evaluation
Evaluation assesses your model’s performance and identifies areas of weakness. It can create a positive feedback loop with Tuning for continuous improvement.
Our Evaluation endpoints allow you to evaluate your Agent using a set of prompts (questions) and reference (gold) answers. We currently support two metrics: equivalence and groundedness.
- Equivalence: Uses a Language Model as a judge to evaluate whether the Agent’s response is equivalent to the reference (gold-standard) answer.
- Groundedness: Decomposes the Agent’s response into claims, and then uses a Language Model to evaluate whether the claims are grounded in the retrieved documents.
Step 1: Create an evaluation job
To create an evaluation job, you need to provide an evaluation set. The file(s) must be in CSV format with two required fields:
- prompt: The question you are asking the Agent.
- reference: The gold-standard response.
{
"prompt": "What was the sales of Apple at the end of Q3 2022?",
"reference": "Apple's sales was 100 million in the quarter ending Aug 31, 2022."
}
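If you are assembling the evaluation set programmatically, a short script can write a CSV with the two required columns. The rows and the evalset.csv filename below are placeholders:

import csv

# Hypothetical evaluation rows; replace with your own questions and gold answers.
rows = [
    {
        "prompt": "What was the sales of Apple at the end of Q3 2022?",
        "reference": "Apple's sales was 100 million in the quarter ending Aug 31, 2022.",
    },
]

# Write a CSV with the two required columns: prompt and reference.
with open("evalset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["prompt", "reference"])
    writer.writeheader()
    writer.writerows(rows)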
Use the following command to create your evaluation job. You will need to pass in your agent_id and the file_path to your evaluation set.
with open(evalset_path, 'rb') as eval_file:  # replace file path
    eval_response = client.agents.evaluate.create(
        agent_id=agent_id,  # replace
        metrics=["equivalence", "groundedness"],  # choose the metrics you want
        evalset_file=eval_file,
        llm_model_id=response.model_name  # optional; only include if evaluating a tuned model that hasn't been activated ('response' is the tune job metadata from Step 2)
    )
print(f"Eval job created. ID: {eval_response.id}")
curl --request 'POST' \
--url 'https://api.contextual.ai/v1/agents/{agent_id}/evaluate' \
--header 'accept: application/json' \
--header 'Content-Type: multipart/form-data' \
--header 'Authorization: Bearer $API_KEY' \
--form 'metrics[]=equivalence' \
--form 'metrics[]=groundedness' \
--form 'evalset_file=@{$file_path};type=text/csv'
# Replace {agent_id}, $API_KEY, {$file_path}
# Choose which metrics you want (equivalence or groundedness) by deleting the corresponding line
If the command is successful, you’ll be returned a job_id for the evaluation job. Keep in mind that evaluation will take some time to complete.
Step 2: Check the status of your evaluation job
You can check the status of your evaluation job by passing in the agent_id and job_id.
eval_status = client.agents.evaluate.jobs.metadata(
    agent_id=agent_id,   # replace
    job_id=eval_job_id   # replace
)
eval_status
curl --request 'GET' \
--url 'https://api.contextual.ai/v1/agents/{agent_id}/evaluate/jobs/{job_id}/metadata' \
--header 'accept: application/json' \
--header 'Authorization: Bearer $API_KEY'
# Replace {agent_id}, {job_id}, $API_KEY
When the job is completed, the status will change from processing to completed. The response payload will contain evaluation metrics and metadata. Here is what the response payload looks like:
{
"status": "completed",
"metrics": {
"equivalence_score": { "score": 0.762 },
"groundedness_score": { "score": 0.829 }
},
"job_metadata": {
"num_predictions": 150,
"num_failed_predictions": 0,
"num_successful_predictions": 150
},
"dataset_name": "evalresults-101"
}
The metrics field contains the evaluation scores obtained by the model. job_metadata contains information on the evaluation performed, including the number of successful/failed predictions and the Dataset where your row-by-row evaluation results are stored.
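Once the status is completed, the aggregate scores can be read directly from the metadata response. A minimal sketch, assuming the eval_status object from Step 2 exposes the fields shown in the JSON payload above (status, metrics, dataset_name); attribute names may differ in your SDK version:

# Inspect the aggregate metrics once the job has completed.
if eval_status.status == "completed":
    print(eval_status.metrics)       # e.g. {'equivalence_score': {'score': 0.762}, ...}
    print(eval_status.dataset_name)  # Dataset holding the row-by-row results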
Step 3: View your evaluation results
In Step 2, you will get a dataset_name when your evaluation job has completed. In the example above, this is evalresults-101. You can then view your raw evaluation results (equivalence and/or groundedness scores for each question-response pair) with the /datasets/evaluate endpoint.
eval_results = client.agents.datasets.evaluate.retrieve(
    dataset_name=eval_status.dataset_name,  # replace
    agent_id=agent_id                       # replace
)
curl --request GET \
--url https://api.contextual.ai/v1/agents/{agent_id}/datasets/evaluate/{dataset_name} \
--header 'accept: application/octet-stream' \
--header 'authorization: Bearer $API_KEY'
# Remember to replace {agent_id}, {dataset_name}, $API_KEY
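If you prefer to hit the raw HTTP endpoint (mirroring the curl call above), the following sketch uses the requests library to download the results file and save it locally. The eval_results.dat filename is a placeholder, and the exact file format of the returned dataset may vary:

import requests

# Mirror the curl request above and save the raw dataset to disk.
dataset_name = eval_status.dataset_name  # or the Dataset name from Step 2
url = f"https://api.contextual.ai/v1/agents/{agent_id}/datasets/evaluate/{dataset_name}"
headers = {
    "accept": "application/octet-stream",
    "authorization": f"Bearer {CONTEXTUAL_API_KEY}",
}
resp = requests.get(url, headers=headers)
resp.raise_for_status()

with open("eval_results.dat", "wb") as f:  # placeholder filename
    f.write(resp.content)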
For a guided walkthrough of the evaluation process, see the Colab Notebook linked at the top of the page.
LMUnit
In addition to the standard evaluation methods described above, the Contextual Platform provides another method of evaluation via natural language unit tests using the /lmunit endpoint. To understand the use cases of /lmunit, please read our blog post.
LMUnit enables targeted evaluation of specific criteria that you care about, allowing you to evolve and optimize your Agents.
Step 1: Define your evaluation criteria
Come up with criteria for what constitutes a good response in the context of your agent. The criteria can be about the content, form, or style of a response:
- Is there a style or tone you want the agent to maintain?
- Do good answers exhibit a specific reasoning pattern?
- Should responses cover specific content areas?
Step 2: Create testable statements
Translate your criteria into specific, clear, and testable statements or questions. For example:
- Does the response maintain a professional style?
- Does the response impartially cover different opinions or perspectives that exist on a question?
- Does the response mention US Federal Reserve policy if the question is about the rate of inflation?
Step 3: Use the LMUnit endpoint
Use the /lmunit endpoint to evaluate a given query-response pair. Scores are reported on a 1-5 scale, with higher scores indicating better satisfaction of the test criteria.
# Python example for LMUnit
response = client.lmunit.create(
    query="I remember learning about an ex prisoner who was brought to America to help train the soldiers. But the details escape me. Can anyone provide details to who he was?",
    response="Those clues are kind of vague, but one possible candidate might be Casimir Pulaski. He was an effective cavalry officer who was embroiled in the chaos of Poland in the later 18th c. and fought on a losing side, but while he was tried and condemned and his possessions confiscated, he'd fled to France by then. So, \"ex prisoner\" is not quite correct. But he did indeed help train American cavalry—and irritated quite a few who served with him with his imperious manner. If you heard about him in the US, it might be because there are a lot of towns named after him, and he became quite a popular hero to later Polish-Americans. Pienkos, A. (1976). A Bicentennial Look at Casimir Pulaski: Polish, American and Ethnic Folk Hero. Polish American Studies, 33(1), 5–17. http://www.jstor.org/stable/20147942",
    unit_test="Is the response helpful and aligned with the spirit of what the prompt was asking for?"
)
print(f"LMUnit score: {response.score}")
curl --request POST \
--url https://api.contextual.ai/v1/lmunit \
--header 'accept: application/json' \
--header 'authorization: Bearer $API_KEY' \
--header 'content-type: application/json' \
--data '
{
"query": "I remember learning about an ex prisoner who was brought to America to help train the soldiers. But the details escape me. Can anyone provide details to who he was?",
"response": "Those clues are kind of vague, but one possible candidate might be Casimir Pulaski. He was an effective cavalry officer who was embroiled in the chaos of Poland in the later 18th c. and fought on a losing side, but while he was tried and condemned and his possessions confiscated, he'd fled to France by then. So, \"ex prisoner\" is not quite correct. But he did indeed help train American cavalry—and irritated quite a few who served with him with his imperious manner. If you heard about him in the US, it might be because there are a lot of towns named after him, and he became quite a popular hero to later Polish-Americans. Pienkos, A. (1976). A Bicentennial Look at Casimir Pulaski: Polish, American and Ethnic Folk Hero. Polish American Studies, 33(1), 5–17. http://www.jstor.org/stable/20147942",
"unit_test": "Is the response helpful and aligned with the spirit of what the prompt was asking for?"
}
'
Note: Remember to:
- Replace $API_KEY with your API key.
- Replace the query, response, and unit_test fields with a query-response pair generated from your app and your test criteria.
- Truncate lengthy responses if needed to stay within the current character limitations for this API.
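To score several unit tests against the same query-response pair, you can simply loop over lmunit calls. A minimal sketch, reusing the client from earlier; the unit tests and the placeholder query/response strings below are illustrative only:

# Hypothetical unit tests drawn from your Step 2 criteria.
unit_tests = [
    "Does the response maintain a professional style?",
    "Does the response impartially cover different opinions or perspectives?",
]

query = "..."          # replace with a query from your app
response_text = "..."  # replace with your Agent's response

for test in unit_tests:
    result = client.lmunit.create(
        query=query,
        response=response_text,
        unit_test=test,
    )
    print(f"{test} -> {result.score:.1f}")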
If your request is successful, you’ll receive a score that indicates how well your response meets the specified criteria.
🎉 That was a quick spin through our tune and eval endpoints! To learn more about our APIs and their capabilities, visit docs.contextual.ai. We look forward to seeing what you build with our platform 🏗️