Tuning and Evaluation
Learn how to tune and evaluate your agent
This guide covers the next steps of tuning and evaluating your Agent. Make sure you’ve gone through the Beginner’s Guide first.
For a hands-on experience, go through our Colab Notebook:
Open Notebook in Colab
Tune
Fine-tuning an agent on domain-specific data helps the agent learn where to focus its attention and how to better interpret domain-specific information. When combined, the approaches are mutually reinforcing: RAG provides the up-to-date knowledge, while fine-tuning optimizes how that knowledge is processed and applied.
Fine-tuning RAG Agents has proven very effective across our customer deployments and research benchmarks. We provide a powerful set of APIs that enable you to fine-tune Contextual RAG Agents on your data.
Step 1: Create a tune job
To create a tune job, you need to provide a training file and an (optional) test file. The file(s) must be in `JSONL` or `CSV` format with four required fields (see below). If no test file is provided, the API will automatically perform a train-test split on the training file. Your data files must have the following fields:
- The question you are asking the Agent.
- The gold-standard response.
- The list of retrieved chunks used to generate the gold-standard response.
- Guidelines for the Agent's output. You can write a few lines on what you expect the Agent's response to look like, which will help guide its learning process. If you do not have any special guidelines, you can just use the System Prompt in your agent configuration.
You can then run the following command:
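(A rough sketch using Python's `requests` library; the base URL, the `/agents/{agent_id}/tune` path, the `training_file`/`test_file` multipart field names, and the `id` response field are assumptions here, so check the API reference for the exact request shape.)

```python
import os
import requests

API_KEY = os.environ["API_KEY"]             # your Contextual API key
BASE_URL = "https://api.contextual.ai/v1"   # assumed base URL
agent_id = "your-agent-id"

# Create a tune job by uploading a local training file.
# Add a "test_file" entry to `files` if you have a separate test set.
with open("training_data.jsonl", "rb") as train_f:
    resp = requests.post(
        f"{BASE_URL}/agents/{agent_id}/tune",            # assumed endpoint path
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"training_file": train_f},
    )
resp.raise_for_status()
job_id = resp.json()["id"]                  # assumed field holding the tune job_id
print("Tune job created:", job_id)
```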
If the command is successful, you'll receive a `job_id` for the tune job. Keep in mind that tuning will take some time to complete.
After creating a tune job with locally uploaded files, the train and test files will automatically be saved as Datasets for future use. When creating a future tune job, you can simply use the Dataset names as your input parameters. You can manage these Datasets with the `/datasets/tune` and `/datasets/evaluation` endpoints.
Step 2: Check the status of the tune job
You can check the status of the tune job by passing in the `agent_id` and `job_id`.
When the job is completed, the status will change from `processing` to `completed`. The response payload will contain the tuned `model_id`, which you will use to activate your tuned model, as well as evaluation data from the test split of your tuning dataset. Here is what the response payload looks like:
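(The status check below is a sketch with the same assumptions as the earlier example, and the commented fields are illustrative guesses based on the description in this section rather than the exact schema.)

```python
import os
import requests

API_KEY = os.environ["API_KEY"]
BASE_URL = "https://api.contextual.ai/v1"   # assumed base URL
agent_id = "your-agent-id"
job_id = "your-tune-job-id"

resp = requests.get(
    f"{BASE_URL}/agents/{agent_id}/tune/jobs/{job_id}/metadata",  # assumed endpoint path
    headers={"Authorization": f"Bearer {API_KEY}"},
)
resp.raise_for_status()
job = resp.json()

# Illustrative contents (exact field names may differ):
#   job status           -> "processing" or "completed"
#   tuned model_id       -> used to activate the model in Step 3
#   evaluation results   -> metrics from the test split, plus the Dataset name
#                           holding row-by-row results (e.g. eval-results-101)
print(job)
```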
We store your row-by-row evaluation results in a Dataset that you can find in the response payload (in this example, it is `eval-results-101`). You can use the `/datasets/evaluate` API to inspect your row-by-row results.
Step 3: Activate the tuned model
Before you can use the tuned model, you need to activate it for your Agent. You can do so by editing the configuration of your Agent and passing in the tuned `model_id`. Currently, we allow a maximum of 3 tuned models to be activated per tenant. The activation might take a moment to complete.
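(A minimal sketch of the activation step, assuming the Agent is edited via a PUT to `/agents/{agent_id}` with an `llm_model_id` field; the exact edit endpoint and body are in the API reference.)

```python
import os
import requests

API_KEY = os.environ["API_KEY"]
BASE_URL = "https://api.contextual.ai/v1"   # assumed base URL
agent_id = "your-agent-id"
tuned_model_id = "your-tuned-model-id"      # the model_id returned by the tune job

# Point the Agent's configuration at the tuned model to activate it.
resp = requests.put(
    f"{BASE_URL}/agents/{agent_id}",        # assumed endpoint path for editing an Agent
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"llm_model_id": tuned_model_id},
)
resp.raise_for_status()
print("Activation requested for", tuned_model_id)
```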
Step 4: Query your tuned model!
After you have activated the tuned model, you can query it with the usual command. The tuned model is automatically set as the default for your Agent, so you don't have to fill in the `llm_model_id` field.
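(A sketch of the usual query command; the `/agents/{agent_id}/query` path and the `messages` body field are assumptions.)

```python
import os
import requests

API_KEY = os.environ["API_KEY"]
BASE_URL = "https://api.contextual.ai/v1"   # assumed base URL
agent_id = "your-agent-id"

# Query the Agent as usual; the activated tuned model is used by default,
# so no llm_model_id needs to be passed in the request.
resp = requests.post(
    f"{BASE_URL}/agents/{agent_id}/query",  # assumed endpoint path
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"messages": [{"role": "user", "content": "What does the latest filing say about revenue?"}]},
)
resp.raise_for_status()
print(resp.json())
```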
Step 5: Deactivate your tuned model (if needed)
We support a maximum of 3 activated models per tenant. As such, you might want to deactivate an existing tuned model to make way for another. To do so, edit the config of the Agent where that existing model is activated and change the `llm_model_id` to `"default"`.
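(This reuses the same assumed edit endpoint as Step 3, just with `"default"` as the model.)

```python
import os
import requests

API_KEY = os.environ["API_KEY"]
BASE_URL = "https://api.contextual.ai/v1"   # assumed base URL
agent_id = "agent-with-the-model-activated"

# Switch the Agent back to the default model, freeing up an activation slot.
resp = requests.put(
    f"{BASE_URL}/agents/{agent_id}",        # assumed endpoint path
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"llm_model_id": "default"},
)
resp.raise_for_status()
```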
Evaluation
Evaluation assesses your model's performance and identifies areas of weakness. It can create a positive feedback loop with Tuning for continuous improvement.
Our Evaluation endpoints allow you to evaluate your Agent using a set of `prompts` (questions) and `reference` (gold) answers. We currently support two metrics: `equivalence` and `groundedness`.
- Equivalence: Uses a Language Model as a judge to evaluate if the Agent's response is equivalent to the Reference (or gold-standard) answer.
- Groundedness: Decomposes the Agent's response into claims, and then uses a Language Model to evaluate if the claims are grounded in the retrieved documents.
Step 1: Create an evaluation job
To create an evaluation job, you need to provide an evaluation set. The file must be in `CSV` format with two required fields:
- The question you are asking the Agent.
- The gold-standard response.
Use the following command to create your evaluation job. You will need to pass in your `agent_id` and the `file_path` to your evaluation set.
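(A rough sketch with the same assumptions as the tune example: the `/agents/{agent_id}/evaluate` path, the `evalset_file` multipart field name, and the `id` response field are guesses to be checked against the API reference.)

```python
import os
import requests

API_KEY = os.environ["API_KEY"]
BASE_URL = "https://api.contextual.ai/v1"   # assumed base URL
agent_id = "your-agent-id"
file_path = "eval_set.csv"                  # CSV with the two fields described above

# Create an evaluation job by uploading the evaluation set.
with open(file_path, "rb") as eval_f:
    resp = requests.post(
        f"{BASE_URL}/agents/{agent_id}/evaluate",        # assumed endpoint path
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"evalset_file": eval_f},                  # assumed multipart field name
    )
resp.raise_for_status()
job_id = resp.json()["id"]                  # assumed field holding the evaluation job_id
print("Evaluation job created:", job_id)
```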
If the command is successful, you'll receive a `job_id` for the evaluation job. Keep in mind that evaluation will take some time to complete.
Step 2: Check the status of your evaluation job
You can check the status of your evaluation job by passing in the `agent_id` and `job_id`.
When the job is completed, the status will change from `processing` to `completed`. The response payload will contain evaluation metrics and metadata. Here is what the response payload looks like:
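(As with the tune job, the status check below is a sketch, and the commented fields are illustrative rather than the exact schema.)

```python
import os
import requests

API_KEY = os.environ["API_KEY"]
BASE_URL = "https://api.contextual.ai/v1"   # assumed base URL
agent_id = "your-agent-id"
job_id = "your-evaluation-job-id"

resp = requests.get(
    f"{BASE_URL}/agents/{agent_id}/evaluate/jobs/{job_id}/metadata",  # assumed endpoint path
    headers={"Authorization": f"Bearer {API_KEY}"},
)
resp.raise_for_status()
job = resp.json()

# Illustrative contents (exact field names may differ):
#   status        -> "processing" or "completed"
#   metrics       -> equivalence / groundedness scores
#   job_metadata  -> successful/failed prediction counts and the Dataset name
#                    holding row-by-row results
print(job)
```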
The `metrics` field contains the evaluation scores obtained by the model. The `job_metadata` field contains information on the evaluation performed, including the number of successful/failed predictions and the Dataset where your row-by-row evaluation results are stored.
Step 3: View your evaluation results
In Step 2, you will get a `dataset_name` when your evaluation job has completed. In the example above, this is `evalresults-101`. You can then view your raw evaluation results (equivalence and/or groundedness scores for each question-response pair) with the `/datasets/evaluate` endpoint.
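(A sketch of fetching the raw results; the `/agents/{agent_id}/datasets/evaluate/{dataset_name}` path is an assumption, and the same kind of call applies to the tune results Dataset mentioned earlier.)

```python
import os
import requests

API_KEY = os.environ["API_KEY"]
BASE_URL = "https://api.contextual.ai/v1"   # assumed base URL
agent_id = "your-agent-id"
dataset_name = "evalresults-101"            # dataset_name from the completed evaluation job

resp = requests.get(
    f"{BASE_URL}/agents/{agent_id}/datasets/evaluate/{dataset_name}",  # assumed endpoint path
    headers={"Authorization": f"Bearer {API_KEY}"},
)
resp.raise_for_status()
print(resp.text)   # row-by-row results, e.g. per-question equivalence/groundedness scores
```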
For a guided walkthrough of the evaluation process, see the Colab Notebook linked at the top of the page.
LMUnit
In addition to the standard evaluation methods described above, the Contextual Platform also provides another method of evaluation via natural language unit tests using the `/lmunit` endpoint. To understand the use cases of `/lmunit`, please read our blog post.
LMUnit enables targeted evaluation of specific criteria that you care about, allowing you to evolve and optimize your Agents.
Step 1: Define your evaluation criteria
Come up with criteria for what constitutes a good response in the context of your agent. The criteria can be about the content, form, or style of a response:
- Is there a style or tone you want the agent to maintain?
- Do good answers exhibit a specific reasoning pattern?
- Should responses cover specific content areas?
Step 2: Create testable statements
Translate your criteria into specific, clear, and testable statements or questions. For example:
- Does the response maintain a professional style?
- Does the response impartially cover different opinions or perspectives that exist on a question?
- Does the response mention US Federal Reserve policy if the question is about the rate of inflation?
Step 3: Use the LMUnit endpoint
Use the `/lmunit` endpoint to evaluate a given query-response pair. Scores are reported on a 1-5 scale, with higher scores indicating better satisfaction of the test criteria.
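(The `query`, `response`, and `unit_test` fields below come from this guide; the base URL and the exact response shape are assumptions, and the example values are placeholders to replace with your own data.)

```python
import os
import requests

API_KEY = os.environ["API_KEY"]             # your Contextual API key ($API_KEY)
BASE_URL = "https://api.contextual.ai/v1"   # assumed base URL

resp = requests.post(
    f"{BASE_URL}/lmunit",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "query": "What is the current rate of inflation?",              # your query
        "response": "Inflation is currently around 3 percent...",       # your app's response
        "unit_test": "Does the response mention US Federal Reserve policy "
                     "if the question is about the rate of inflation?", # your test criterion
    },
)
resp.raise_for_status()
print(resp.json())   # contains a 1-5 score for how well the response meets the criterion
```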
Note: Remember to:
- Replace `$API_KEY` with your API key.
- Replace the `query`, `response`, and `unit_test` fields with a query-response pair generated from your app and your test criteria.
- You may need to truncate lengthy responses to keep within the current character limitations for this API.
If your request is successful, you’ll receive a score that indicates how well your response meets the specified criteria.
🎉 That was a quick spin through our tune and eval endpoints! To learn more about our APIs and their capabilities, visit docs.contextual.ai. We look forward to seeing what you build with our platform 🏗️