In the Beginner’s Guide, we went through the process of creating an API Key, creating a Datastore and ingesting documents, creating an Agent, and querying the Agent. This guide covers the next steps of tuning and evaluating your Agent. Make sure you’ve gone through all the steps in the Beginner’s Guide first.

Tune and Evaluation (using curl commands)

Tune

We've created a powerful set of APIs that enable you to specialize Contextual RAG Agents to your data. Tuning often leads to significant improvements in performance for your specific use cases.

  1. Create a tune job

To create a tune job, you need a training file and can optionally provide a test file. If no test file is provided, the API will automatically perform a train-test split on the training file.

The API expects the data to be in JSON format with four required fields: guideline, prompt, reference, and knowledge. See the API docs for an explanation of each of these fields. Here is a dummy example of what a tune set should look like.

[
  {
    "guideline": "The answer should be accurate.",
    "prompt": "What was last quarter's revenue?",
    "reference": "According to recent reports, the Q3 revenue was $1.2 million, a 0.1 million increase from Q2.",
    "knowledge": [
        "Quarterly report: Q3 revenue was $1.2 million.",
        "Quarterly report: Q2 revenue was $1.1 million.",
        ...
    ]
  },
  ...
]
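
Before uploading, it can help to sanity-check the training file. The snippet below is a minimal sketch, assuming your data lives in a file named training_data.json and that you have the jq CLI installed; it simply verifies that every record contains the four required fields.

jq -e 'all(.[]; has("guideline") and has("prompt") and has("reference") and has("knowledge"))' \
  training_data.json && echo "All records contain the required fields"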

Use the following command to create a tune job. You will need to pass in the agent_id and file_path for your training file. If you do not provide a model_id, we will automatically use the Agent’s default model.

curl --request 'POST' \
  --url 'https://api.contextual.ai/v1/agents/{agent_id}/tune' \
  --header 'accept: application/json' \
  --header 'Content-Type: multipart/form-data' \
  --header 'Authorization: Bearer $API_KEY' \
  --form 'training_file=@{$file_path}' 

Note: Remember to:

  • Replace {agent_id} with the ID of your Agent.
  • Replace $API_KEY with your own API key.
  • Replace {$file_path} with the path to your JSON training file.

If the command is successful, the response will include a job_id for the tune job. Keep in mind that tuning will take some time to complete.
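
For convenience, you can capture the returned job ID in a shell variable for the next steps. The sketch below assumes you have exported API_KEY, AGENT_ID, and FILE_PATH as shell variables and have jq installed; the id field name is an assumption, so check it against the response you actually receive.

JOB_ID=$(curl -s --request 'POST' \
  --url "https://api.contextual.ai/v1/agents/$AGENT_ID/tune" \
  --header 'accept: application/json' \
  --header 'Content-Type: multipart/form-data' \
  --header "Authorization: Bearer $API_KEY" \
  --form "training_file=@$FILE_PATH" | jq -r '.id')
echo "Started tune job: $JOB_ID"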

  2. Check the status of the tune job

You can check the status of the tune job by passing in the agent_id and job_id. When the job is complete, the status will change from processing to completed. The response payload will also contain the tuned model_id and the evaluation_results of the tuned model.

curl --request 'GET' \
  --url 'https://api.contextual.ai/v1/agents/{agent_id}/tune/jobs/{job_id}/metadata' \
  --header 'accept: application/json' \
  --header 'Authorization: Bearer $API_KEY'

Note: Remember to:

  • Replace {agent_id} with the ID of your Agent.
  • Replace {job_id} with the ID of your tuning job.
  • Replace $API_KEY with your own API key.
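
Rather than re-running the command by hand, you can poll the endpoint until the job finishes. The loop below is a rough sketch, assuming API_KEY, AGENT_ID, and JOB_ID are exported as shell variables and jq is installed; the job_status field name is an assumption, so verify it against the actual response.

while true; do
  STATUS=$(curl -s --request 'GET' \
    --url "https://api.contextual.ai/v1/agents/$AGENT_ID/tune/jobs/$JOB_ID/metadata" \
    --header 'accept: application/json' \
    --header "Authorization: Bearer $API_KEY" | jq -r '.job_status')
  echo "Tune job status: $STATUS"
  [ "$STATUS" = "completed" ] && break
  sleep 60   # tuning can take a while, so poll once a minute
done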

  3. Deploy the tuned model

Before you can use the tuned model, you need to deploy it to your Agent. You can do so by editing the configuration of your Agent and passing in the tuned model_id. Currently, we only allow a single fine-tuned model to be deployed per tenant. Please see the API docs for more information.

curl --request 'PUT' \
  --url 'https://api.contextual.ai/v1/agents/{agent_id}' \
  --header 'accept: application/json' \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer $API_KEY' \
  --data '{
  "llm_model_id": "$model_id"
}'

Note: Remember to:

  • Replace {agent_id} with the ID of your Agent.
  • Replace $API_KEY with your own API key.
  • Replace $model_id with the ID of your tuned model.

The deployment might take a moment to complete.

  4. Query your tuned model!

After you have deployed the tuned model, you can query it with the usual command. Make sure you pass in your new tuned model_id.

curl --request POST \
     --url https://api.contextual.ai/v1/agents/{agent_id}/query \
     --header 'accept: application/json' \
     --header 'authorization: Bearer $API_KEY' \
     --header 'content-type: application/json' \
     --data '
{
  "stream": false,
  "llm_model_id": $model_id,
  "messages": [
    {
      "role": "user",
      "content": "What is the revenue of Apple?"
    }
  ]
}
'

Note: Remember to:

  • Replace {agent_id} with the ID of your Agent.
  • Replace $API_KEY with your own API key.
  • Replace $model_id with the ID of your tuned model.

Evaluation

Evaluation endpoints allow you to evaluate your Agent using a set of prompts (questions) and reference (gold) answers. We support two metrics: equivalence and groundedness.

  • The first metric ("equivalence") evaluates whether the Agent's response is equivalent to the ground-truth answer (model-driven binary classification).
  • The second metric ("groundedness") decomposes the Agent's response into claims and then evaluates whether those claims are grounded in the retrieved documents.

  1. Create an evaluation job

You will need to provide evaluation data, which you can do in two ways: (i) by uploading an evalset_file as a CSV, or (ii) by creating an eval Dataset through the Dataset API. We will focus on (i) here, but you can read about (ii) in our API Docs.

The API expects the data to be in CSV format with two required columns: prompt and reference. prompt is the question, while reference is the correct ground truth answer. See the API docs for an explanation of each of these fields. Here is a dummy example of what an eval set should look like.

prompt,reference
"What was the sales of Apple at the end of Q3 2022?","Apple's sales was 100 million in the quarter ending Aug 31, 2022."

Use the following command to create your evaluation job. You will need to pass in your agent_id and file_path to your evaluation set. In the example below, we are evaluating on both equivalence and groundedness, but you can choose to evaluate only one of them.

curl --request 'POST' \
  --url 'https://api.contextual.ai/v1/agents/{agent_id}/evaluate' \
  --header 'accept: application/json' \
  --header 'Content-Type: multipart/form-data' \
  --header 'Authorization: Bearer $API_KEY' \
  --form 'metrics[]=equivalence' \
  --form 'metrics[]=groundedness' \
  --form 'evalset_file=@{$file_path};type=text/csv'

Note: Remember to:

  • Replace {agent_id} with the ID of your Agent.
  • Replace $API_KEY with your own API key.
  • Choose which metrics you want (equivalence, groundedness).
  • Replace {$file_path} with the path to your evaluation CSV.

  2. Check the status of your evaluation job

You can use the following command to check the status of your evaluation job, passing in your agent_id and evaluation job_id. When the evaluation job has completed, you will see your evaluation metrics, job_metadata, and the dataset_name where your eval metrics and row-by-row results are stored (you will need to use the /datasets API to view this dataset).

curl --request 'GET' \
  --url 'https://api.contextual.ai/v1/agents/{agent_id}/evaluate/jobs/{job_id}/metadata' \
  --header 'accept: application/json' \
  --header 'Authorization: Bearer $API_KEY'

Note: Remember to:

  • Replace {agent_id} with the ID of your Agent.
  • Replace {job_id} with the ID of your evaluation job.
  • Replace $API_KEY with your own API key.
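
As with the tune job, you can pull the dataset name out of the metadata response once the job finishes. The sketch below assumes API_KEY, AGENT_ID, and JOB_ID are exported as shell variables and jq is installed; dataset_name is the field described above, but verify the exact key in your own response.

DATASET_NAME=$(curl -s --request 'GET' \
  --url "https://api.contextual.ai/v1/agents/$AGENT_ID/evaluate/jobs/$JOB_ID/metadata" \
  --header 'accept: application/json' \
  --header "Authorization: Bearer $API_KEY" | jq -r '.dataset_name')
echo "Evaluation results stored in dataset: $DATASET_NAME"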

  3. View your evaluation results

When your evaluation job has completed, the metadata response from Step 2 includes a dataset_name. You can then view your raw evaluation results (equivalence and/or groundedness scores for each question-response pair) with the /datasets endpoint, using the following command:

curl --request GET \
     --url https://api.contextual.ai/v1/agents/{agent_id}/datasets/evaluate/{dataset_name} \
     --header 'accept: application/octet-stream' \
     --header 'authorization: Bearer $API_KEY'

Note: Remember to:

  • Replace {agent_id} with the ID of your Agent.
  • Replace {dataset_name} with the name of your evaluation Dataset (from Step 2).
  • Replace $API_KEY with your own API key.
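
Because the endpoint returns a raw byte stream, it is usually easiest to write the output to a local file and inspect it there. A minimal sketch, assuming API_KEY, AGENT_ID, and DATASET_NAME are exported as shell variables (the output file name is just an example):

curl -s --request GET \
  --url "https://api.contextual.ai/v1/agents/$AGENT_ID/datasets/evaluate/$DATASET_NAME" \
  --header 'accept: application/octet-stream' \
  --header "authorization: Bearer $API_KEY" \
  --output eval_results.out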

LMUnit

In the previous section, we covered how to evaluate your Agent with a curated eval set. The Contextual Platform also provides another method of evaluation: natural language unit tests via the /lmunit endpoint. To understand the use cases of /lmunit, please read our blog post.

Follow these steps to use the /lmunit endpoint.

  1. Come up with criteria for what constitutes a good response in the context of your agent. The criteria can be about the content, form, or style of a response. Is there a style or tone you want the agent to maintain? Do good answers exhibit a specific reasoning pattern or cover specific content?
  2. Translate one of these criteria into a specific, clear, and testable statement or question. For example:
    1. Does the response maintain a professional style?
    2. Does the response impartially cover different opinions or perspectives that exist on a question?
    3. Does the response mention US Federal Reserve policy if the question is about the rate of inflation?
  3. Use the /lmunit endpoint to evaluate a given query-response pair. Scores are reported on a 1-5 scale, with higher scores indicating better satisfaction of the test criteria.
curl --request POST \
     --url https://api.contextual.ai/v1/lmunit \
     --header 'accept: application/json' \
     --header 'authorization: Bearer $API_KEY' \
     --header 'content-type: application/json' \
     --data '
{
  "query": "I remember learning about an ex prisoner who was brought to America to help
  train the soldiers. But the details escape me. Can anyone provide details to who he was?",
  
  "response": "Those clues are kind of vague, but one possible candidate might be
  Casimir Pulaski. He was an effective cavalry officer who was embroiled in the chaos of
  Poland in the later 18th c. and fought on a losing side, but while he was tried and condemned
  and his possessions confiscated, he’d fled to France by then. So, “ex prisoner” is not quite
  correct. But he did indeed help train American cavalry—and irritated quite a few who served with
  him with his imperious manner. If you heard about him in the US, it might be because there are a
  lot of towns named after him, and he became quite a popular hero to later Polish-Americans.
  Pienkos, A. (1976). A Bicentennial Look at Casimir Pulaski: Polish, American and Ethnic Folk Hero.
  Polish American Studies, 33(1), 5–17. http://www.jstor.org/stable/20147942",
  
  "unit_test": "Is the response helpful and aligned with the spirit of what the prompt
  was asking for?"
}
'

Note: Remember to:

  • Replace $API_KEY with your API key.
  • Replace the query, response, and unit_test fields with a query-response pair generated from your app and your test criteria.
    • You may need to truncate lengthy responses to keep within the current character limitations for this API.

If your request is successful, you'll receive a score. LMUnit enables targeted evaluation of criteria that you care about, enabling you to evolve and optimize your Agents.
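
Since each /lmunit call evaluates one query-response pair against one unit test, a small loop is handy once you have several criteria. The sketch below is illustrative only: it assumes API_KEY, QUERY, and RESPONSE are exported as shell variables, that your criteria live one per line in a file named unit_tests.txt, and that jq is installed (used here just to build the JSON body safely).

while IFS= read -r TEST; do
  curl -s --request POST \
    --url https://api.contextual.ai/v1/lmunit \
    --header 'accept: application/json' \
    --header "authorization: Bearer $API_KEY" \
    --header 'content-type: application/json' \
    --data "$(jq -n --arg q "$QUERY" --arg r "$RESPONSE" --arg t "$TEST" \
      '{query: $q, response: $r, unit_test: $t}')"
  echo
done < unit_tests.txt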

🎉 That was a quick spin-through our tune and eval endpoints! To learn more about our APIs and their capabilities, visit docs.contextual.ai. We look forward to seeing what you build with our platform 🏗️