Learn how to tune and evaluate your agent
Training files must be in `JSONL` or `CSV` format with four required fields (see below). If no test file is provided, the API will automatically perform a train-test split on the training file. Your data files must have the following fields:
The response will include a `job_id` for the tune job. Keep in mind that tuning will take some time to complete.
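The automatic train-test split mentioned above can be sketched locally. This is a minimal illustration, not the API's actual implementation; the 80/20 ratio and the record field names are assumptions for the example.

```python
import json
import random

def train_test_split_jsonl(lines, test_fraction=0.2, seed=0):
    """Shuffle JSONL records and split them into train and test lists.

    Locally mirrors the automatic split the API performs when no test
    file is supplied; the 80/20 ratio here is an assumption.
    """
    records = [json.loads(line) for line in lines]
    rng = random.Random(seed)
    rng.shuffle(records)
    n_test = max(1, int(len(records) * test_fraction))
    return records[n_test:], records[:n_test]

# Toy records -- these field names are placeholders, not the API's
# actual required fields:
rows = [json.dumps({"prompt": f"q{i}", "answer": f"a{i}"}) for i in range(10)]
train, test = train_test_split_jsonl(rows)
print(len(train), len(test))  # 8 2
```

Fixing the seed makes the split reproducible across runs, which is useful when comparing tuned models against the same held-out set.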
Your data files are stored as Datasets for future use. When creating a future tune job, you can simply use the Dataset names as your input parameter. You can manage these Datasets with the `/datasets/tune` and `/datasets/evaluation` endpoints.

You can check the status of your tune job using your `agent_id` and `job_id`.
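Since tuning takes a while, a simple polling loop is the usual pattern for the status check. The sketch below assumes the payload reports its state under a `status` key; in practice `get_status` would wrap a GET against the tune-job status endpoint using your `agent_id` and `job_id` (the exact path and key names are assumptions, so check a real response).

```python
import time

def wait_for_tune_job(get_status, poll_interval=10.0, timeout=3600.0):
    """Poll until the job's status flips from 'processing' to 'completed'.

    `get_status` is any zero-argument callable returning the job's
    response payload as a dict. The 'status' key name is an
    assumption -- confirm it against a real payload.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        payload = get_status()
        if payload.get("status") == "completed":
            return payload  # includes the tuned model_id and eval data
        time.sleep(poll_interval)
    raise TimeoutError("tune job still processing after timeout")

# Simulated status sequence standing in for real API responses:
responses = iter([{"status": "processing"},
                  {"status": "completed", "model_id": "model-abc"}])
result = wait_for_tune_job(lambda: next(responses), poll_interval=0)
print(result["model_id"])  # model-abc
```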
When the job is done, its status will change from `processing` to `completed`. The response payload will contain the tuned `model_id`, which you will use to activate your tuned model, as well as evaluation data from the test split of your tuning dataset. Here is what the response payload looks like:
The evaluation results are stored in a Dataset that you can find in the response payload (in this example, it is `eval-results-101`). You can use the `/datasets/evaluate` API to inspect your row-by-row results.
You can activate your tuned model using its `model_id`. Currently, we allow a maximum of 3 tuned models to be activated per tenant.
Activate the tuned model on your Agent via the `llm_model_id` field. To deactivate it and revert to the base model, set `llm_model_id` to `"default"`.
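As a sketch, the activate/deactivate update bodies differ only in the value of `llm_model_id`. Whether the update call requires other fields is not shown here, so treat this as an illustration rather than the full request.

```python
def llm_model_update(model_id=None):
    """Body for updating an Agent's llm_model_id field.

    Passing a tuned model_id activates that model; passing nothing
    reverts the Agent to the base model by setting llm_model_id to
    "default". Other fields the update call may require are omitted.
    """
    return {"llm_model_id": model_id if model_id is not None else "default"}

activate = llm_model_update("tuned-model-123")  # hypothetical model_id
deactivate = llm_model_update()
```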
Evaluation datasets consist of `prompts` (questions) and `reference` (gold) answers. We currently support two metrics: `equivalence` and `groundedness`.

- `Equivalence`: Uses a Language Model as a judge to evaluate whether the Agent's response is equivalent to the reference (gold-standard) answer.
- `Groundedness`: Decomposes the Agent's response into claims, and then uses a Language Model to evaluate whether those claims are grounded in the retrieved documents.

Your evaluation dataset must be in `CSV` format with two required fields. Launch an evaluation job with your `agent_id` and the `file_path` to your evaluation set.
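Preparing the evaluation file can be sketched as below. The column names are assumptions based on the prompt/reference fields described above; confirm the exact required headers in the API reference before uploading.

```python
import csv
import io

def build_eval_csv(pairs):
    """Serialize (prompt, reference) pairs as an evaluation CSV string.

    The "prompt"/"reference" headers are assumed column names; check
    the API reference for the actual required fields.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["prompt", "reference"])
    writer.writeheader()
    for prompt, reference in pairs:
        writer.writerow({"prompt": prompt, "reference": reference})
    return buf.getvalue()

csv_text = build_eval_csv([
    ("What is the capital of France?", "Paris"),
])
```

Writing through `csv.DictWriter` rather than string concatenation keeps quoting correct when prompts or references contain commas or newlines.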
The response will include a `job_id` for the evaluation job. Keep in mind that evaluation will take some time to complete.
You can check the status of the evaluation job using your `agent_id` and `job_id`.
When the job is done, its status will change from `processing` to `completed`. The response payload will contain evaluation metrics and metadata. Here is what the response payload looks like:
The `metrics` field contains the evaluation scores obtained by the model. `job_metadata` contains information on the evaluation performed, including the number of successful/failed predictions and the Dataset where your row-by-row evaluation results are stored.
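A small helper can pull the headline numbers out of such a payload. The `metrics` and `job_metadata` keys come from the description above, but the count and dataset key names inside `job_metadata` are guesses at the schema; check a real response before relying on them.

```python
def summarize_eval(payload):
    """Extract scores, success rate, and results-Dataset name.

    Key names inside job_metadata (num_successful_predictions,
    num_failed_predictions, dataset_name) are assumed for this sketch.
    """
    meta = payload.get("job_metadata", {})
    ok = meta.get("num_successful_predictions", 0)
    failed = meta.get("num_failed_predictions", 0)
    total = ok + failed
    return {
        "scores": payload.get("metrics", {}),
        "success_rate": ok / total if total else None,
        "results_dataset": meta.get("dataset_name"),
    }

summary = summarize_eval({
    "metrics": {"equivalence": 0.85},
    "job_metadata": {"num_successful_predictions": 18,
                     "num_failed_predictions": 2,
                     "dataset_name": "evalresults-101"},
})
print(summary["success_rate"])  # 0.9
```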
The response will include a `dataset_name` when your evaluation job has completed. In the example above, this is `evalresults-101`. You can then view your raw evaluation results (equivalence and/or groundedness scores for each question-response pair) with the `/datasets/evaluate` endpoint.
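Once you have the row-by-row results, aggregating them back into per-metric averages is straightforward. The sketch below assumes each row arrives as a dict of per-question scores; the column names in the example rows are illustrative.

```python
def mean_scores(rows):
    """Average the numeric metric columns across row-level results.

    Each row is assumed to be a dict such as
    {"prompt": ..., "equivalence": 1.0, "groundedness": 0.5};
    non-numeric columns are ignored.
    """
    sums, counts = {}, {}
    for row in rows:
        for metric, value in row.items():
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                sums[metric] = sums.get(metric, 0.0) + value
                counts[metric] = counts.get(metric, 0) + 1
    return {metric: sums[metric] / counts[metric] for metric in sums}

rows = [
    {"prompt": "q1", "equivalence": 1.0, "groundedness": 1.0},
    {"prompt": "q2", "equivalence": 0.0, "groundedness": 0.5},
]
print(mean_scores(rows))  # {'equivalence': 0.5, 'groundedness': 0.75}
```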
You can also evaluate responses against fine-grained test criteria with the `/lmunit` endpoint. To understand the use cases of `/lmunit`, please read our blog post. LMUnit enables targeted evaluation of specific criteria that you care about, allowing you to evolve and optimize your Agents.
Use the `/lmunit` endpoint to evaluate a given query-response pair. Scores are reported on a 1-5 scale, with higher scores indicating better satisfaction of the test criteria.
Replace `$API_KEY` with your API key, and fill the `query`, `response`, and `unit_test` fields with a query-response pair generated from your app and your test criteria.
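Assembling the `/lmunit` request can be sketched as follows. The `query`, `response`, and `unit_test` body fields follow the description above; the Bearer authorization header is an assumed scheme, and the full base URL is omitted, so verify both against the API reference.

```python
import json

def lmunit_request(api_key, query, response, unit_test):
    """Build the path, headers, and JSON body for an /lmunit call.

    The Bearer auth scheme is an assumption; the body field names
    (query, response, unit_test) follow the documented request.
    """
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "query": query,
        "response": response,
        "unit_test": unit_test,
    })
    return "/lmunit", headers, body

path, headers, body = lmunit_request(
    "$API_KEY",  # replace with your real key
    "What is our refund window?",
    "Refunds are accepted within 30 days of purchase.",
    "Does the response state a concrete time period?",
)
```

You would then POST `body` with `headers` to the `/lmunit` path on the API's base URL using your HTTP client of choice.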