Launch an Evaluation job which evaluates an Agent on a set of test questions and reference answers.
An Evaluation is an asynchronous operation. Users can select one or more metrics to assess the quality of generated answers. These metrics include `equivalence` and `groundedness`. `equivalence` evaluates if the Agent response is equivalent to the ground truth (model-driven binary classification). `groundedness` decomposes the Agent response into claims and then evaluates if the claims are grounded by the retrieved documents.
Evaluation data can be provided in one of two forms:
- A CSV `evalset_file` containing the columns `prompt` (i.e. questions) and `reference` (i.e. gold answers).
- An `evalset_name` which refers to a `Dataset` created through the `/datasets/evaluate` API. A request sketch covering both forms follows this list.
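To illustrate the two forms, here is a hedged sketch that reuses the hypothetical call above. The field names `evalset_file` and `evalset_name` come from this page, but the endpoint, multipart layout, and sample values are assumptions:

```python
import requests

API_BASE = "https://api.example.com/v1"   # hypothetical base URL
AGENT_ID = "agent-123"                    # hypothetical Agent ID
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Form 1: upload a CSV evalset_file. The file would contain `prompt` and
# `reference` columns, e.g.:
#   prompt,reference
#   "What is the capital of France?","Paris"
with open("evalset.csv", "rb") as f:
    resp = requests.post(
        f"{API_BASE}/agents/{AGENT_ID}/evaluate",
        headers=HEADERS,
        files={"evalset_file": ("evalset.csv", f, "text/csv")},
        data={"metrics": ["equivalence", "groundedness"]},
    )

# Form 2: reference a Dataset previously created via the /datasets/evaluate API.
resp = requests.post(
    f"{API_BASE}/agents/{AGENT_ID}/evaluate",
    headers=HEADERS,
    data={
        "evalset_name": "my-evalset",     # hypothetical Dataset name
        "metrics": ["equivalence", "groundedness"],
    },
)
```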