Use the Criteria API for runtime evaluation and offline testing.
/chat/completions
endpoint, you can specify a list of criteria to run immediately after a completion is generated. We recommend generating multiple responses from the same prompt, each of which will be scored by the specified criteria. The responses will be sorted by their combined score across all criteria, from highest to lowest. This technique is known as Best of N sampling.
To invoke criteria, add an op-criteria
header to your request with a list of criterion IDs, like so:
criterion-1@v1
, or default to the latest criterion version, like criterion-2
.
In addition to the usual fields, each chat completion choice will now include a criteria_results
object, which contains the judgements of the specified criteria. The array of completion choices will take the following form:
/criteria/judge
endpoint. You can request judgements using either the TypeScript or Python SDKs, or through a cURL request.