Criteria are currently in beta. Talk to the OpenPipe team (hello@openpipe.ai) to get access.

Criteria are a reliable way to detect and correct mistakes in LLM output. They are currently used when defining LLM evaluations and will soon be integrated into the data relabeling flow to improve dataset quality.

Before you begin

Before creating your first criterion, you should identify an issue with your model’s output that you want to detect and correct. You should also have either an OpenPipe dataset or a JSONL file containing several rows of data that exhibit the issue, and several that don’t.
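
For instance, if the issue you want to detect is the model answering in a different language than the input (the example used throughout this guide), a JSONL file might contain rows like the ones below. The chat-style messages schema shown here is only an illustration, not a required format; use whatever shape your dataset rows already have:

    {"messages": [{"role": "user", "content": "¿Cuál es la capital de Francia?"}, {"role": "assistant", "content": "La capital de Francia es París."}]}
    {"messages": [{"role": "user", "content": "¿Cuál es la capital de Italia?"}, {"role": "assistant", "content": "The capital of Italy is Rome."}]}

The first row keeps the input and output in the same language; the second exhibits the issue.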

1. Open the creation modal

Navigate to the Criteria tab and click New Criterion. The criterion creation modal will open with a default prompt and judge model.

2. Draft an initial prompt

Write an initial LLM prompt with basic instructions for identifying rows containing the issue you want to detect and correct. Don’t worry about engineering a perfect prompt; you’ll have a chance to improve it during the alignment process.

As an example, if you want to detect rows in which the model’s output is in a different language than the input, you might write a prompt like this:

    Mark the criterion as passed if the input and output are in the same language.
    Mark it as failed if they are in different languages.

Make sure to use the terms input, output, passed, and failed in your prompt to match our internal templating.

Finally, import a few rows (we recommend at least 30) into an alignment set for the criterion.

3. Confirm creation

Click Create to create the criterion and run the initial prompt against the imported alignment set. You’ll be redirected to the criterion’s alignment page.

4. Align the criterion

Aligning a criterion involves two simple processes:

  • Manually labeling outputs
  • Refining the criterion

1. Manually labeling outputs

In order to know whether you agree with your criterion’s judgements, you’ll need to label some data yourself. Use the Alignment UI to manually label each output with PASS or FAIL based on the criterion. Feel free to SKIP outputs you aren’t sure about and come back to them later.

Try to label at least 30 rows to provide a reliable estimate of the LLM’s precision and recall.
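
Here, precision and recall measure how well the LLM judge’s verdicts agree with your manual labels. The sketch below is purely illustrative (OpenPipe computes its alignment stats internally, and treating FAIL, i.e. “the issue is present,” as the positive class is an assumption made for this example):

    # Illustrative arithmetic only: compare the judge's PASS/FAIL verdicts
    # against your own manual labels for the same rows.
    def judge_precision_recall(pairs, positive="FAIL"):
        """pairs: list of (manual_label, judge_label) tuples, each "PASS" or "FAIL"."""
        tp = sum(1 for manual, judge in pairs if judge == positive and manual == positive)
        fp = sum(1 for manual, judge in pairs if judge == positive and manual != positive)
        fn = sum(1 for manual, judge in pairs if judge != positive and manual == positive)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return precision, recall

    # Example: one agreement on FAIL, one missed failure, one false alarm.
    labels = [("FAIL", "FAIL"), ("PASS", "PASS"), ("FAIL", "PASS"), ("PASS", "FAIL")]
    print(judge_precision_recall(labels))  # (0.5, 0.5)

High precision means that when the judge marks a row as failed, you usually agree; high recall means it rarely misses rows you would mark as failed.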

2. Refining the criterion

As you record your own judgements, adjust the criterion’s prompt and judge model so that the LLM’s judgements align with your own.
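
For the language-check example above, one possible refinement (shown only as an illustration) might spell out edge cases you discover while labeling:

    Mark the criterion as passed if the output is written in the same language as the input.
    Code snippets, proper nouns, and directly quoted text may remain in their original language.
    Mark it as failed if the main body of the output is in a different language than the input.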

Investing time in a good prompt and selecting the best judge model pays dividends even before you begin using the criterion to improve your dataset. High-quality LLM judgements speed up manual labeling by helping you quickly identify rows that fail the criterion.

As you improve your criterion prompt, you’ll notice your alignment stats improve. Once you’ve labeled enough rows and are satisfied with the precision and recall of your LLM judge, the criterion is ready to be used!

5. Using your criterion

Once your criterion has been aligned, you can use it to create criterion evals. Unlike head-to-head evals, criterion evals are not pairwise comparisons. Instead, they evaluate the quality of a model’s output across a dataset according to a specific criterion.

To create a criterion eval, navigate to the Evals tab and click New Evaluation. Select Add criterion eval from the list.

Choose the dataset you’d like to evaluate. Just like when creating head-to-head evals, you can choose to evaluate any model’s output or the dataset output itself. Next, choose the criterion you would like to test your data against. The same judge model and prompt you defined when creating the criterion will be used to run the evaluation.

Finally, click Create to run the evaluation. As soon as it completes, you’ll be able to view results based on aligned LLM judgements!