An alignment set is a collection of LLM input/output pairs that are judged by both the criterion's LLM judge and a human. The performance of the LLM judge is then measured by how closely its judgements match those of the human judge. We recommend importing and judging at least 30 rows to ensure the alignment stats are meaningful.

Importing an Alignment Set

You can import an alignment set from either an OpenPipe dataset or a JSONL file. Alignment sets can be added to an existing criterion or imported when a new criterion is created.

Importing from a Dataset

When importing from a dataset, you select the number of rows to be randomly sampled from the dataset of your choice and imported into the criterion alignment set. The inputs of each of these rows are copied directly from the dataset without any changes. By default, the outputs are also copied from the original dataset. However, if you set Output Source to an LLM model, the outputs are instead generated by that model from the dataset inputs.
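The sampling behavior described above can be pictured with a minimal Python sketch. It is only an illustration of the semantics, not OpenPipe's implementation: the function name, the "input"/"output" row keys, and the generate_output callback are all hypothetical.

```python
import random
from typing import Callable, Optional


def sample_alignment_rows(
    dataset: list[dict],
    num_rows: int,
    generate_output: Optional[Callable[[str], str]] = None,  # hypothetical stand-in for an LLM model
    seed: Optional[int] = None,
) -> list[dict]:
    """Randomly sample rows; copy inputs as-is and either copy or regenerate outputs."""
    rng = random.Random(seed)
    # Sample without replacement, capped at the dataset size.
    sampled = rng.sample(dataset, min(num_rows, len(dataset)))
    rows = []
    for row in sampled:
        if generate_output is None:
            output = row["output"]  # default: output copied from the dataset
        else:
            output = generate_output(row["input"])  # output produced from the input
        rows.append({"input": row["input"], "output": output})
    return rows
```

Passing no generate_output mirrors the default Output Source, where both inputs and outputs come straight from the dataset.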

Importing from a JSONL File

You can also import an alignment set from a JSONL file. Uploads are limited to 10MB in size, which should be plenty for an alignment set.

The schema of the JSONL file is exactly the same as an OpenAI-compatible JSONL fine-tuning file, but also supports an optional judgement field for each row. judgement can be either PASS or FAIL, depending on whether the row should pass or fail the criterion.

Example

...
{"judgement": "PASS", "messages":[{"role":"system","content":"You are a helpful assistant"},{"role":"user","content":"What is the capital of Tasmania?"},{"role":"assistant","content":null,"tool_calls":[{"id":"","type":"function","function":{"name":"identify_capital","arguments":"{\"capital\":\"Hobart\"}"}}]}],"tools":[{"type":"function","function":{"name":"identify_capital","parameters":{"type":"object","properties":{"capital":{"type":"string"}}}}}]}
{"judgement": "FAIL", "messages":[{"role":"system","content":"You are a helpful assistant"},{"role":"user","content":"What is the capital of Sweden?"},{"role":"assistant","content":null,"tool_calls":[{"id":"","type":"function","function":{"name":"identify_capital","arguments":"{\"capital\":\"Beijing\"}"}}]}],"tools":[{"type":"function","function":{"name":"identify_capital","parameters":{"type":"object","properties":{"capital":{"type":"string"}}}}}]}
{"messages":[{"role":"system","content":"You are a helpful assistant"},{"role":"user","content":"What is the capital of Sweden?"},{"role":"assistant","content":null,"tool_calls":[{"id":"","type":"function","function":{"name":"identify_capital","arguments":"{\"capital\":\"Stockholm\"}"}}]}],"tools":[{"type":"function","function":{"name":"identify_capital","parameters":{"type":"object","properties":{"capital":{"type":"string"}}}}}]}
...
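Before uploading, it can help to check a file locally. The following Python sketch is one way to do that, under stated assumptions: the function name is hypothetical, 10MB is taken to mean 10 * 1024 * 1024 bytes, and the checks cover only what this page describes (valid JSON per line, a messages list, and an optional judgement of PASS or FAIL). The actual import may validate more or differently.

```python
import json

MAX_BYTES = 10 * 1024 * 1024  # assumed interpretation of the 10MB upload limit
VALID_JUDGEMENTS = ("PASS", "FAIL")


def validate_alignment_jsonl(text: str) -> list[str]:
    """Return a list of problems found; an empty list means no problems were detected."""
    problems = []
    if len(text.encode("utf-8")) > MAX_BYTES:
        problems.append("file exceeds the 10MB upload limit")
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {line_no}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(row, dict):
            problems.append(f"line {line_no}: each row must be a JSON object")
            continue
        messages = row.get("messages")
        if not isinstance(messages, list) or not messages:
            problems.append(f"line {line_no}: 'messages' must be a non-empty list")
        # The judgement field is optional; when present it must be PASS or FAIL.
        judgement = row.get("judgement")
        if judgement is not None and judgement not in VALID_JUDGEMENTS:
            problems.append(f"line {line_no}: judgement must be PASS or FAIL, got {judgement!r}")
    return problems
```

A typical use would be to read the file with open(path, encoding="utf-8").read() and print each returned problem before attempting the upload.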

Alignment Stats

Alignment stats are a simple way to understand how well your criterion is performing. As you refine your criterion prompt, your alignment stats will improve as well.

  • Precision indicates the fraction of rows that the LLM judge labeled as failing that a human judge also labeled as failing. It’s an indicator of how reliable the LLM judge’s FAIL label is.
  • Recall indicates the fraction of rows that a human judge labeled as failing that the LLM judge also labeled as failing. It’s an indicator of how reliable the LLM judge’s PASS label is.
  • F1 Score is the harmonic mean of precision and recall. As either score improves while the other holds steady, the F1 score also improves.
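The three definitions above treat FAIL as the label being detected. A minimal Python sketch of the arithmetic follows; the function name is hypothetical, and returning 0.0 when a denominator is zero is an assumption made here, not documented behavior.

```python
def alignment_stats(human: list[str], llm: list[str]) -> dict[str, float]:
    """Precision, recall and F1 for paired PASS/FAIL labels, with FAIL as the positive class."""
    if len(human) != len(llm):
        raise ValueError("human and llm label lists must have the same length")
    pairs = list(zip(human, llm))
    both_fail = sum(1 for h, m in pairs if h == "FAIL" and m == "FAIL")
    llm_fail = sum(1 for _, m in pairs if m == "FAIL")
    human_fail = sum(1 for h, _ in pairs if h == "FAIL")
    # Precision: of the rows the LLM judge failed, the fraction the human also failed.
    precision = both_fail / llm_fail if llm_fail else 0.0  # 0.0 on empty denominator is an assumption
    # Recall: of the rows the human failed, the fraction the LLM judge also failed.
    recall = both_fail / human_fail if human_fail else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

For example, with human labels FAIL, FAIL, FAIL, PASS and LLM labels FAIL, FAIL, PASS, PASS, both rows the LLM judge failed were also failed by the human (precision 1.0), but the LLM judge caught only two of the three human failures (recall 2/3), giving an F1 score of 0.8.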

To ensure your alignment stats are meaningful, we recommend labeling at least 30 rows, but in some cases you may need to label more in order to get a reliable statistic.