Once your model is trained, the next thing you want to know is how well it performs. OpenPipe’s built-in evaluation framework makes it easy to compare the models you train against previous models as well as generic OpenAI models.

When you train a model, 10% of the dataset entries you provide are withheld from training. These entries form your test set. For each entry in the test set, your new model produces an output that is shown in the evaluation table.
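For intuition, this works like a standard train/test split. The sketch below is purely illustrative (the shuffling and 10% cutoff are assumptions for the example, not OpenPipe’s exact sampling logic):

```python
import random

def split_dataset(entries, test_fraction=0.1, seed=42):
    """Illustrative 90/10 split: roughly 10% of entries are held out as a test set."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    test_size = max(1, int(len(shuffled) * test_fraction))
    return shuffled[test_size:], shuffled[:test_size]  # (train, test)

# Example: 500 dataset entries -> ~450 used for training, ~50 reserved for evaluation.
train_set, test_set = split_dataset([{"id": i} for i in range(500)])
print(len(train_set), len(test_set))  # 450 50
```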



While this table makes it easy to compare model outputs for a given input side by side, it doesn’t tell you which model performs better overall. For that, we need custom evaluations. Evaluations compare model outputs across a variety of inputs to determine which model is doing a better job. On the backend, we use GPT-4 as a judge to decide which output is a better fit for each test dataset entry. From the evaluation’s Settings page, you can configure the exact judgement criteria, which models will be judged, and how many dataset entries will be included in the evaluation.
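To make the judging step concrete, here is a minimal LLM-as-judge sketch using the OpenAI Python SDK. The prompt wording, criteria, and `judge_outputs` helper are assumptions for illustration only, not OpenPipe’s actual judge implementation:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_outputs(prompt: str, criteria: str, output_a: str, output_b: str) -> str:
    """Ask GPT-4 which of two candidate outputs better satisfies the judgement criteria."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are judging two model outputs for the same input. "
                    f"Criteria: {criteria} "
                    "Reply with exactly 'A', 'B', or 'TIE'."
                ),
            },
            {
                "role": "user",
                "content": f"Input:\n{prompt}\n\nOutput A:\n{output_a}\n\nOutput B:\n{output_b}",
            },
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

verdict = judge_outputs(
    prompt="Summarize this article as three bullet points.",
    criteria="Prefer the output that is accurate, concise, and matches the requested format.",
    output_a="- Point one\n- Point two\n- Point three",
    output_b="Here is a long paragraph instead of bullet points...",
)
print(verdict)  # e.g. "A"
```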



Results are shown in both a table and a head-to-head comparison view on the Results page.



To see the whole thing in action, check out the Evaluate tab in our public Bullet Point Generator dataset. Feel free to play around with the display settings to get a feel for how individual models compare against one another!

Evaluation models

We provide OpenAI LLMs like GPT-4 and GPT-4-Turbo for evaluations by default. These models serve as solid benchmarks for comparing the performance of your fine-tuned models.

In addition to the OpenAI models, you can add any hosted model with an OpenAI-compatible API to compare outputs with your fine-tuned models.
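For context, “OpenAI-compatible” means the model can be queried with the standard OpenAI client by pointing it at the provider’s endpoint. The `base_url`, API key, and model name below are placeholders, not real values:

```python
from openai import OpenAI

# Any provider exposing the OpenAI chat completions API can be called this way.
# base_url, api_key, and model are placeholders for your own hosted model.
external_client = OpenAI(
    base_url="https://your-provider.example.com/v1",
    api_key="YOUR_PROVIDER_API_KEY",
)

completion = external_client.chat.completions.create(
    model="your-hosted-model",
    messages=[{"role": "user", "content": "Summarize this article as three bullet points."}],
)
print(completion.choices[0].message.content)
```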

To add an external model for evaluation, navigate to the Project settings page, where you’ll find the option to include additional models in your evaluations.