Train reward models to judge the quality of LLM responses based on preference data.
The `rejected_message` field should contain the rejected response, and the final assistant message should be the preferred response. Conveniently, this is the same data format used for DPO, which allows you to train both reward and completion models from the same dataset.
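To make the format concrete, here is a minimal Python sketch of what a single training row might look like. Only the `rejected_message` field and the final-assistant-message convention come from the description above; the surrounding schema (a `messages` list, JSONL storage, and `rejected_message` being a full message object) is an assumption for illustration.

```python
import json

# A minimal sketch of one preference-data record. The exact shape of the
# surrounding schema is assumed; only rejected_message and the
# "final assistant message is preferred" rule come from the docs text.
record = {
    "messages": [
        {"role": "user", "content": "Summarize the plot of Hamlet in one sentence."},
        # Final assistant message: the PREFERRED response.
        {
            "role": "assistant",
            "content": "Prince Hamlet feigns madness while plotting revenge "
                       "against his uncle for murdering his father.",
        },
    ],
    # The rejected_message field holds the dispreferred response.
    "rejected_message": {
        "role": "assistant",
        "content": "Hamlet is a play set in a castle in Denmark.",
    },
}

# Write one record per line (JSONL), a common layout for training datasets.
with open("preference_data.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")
```

Because the preferred and rejected responses live in the same row, the same file can drive both reward-model training (score the pair) and DPO (optimize the preference margin) without reformatting.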