Code evaluations are not a good match for all tasks. They work well for deterministic tasks like classification or information extraction, but not for tasks that produce freeform outputs, such as chatbot responses or summaries. To evaluate tasks with freeform outputs, consider criterion evaluations instead.
The code evaluation framework provides greater flexibility than built-in head-to-head and criterion evaluations, allowing you to grade your LLM outputs on whatever metrics you define.
Each code eval consists of a templated grader function that you can customize. Here’s the basic structure:
```typescript
function grader({
  messages,
  tools,
  toolChoice,
  generatedOutput,
  datasetOutput,
}: GraderArgs): number {
  let score = 0.0;
  // begin implementation
  score = 1.0;
  // end implementation
  return score;
}

// ...
```
The grader function takes a number of arguments and returns a score between 0 and 1, where 1 means the generated output is perfect. The available arguments are:
- messages: The messages sent to the LLM.
- tools: The tools available to the LLM.
- toolChoice: The tool choice specified for the LLM.
- generatedOutput: The output generated by the LLM which is being evaluated.
- datasetOutput: The original dataset output associated with the row being evaluated.
The grader you define can use any of these arguments, but most often you'll use generatedOutput and datasetOutput to compare the LLM's output against the dataset output.
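For example, a minimal grader might compare only the text content of the two messages and give partial credit for a near miss. The sketch below is a hypothetical illustration, not one of the built-in templates, and assumes the GraderArgs and message types shown in the templates that follow:

```typescript
// Hypothetical example (not one of the built-in templates): grade only the text
// content of the generated message, with partial credit for a case-insensitive
// match. Assumes the GraderArgs and message types shown in the templates below.
function grader({ generatedOutput, datasetOutput }: GraderArgs): number {
  const generated = generatedOutput.content?.trim() ?? "";
  const expected = datasetOutput.content?.trim() ?? "";

  if (generated === expected) {
    return 1.0; // exact match
  }
  if (generated.toLowerCase() === expected.toLowerCase()) {
    return 0.5; // partial credit: same text, different casing
  }
  return 0.0; // no match
}
```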
To get a better idea of what kinds of checks can be performed through a code evaluation, you can check out the Exact Match or Argument Comparison templates below.
The Exact Match template checks whether the generated output matches the dataset output exactly: both the content and the tool calls must be identical.
```typescript
function grader({
  messages,
  tools,
  toolChoice,
  generatedOutput,
  datasetOutput,
}: GraderArgs): number {
  let score = 0.0;
  // begin implementation
  if (!exactToolCallsMatch(generatedOutput.tool_calls, datasetOutput.tool_calls)) {
    return 0.0;
  }
  if (generatedOutput.content !== datasetOutput.content) {
    return 0.0;
  }
  // generated output matches dataset output
  score = 1.0;
  // end implementation
  return score;
}

interface GraderArgs {
  messages: ChatCompletionMessageParam;
  tools: ChatCompletionTool[] | null;
  toolChoice: "none" | "auto" | ChatCompletionNamedToolChoice | null;
  generatedOutput: ChatCompletionMessage;
  datasetOutput: ChatCompletionMessage;
}

interface ChatCompletionMessageToolCallFunction {
  name: string;
  arguments: string;
}

interface ChatCompletionMessageToolCall {
  function: ChatCompletionMessageToolCallFunction;
}

interface ChatCompletionMessage {
  content: string | null;
  refusal: string | null;
  tool_calls: ChatCompletionMessageToolCall[] | null;
}

type ChatCompletionMessageParam = ChatCompletionMessage;

interface ChatCompletionTool {
  function: FunctionDefinition;
  type: "function";
}

interface FunctionDefinition {
  name: string;
  description?: string;
  parameters?: Record<string, unknown>;
}

export interface ChatCompletionNamedToolChoice {
  function: Function;
  type: "function";
}

interface Function {
  name: string;
}

function exactToolCallsMatch(
  toolCalls1: ChatCompletionMessageToolCall[] | null,
  toolCalls2: ChatCompletionMessageToolCall[] | null,
): boolean {
  // If either list is null, they can only match if both are null
  if (!toolCalls1 && !toolCalls2) {
    return true;
  }
  if (!toolCalls1 || !toolCalls2) {
    return false;
  }

  // Check if lengths match
  if (toolCalls1.length !== toolCalls2.length) {
    return false;
  }

  // Compare each tool call
  for (let i = 0; i < toolCalls1.length; i++) {
    const call1 = toolCalls1[i];
    const call2 = toolCalls2[i];

    // Compare all fields that must match exactly
    if (
      call1?.function.name !== call2?.function.name ||
      call1?.function.arguments !== call2?.function.arguments
    ) {
      return false;
    }
  }

  // If we made it through all comparisons, the calls match exactly
  return true;
}
```
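To make the scoring behavior concrete, here is a rough sketch of how this grader would score two outputs. The get_weather tool call and message values below are made-up sample data, not part of the template:

```typescript
// Hypothetical sample data to show how the Exact Match grader scores outputs.
const expectedMessage: ChatCompletionMessage = {
  content: null,
  refusal: null,
  tool_calls: [
    { function: { name: "get_weather", arguments: '{"city":"Paris"}' } },
  ],
};

const differentArgsMessage: ChatCompletionMessage = {
  content: null,
  refusal: null,
  tool_calls: [
    { function: { name: "get_weather", arguments: '{"city":"London"}' } },
  ],
};

// Identical content and tool calls -> 1.0
grader({
  messages: expectedMessage,
  tools: null,
  toolChoice: null,
  generatedOutput: expectedMessage,
  datasetOutput: expectedMessage,
});

// Same tool name, different arguments -> 0.0
grader({
  messages: expectedMessage,
  tools: null,
  toolChoice: null,
  generatedOutput: differentArgsMessage,
  datasetOutput: expectedMessage,
});
```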
The Argument Comparison template shows how to check whether a specific argument in the tool call generated by the LLM matches the corresponding argument in the dataset output.
```typescript
function grader({
  messages,
  tools,
  toolChoice,
  generatedOutput,
  datasetOutput,
}: GraderArgs): number {
  let score = 0.0;
  // begin implementation
  const generatedToolCallArgsStr = generatedOutput.tool_calls?.[0]?.function.arguments;
  const datasetToolCallArgsStr = datasetOutput.tool_calls?.[0]?.function.arguments;

  if (!generatedToolCallArgsStr || !datasetToolCallArgsStr) {
    return 0.0;
  }

  type JudgementArgs = {
    explanation: string;
    score: number;
  };

  const generatedToolCallArgs = JSON.parse(generatedToolCallArgsStr) as JudgementArgs;
  const datasetToolCallArgs = JSON.parse(datasetToolCallArgsStr) as JudgementArgs;

  if (generatedToolCallArgs.score !== datasetToolCallArgs.score) {
    return 0.0;
  }

  score = 1.0;
  // end implementation
  return score;
}

interface GraderArgs {
  messages: ChatCompletionMessageParam;
  tools: ChatCompletionTool[] | null;
  toolChoice: "none" | "auto" | ChatCompletionNamedToolChoice | null;
  generatedOutput: ChatCompletionMessage;
  datasetOutput: ChatCompletionMessage;
}

interface ChatCompletionMessageToolCallFunction {
  name: string;
  arguments: string;
}

interface ChatCompletionMessageToolCall {
  function: ChatCompletionMessageToolCallFunction;
}

interface ChatCompletionMessage {
  content: string | null;
  refusal: string | null;
  tool_calls: ChatCompletionMessageToolCall[] | null;
}

type ChatCompletionMessageParam = ChatCompletionMessage;

interface ChatCompletionTool {
  function: FunctionDefinition;
  type: "function";
}

interface FunctionDefinition {
  name: string;
  description?: string;
  parameters?: Record<string, unknown>;
}

export interface ChatCompletionNamedToolChoice {
  function: Function;
  type: "function";
}

interface Function {
  name: string;
}

function exactToolCallsMatch(
  toolCalls1: ChatCompletionMessageToolCall[] | null,
  toolCalls2: ChatCompletionMessageToolCall[] | null,
): boolean {
  // If either list is null, they can only match if both are null
  if (!toolCalls1 && !toolCalls2) {
    return true;
  }
  if (!toolCalls1 || !toolCalls2) {
    return false;
  }

  // Check if lengths match
  if (toolCalls1.length !== toolCalls2.length) {
    return false;
  }

  // Compare each tool call
  for (let i = 0; i < toolCalls1.length; i++) {
    const call1 = toolCalls1[i];
    const call2 = toolCalls2[i];

    // Compare all fields that must match exactly
    if (
      call1?.function.name !== call2?.function.name ||
      call1?.function.arguments !== call2?.function.arguments
    ) {
      return false;
    }
  }

  // If we made it through all comparisons, the calls match exactly
  return true;
}
```
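The strict equality check in this template is easy to loosen. As a hypothetical variation, the sketch below gives graded credit based on how far apart the two score arguments are. It reuses the same GraderArgs and JudgementArgs shapes as the template and is renamed only so it doesn't clash with the grader above; in practice you would edit the body of grader itself.

```typescript
// Hypothetical variation of the Argument Comparison grader: instead of requiring
// the two `score` arguments to be identical, award graded credit based on how
// close they are. Relies on the same GraderArgs and message types as the
// template above; renamed here only to avoid clashing with that template's grader.
function toleranceGrader({ generatedOutput, datasetOutput }: GraderArgs): number {
  const generatedArgsStr = generatedOutput.tool_calls?.[0]?.function.arguments;
  const datasetArgsStr = datasetOutput.tool_calls?.[0]?.function.arguments;
  if (!generatedArgsStr || !datasetArgsStr) {
    return 0.0;
  }

  type JudgementArgs = { explanation: string; score: number };
  const generatedArgs = JSON.parse(generatedArgsStr) as JudgementArgs;
  const datasetArgs = JSON.parse(datasetArgsStr) as JudgementArgs;

  // Full credit if the scores are within 1 point, half credit within 2 points.
  const diff = Math.abs(generatedArgs.score - datasetArgs.score);
  if (diff <= 1) return 1.0;
  if (diff <= 2) return 0.5;
  return 0.0;
}
```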
In most cases, you’ll want to start from one of the templates and customize the grader function to run the checks you care about. You can also use the Custom template to start from scratch.
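For example, a from-scratch grader doesn't have to reference the dataset output at all. The hypothetical sketch below checks only that the generated tool call's arguments are valid JSON and contain a non-empty email field (a made-up requirement, purely for illustration), assuming the same GraderArgs types as the templates above:

```typescript
// Hypothetical from-scratch grader: checks that the generated tool call's
// arguments are valid JSON and include a non-empty "email" field, without
// comparing against the dataset output at all. Assumes the same GraderArgs
// types as the templates above; the "email" field is purely illustrative.
function grader({ generatedOutput }: GraderArgs): number {
  const argsStr = generatedOutput.tool_calls?.[0]?.function.arguments;
  if (!argsStr) {
    return 0.0;
  }

  try {
    const args = JSON.parse(argsStr) as Record<string, unknown>;
    // Partial credit for valid JSON, full credit if the required field is present.
    if (typeof args.email === "string" && args.email.length > 0) {
      return 1.0;
    }
    return 0.5;
  } catch {
    // Arguments were not valid JSON
    return 0.0;
  }
}
```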
Currently, the code evaluation framework only supports TypeScript code executed in a sandbox
environment without access to the internet, external npm packages, or a file system. If you’re
interested in writing evals in other languages or need more advanced features, please let us know
at support@openpipe.ai.