Code evaluations are not a good match for all tasks. They work well for deterministic tasks like classification or information extraction, but not for tasks that produce freeform outputs, such as chatbot responses or summaries. To evaluate tasks with freeform outputs, consider criterion evaluations instead.
The code evaluation framework provides greater flexibility than built-in head-to-head and criterion evaluations, allowing you to grade your LLM outputs on whatever metrics you define.
Each code eval consists of a templated grader function that you can customize. Here’s the basic structure:
```typescript
function grader({
  messages,
  tools,
  toolChoice,
  generatedOutput,
  datasetOutput,
}: GraderArgs): number {
  let score = 0.0;
  // begin implementation
  score = 1.0;
  // end implementation
  return score;
}

// ...
```
The grader function takes a number of arguments and returns a score between 0 and 1, where 1 means the generated output is perfect. The available arguments are:
- messages: The messages sent to the LLM.
- tools: The tools available to the LLM.
- toolChoice: The tool choice specified for the LLM.
- generatedOutput: The output generated by the LLM which is being evaluated.
- datasetOutput: The original dataset output associated with the row being evaluated.
The grader you define can use any of these arguments, but most often you'll use generatedOutput and datasetOutput to compare the LLM's output against the dataset output.
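For example, a minimal grader might compare only the text content of the two messages and give partial credit for a near miss. The sketch below is a hypothetical illustration, not one of the built-in templates, and assumes the GraderArgs and message types shown in the templates that follow:

```typescript
// Hypothetical example (not one of the built-in templates): grade only the text
// content of the generated message, with partial credit for a case-insensitive
// match. Assumes the GraderArgs and message types shown in the templates below.
function grader({ generatedOutput, datasetOutput }: GraderArgs): number {
  const generated = generatedOutput.content?.trim() ?? "";
  const expected = datasetOutput.content?.trim() ?? "";

  if (generated === expected) {
    return 1.0; // exact match
  }
  if (generated.toLowerCase() === expected.toLowerCase()) {
    return 0.5; // partial credit: same text, different casing
  }
  return 0.0; // no match
}
```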
To get a better idea of what kinds of checks can be performed through a code evaluation, you can check out the Exact Match or Argument Comparison templates below.
The Exact Match template checks whether the generated output matches the dataset output exactly: both the content and the tool calls must be identical.
```typescript
function grader({
  messages,
  tools,
  toolChoice,
  generatedOutput,
  datasetOutput,
}: GraderArgs): number {
  let score = 0.0;
  // begin implementation
  if (!exactToolCallsMatch(generatedOutput.tool_calls, datasetOutput.tool_calls)) {
    return 0.0;
  }
  if (generatedOutput.content !== datasetOutput.content) {
    return 0.0;
  }
  // generated output matches dataset output
  score = 1.0;
  // end implementation
  return score;
}

interface GraderArgs {
  messages: ChatCompletionMessageParam;
  tools: ChatCompletionTool[] | null;
  toolChoice: "none" | "auto" | ChatCompletionNamedToolChoice | null;
  generatedOutput: ChatCompletionMessage;
  datasetOutput: ChatCompletionMessage;
}

interface ChatCompletionMessageToolCallFunction {
  name: string;
  arguments: string;
}

interface ChatCompletionMessageToolCall {
  function: ChatCompletionMessageToolCallFunction;
}

interface ChatCompletionMessage {
  content: string | null;
  refusal: string | null;
  tool_calls: ChatCompletionMessageToolCall[] | null;
}

type ChatCompletionMessageParam = ChatCompletionMessage;

interface ChatCompletionTool {
  function: FunctionDefinition;
  type: "function";
}

interface FunctionDefinition {
  name: string;
  description?: string;
  parameters?: Record<string, unknown>;
}

export interface ChatCompletionNamedToolChoice {
  function: Function;
  type: "function";
}

interface Function {
  name: string;
}

function exactToolCallsMatch(
  toolCalls1: ChatCompletionMessageToolCall[] | null,
  toolCalls2: ChatCompletionMessageToolCall[] | null,
): boolean {
  // If either list is null, they can only match if both are null
  if (!toolCalls1 && !toolCalls2) {
    return true;
  }
  if (!toolCalls1 || !toolCalls2) {
    return false;
  }

  // Check if lengths match
  if (toolCalls1.length !== toolCalls2.length) {
    return false;
  }

  // Compare each tool call
  for (let i = 0; i < toolCalls1.length; i++) {
    const call1 = toolCalls1[i];
    const call2 = toolCalls2[i];

    // Compare all fields that must match exactly
    if (
      call1?.function.name !== call2?.function.name ||
      call1?.function.arguments !== call2?.function.arguments
    ) {
      return false;
    }
  }

  // If we made it through all comparisons, the calls match exactly
  return true;
}
```
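To make the scoring behavior concrete, here is a rough sketch of how this grader would score two outputs. The get_weather tool call and message values below are made-up sample data, not part of the template:

```typescript
// Hypothetical sample data to show how the Exact Match grader scores outputs.
const expectedMessage: ChatCompletionMessage = {
  content: null,
  refusal: null,
  tool_calls: [
    { function: { name: "get_weather", arguments: '{"city":"Paris"}' } },
  ],
};

const differentArgsMessage: ChatCompletionMessage = {
  content: null,
  refusal: null,
  tool_calls: [
    { function: { name: "get_weather", arguments: '{"city":"London"}' } },
  ],
};

// Identical content and tool calls -> 1.0
grader({
  messages: expectedMessage,
  tools: null,
  toolChoice: null,
  generatedOutput: expectedMessage,
  datasetOutput: expectedMessage,
});

// Same tool name, different arguments -> 0.0
grader({
  messages: expectedMessage,
  tools: null,
  toolChoice: null,
  generatedOutput: differentArgsMessage,
  datasetOutput: expectedMessage,
});
```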
The Argument Comparison template shows how to check whether a specific argument in the tool call generated by the LLM matches the corresponding argument in the dataset output.
```typescript
function grader({
  messages,
  tools,
  toolChoice,
  generatedOutput,
  datasetOutput,
}: GraderArgs): number {
  let score = 0.0;
  // begin implementation
  const generatedToolCallArgsStr = generatedOutput.tool_calls?.[0]?.function.arguments;
  const datasetToolCallArgsStr = datasetOutput.tool_calls?.[0]?.function.arguments;

  if (!generatedToolCallArgsStr || !datasetToolCallArgsStr) {
    return 0.0;
  }

  type JudgementArgs = {
    explanation: string;
    score: number;
  };

  const generatedToolCallArgs = JSON.parse(generatedToolCallArgsStr) as JudgementArgs;
  const datasetToolCallArgs = JSON.parse(datasetToolCallArgsStr) as JudgementArgs;

  if (generatedToolCallArgs.score !== datasetToolCallArgs.score) {
    return 0.0;
  }

  score = 1.0;
  // end implementation
  return score;
}

interface GraderArgs {
  messages: ChatCompletionMessageParam;
  tools: ChatCompletionTool[] | null;
  toolChoice: "none" | "auto" | ChatCompletionNamedToolChoice | null;
  generatedOutput: ChatCompletionMessage;
  datasetOutput: ChatCompletionMessage;
}

interface ChatCompletionMessageToolCallFunction {
  name: string;
  arguments: string;
}

interface ChatCompletionMessageToolCall {
  function: ChatCompletionMessageToolCallFunction;
}

interface ChatCompletionMessage {
  content: string | null;
  refusal: string | null;
  tool_calls: ChatCompletionMessageToolCall[] | null;
}

type ChatCompletionMessageParam = ChatCompletionMessage;

interface ChatCompletionTool {
  function: FunctionDefinition;
  type: "function";
}

interface FunctionDefinition {
  name: string;
  description?: string;
  parameters?: Record<string, unknown>;
}

export interface ChatCompletionNamedToolChoice {
  function: Function;
  type: "function";
}

interface Function {
  name: string;
}

function exactToolCallsMatch(
  toolCalls1: ChatCompletionMessageToolCall[] | null,
  toolCalls2: ChatCompletionMessageToolCall[] | null,
): boolean {
  // If either list is null, they can only match if both are null
  if (!toolCalls1 && !toolCalls2) {
    return true;
  }
  if (!toolCalls1 || !toolCalls2) {
    return false;
  }

  // Check if lengths match
  if (toolCalls1.length !== toolCalls2.length) {
    return false;
  }

  // Compare each tool call
  for (let i = 0; i < toolCalls1.length; i++) {
    const call1 = toolCalls1[i];
    const call2 = toolCalls2[i];

    // Compare all fields that must match exactly
    if (
      call1?.function.name !== call2?.function.name ||
      call1?.function.arguments !== call2?.function.arguments
    ) {
      return false;
    }
  }

  // If we made it through all comparisons, the calls match exactly
  return true;
}
```
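The strict equality check in this template is easy to loosen. As a hypothetical variation, the sketch below gives graded credit based on how far apart the two score arguments are. It reuses the same GraderArgs and JudgementArgs shapes as the template and is renamed only so it doesn't clash with the grader above; in practice you would edit the body of grader itself.

```typescript
// Hypothetical variation of the Argument Comparison grader: instead of requiring
// the two `score` arguments to be identical, award graded credit based on how
// close they are. Relies on the same GraderArgs and message types as the
// template above; renamed here only to avoid clashing with that template's grader.
function toleranceGrader({ generatedOutput, datasetOutput }: GraderArgs): number {
  const generatedArgsStr = generatedOutput.tool_calls?.[0]?.function.arguments;
  const datasetArgsStr = datasetOutput.tool_calls?.[0]?.function.arguments;
  if (!generatedArgsStr || !datasetArgsStr) {
    return 0.0;
  }

  type JudgementArgs = { explanation: string; score: number };
  const generatedArgs = JSON.parse(generatedArgsStr) as JudgementArgs;
  const datasetArgs = JSON.parse(datasetArgsStr) as JudgementArgs;

  // Full credit if the scores are within 1 point, half credit within 2 points.
  const diff = Math.abs(generatedArgs.score - datasetArgs.score);
  if (diff <= 1) return 1.0;
  if (diff <= 2) return 0.5;
  return 0.0;
}
```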
In most cases, you’ll want to start from one of the templates and customize the grader function to run the checks you care about. You can also use the Custom template to start from scratch.
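For example, a from-scratch grader doesn't have to reference the dataset output at all. The hypothetical sketch below checks only that the generated tool call's arguments are valid JSON and contain a non-empty email field (a made-up requirement, purely for illustration), assuming the same GraderArgs types as the templates above:

```typescript
// Hypothetical from-scratch grader: checks that the generated tool call's
// arguments are valid JSON and include a non-empty "email" field, without
// comparing against the dataset output at all. Assumes the same GraderArgs
// types as the templates above; the "email" field is purely illustrative.
function grader({ generatedOutput }: GraderArgs): number {
  const argsStr = generatedOutput.tool_calls?.[0]?.function.arguments;
  if (!argsStr) {
    return 0.0;
  }

  try {
    const args = JSON.parse(argsStr) as Record<string, unknown>;
    // Partial credit for valid JSON, full credit if the required field is present.
    if (typeof args.email === "string" && args.email.length > 0) {
      return 1.0;
    }
    return 0.5;
  } catch {
    // Arguments were not valid JSON
    return 0.0;
  }
}
```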
Currently, the code evaluation framework only supports TypeScript code executed in a sandbox
environment without access to the internet, external npm packages, or a file system. If you’re
interested in writing evals in other languages or need more advanced features, please let us know
at support@openpipe.ai.