> ## Documentation Index
> Fetch the complete documentation index at: https://docs.openpipe.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# Code Evaluations

>  Write custom code to evaluate your LLM outputs. 

<Note>
  Code evaluations are not a good match for all tasks. They work well for deterministic tasks like
  classification or information extraction, but not for tasks that produce freeform outputs like
  chatbots or summarization. To evaluate tasks with freeform outputs, please consider [criterion
  evaluations](/features/evaluations/criterion).
</Note>

The code evaluation framework provides greater flexibility than built-in head-to-head and criterion evaluations, allowing you to grade your LLM outputs on whatever metrics you define.

<Frame>
  <img src="https://mintcdn.com/openpipe/yLyh_RHELnvU-7tP/images/features/evaluations/new-code-eval-modal.png?fit=max&auto=format&n=yLyh_RHELnvU-7tP&q=85&s=028b9647ebf79d8f0a20b2603f9adb1b" alt="" width="2992" height="1714" data-path="images/features/evaluations/new-code-eval-modal.png" />
</Frame>

<br />

Each code eval consists of a templated `grader` function that you can customize. Here's the basic structure:

```typescript theme={null}
function grader({
  messages,
  tools,
  toolChoice,
  generatedOutput,
  datasetOutput,
}: GraderArgs): number {
  let score = 0.0;

  // begin implementation

  score = 1.0;

  // end implementation

  return score;
}

...
```

As you can see, the `grader` function takes in a number of arguments and returns a score between 0 and 1, where 1 means the generated output is perfect. The available arguments are:

* `messages`: The messages sent to the LLM.
* `tools`: The tools available to the LLM.
* `toolChoice`: The tool choice specified for the LLM.
* `generatedOutput`: The output generated by the LLM which is being evaluated.
* `datasetOutput`: The original dataset output associated with the row being evaluated.

The grader you define can use any of the above arguments, but most often you'll want to use `generatedOutput` and `datasetOutput` to compare the output of the LLM to the dataset output.

<Frame>
  <img src="https://mintcdn.com/openpipe/yLyh_RHELnvU-7tP/images/features/evaluations/editable-lines.png?fit=max&auto=format&n=yLyh_RHELnvU-7tP&q=85&s=92a5298f7eea9b3ff368ffd0aa2858fd" alt="" width="1758" height="862" data-path="images/features/evaluations/editable-lines.png" />
</Frame>

<br />

To get a better idea of what kinds of checks can be performed through a code evaluation, you can check out the **Exact Match** or **Argument Comparison** templates below.

<Accordion title="Exact Match">
  The **Exact Match** template checks if the generated output matches the dataset output exactly, meaning that the content and tool calls must match exactly.

  ```typescript theme={null}
  function grader({
    messages,
    tools,
    toolChoice,
    generatedOutput,
    datasetOutput,
  }: GraderArgs): number {
    let score = 0.0;

    // begin implementation

    if (!exactToolCallsMatch(generatedOutput.tool_calls, datasetOutput.tool_calls)) {
      return 0.0;
    }

    if (generatedOutput.content !== datasetOutput.content) {
      return 0.0;
    }

    // generated output matches dataset output
    score = 1.0;

    // end implementation

    return score;
  }

  interface GraderArgs {
    messages: ChatCompletionMessageParam;
    tools: ChatCompletionTool[] | null;
    toolChoice: "none" | "auto" | ChatCompletionNamedToolChoice | null;
    generatedOutput: ChatCompletionMessage;
    datasetOutput: ChatCompletionMessage;
  }

  interface ChatCompletionMessageToolCallFunction {
    name: string;
    arguments: string;
  }

  interface ChatCompletionMessageToolCall {
    function: ChatCompletionMessageToolCallFunction;
  }

  interface ChatCompletionMessage {
    content: string | null;
    refusal: string | null;
    tool_calls: ChatCompletionMessageToolCall[] | null;
  }

  type ChatCompletionMessageParam = ChatCompletionMessage;

  interface ChatCompletionTool {
    function: FunctionDefinition;
    type: "function";
  }

  interface FunctionDefinition {
    name: string;
    description?: string;
    parameters?: Record<string, unknown>;
  }

  export interface ChatCompletionNamedToolChoice {
    function: Function;
    type: "function";
  }

  interface Function {
    name: string;
  }

  function exactToolCallsMatch(
    toolCalls1: ChatCompletionMessageToolCall[] | null,
    toolCalls2: ChatCompletionMessageToolCall[] | null,
  ): boolean {
    // If either list is null, they can only match if both are null
    if (!toolCalls1 && !toolCalls2) {
      return true;
    }
    if (!toolCalls1 || !toolCalls2) {
      return false;
    }

    // Check if lengths match
    if (toolCalls1.length !== toolCalls2.length) {
      return false;
    }

    // Compare each tool call
    for (let i = 0; i < toolCalls1.length; i++) {
      const call1 = toolCalls1[i];
      const call2 = toolCalls2[i];

      // Compare all fields that must match exactly
      if (
        call1?.function.name !== call2?.function.name ||
        call1?.function.arguments !== call2?.function.arguments
      ) {
        return false;
      }
    }

    // If we made it through all comparisons, the calls match exactly
    return true;
  }
  ```
</Accordion>

<Accordion title="Argument Comparison">
  The **Argument Comparison** template provides an example of how you can check whether a specific argument in the tool call generated by the LLM matches the dataset output.

  ```typescript theme={null}
  function grader({
    messages,
    tools,
    toolChoice,
    generatedOutput,
    datasetOutput,
  }: GraderArgs): number {
    let score = 0.0;

    // begin implementation

    const generatedToolCallArgsStr = generatedOutput.tool_calls?.[0]?.function.arguments;
    const datasetToolCallArgsStr = datasetOutput.tool_calls?.[0]?.function.arguments;

    if (!generatedToolCallArgsStr || !datasetToolCallArgsStr) {
      return 0.0;
    }

    type JudgementArgs = {
      explanation: string;
      score: number;
    };

    const generatedToolCallArgs = JSON.parse(generatedToolCallArgsStr) as JudgementArgs;
    const datasetToolCallArgs = JSON.parse(datasetToolCallArgsStr) as JudgementArgs;

    if (generatedToolCallArgs.score !== datasetToolCallArgs.score) {
      return 0.0;
    }

    score = 1.0;

    // end implementation

    return score;
  }

  interface GraderArgs {
    messages: ChatCompletionMessageParam;
    tools: ChatCompletionTool[] | null;
    toolChoice: "none" | "auto" | ChatCompletionNamedToolChoice | null;
    generatedOutput: ChatCompletionMessage;
    datasetOutput: ChatCompletionMessage;
  }

  interface ChatCompletionMessageToolCallFunction {
    name: string;
    arguments: string;
  }

  interface ChatCompletionMessageToolCall {
    function: ChatCompletionMessageToolCallFunction;
  }

  interface ChatCompletionMessage {
    content: string | null;
    refusal: string | null;
    tool_calls: ChatCompletionMessageToolCall[] | null;
  }

  type ChatCompletionMessageParam = ChatCompletionMessage;

  interface ChatCompletionTool {
    function: FunctionDefinition;
    type: "function";
  }

  interface FunctionDefinition {
    name: string;
    description?: string;
    parameters?: Record<string, unknown>;
  }

  export interface ChatCompletionNamedToolChoice {
    function: Function;
    type: "function";
  }

  interface Function {
    name: string;
  }

  function exactToolCallsMatch(
    toolCalls1: ChatCompletionMessageToolCall[] | null,
    toolCalls2: ChatCompletionMessageToolCall[] | null,
  ): boolean {
    // If either list is null, they can only match if both are null
    if (!toolCalls1 && !toolCalls2) {
      return true;
    }
    if (!toolCalls1 || !toolCalls2) {
      return false;
    }

    // Check if lengths match
    if (toolCalls1.length !== toolCalls2.length) {
      return false;
    }

    // Compare each tool call
    for (let i = 0; i < toolCalls1.length; i++) {
      const call1 = toolCalls1[i];
      const call2 = toolCalls2[i];

      // Compare all fields that must match exactly
      if (
        call1?.function.name !== call2?.function.name ||
        call1?.function.arguments !== call2?.function.arguments
      ) {
        return false;
      }
    }

    // If we made it through all comparisons, the calls match exactly
    return true;
  }
  ```
</Accordion>

In most cases, you'll want to start from one of the templates and customize the grader function to run the checks you care about. You can also use the **Custom** template to start from scratch.

<Info>
  Currently, the code evaluation framework only supports TypeScript code executed in a sandbox
  environment without access to the internet, external npm packages, or a file system. If you're
  interested in writing evals in other languages or need more advanced features, please let us know
  at [support@openpipe.ai](mailto:support@openpipe.ai).
</Info>
