After you’ve defined and aligned your judge criteria, you can access them via API endpoints for both runtime evaluation (Best of N sampling) and offline testing.

Runtime Evaluation

See the Chat Completion docs and API Reference for more information on making chat completions with OpenPipe.

When making a request to the /chat/completions endpoint, you can specify a list of criteria to run immediately after a completion is generated. We recommend generating multiple responses from the same prompt, each of which will be scored by the specified criteria. The responses will be sorted by their combined score across all criteria, from highest to lowest. This technique is known as Best of N sampling.

To invoke criteria, add an op-criteria header to your request with a list of criterion IDs, like so:

from openpipe import OpenAI

# Find the config values in "Installing the SDK"
client = OpenAI()

completion = client.chat.completions.create(
    model="openai:gpt-4o-mini",
    messages=[{"role": "system", "content": "count to 10"}],
    metadata={
        "prompt_id": "counting",
        "any_key": "any_value",
    },
    n=5,  # generate 5 candidate responses, each scored by the criteria
    extra_headers={"op-criteria": '["criterion-1@v1", "criterion-2"]'},
)

# Choices are sorted by combined criteria score, so the first choice is the best
best_response = completion.choices[0]

Specified criteria can either be versioned, like criterion-1@v1, or unversioned, like criterion-2, in which case the latest version of the criterion is used.

In addition to the usual fields, each chat completion choice will now include a criteria_results object, which contains the judgements of the specified criteria. The array of completion choices will take the following form:

[
  {
    "finish_reason": "stop",
    "index": 0,
    "message": {
      "content": "1, 2, 3.",
      "refusal": null,
      "role": "assistant"
    },
    "logprobs": null,
    "criteria_results": {
      "criterion-1": {
        "status": "success",
        "score": 1,
        "explanation": "..."
      },
      "criterion-2": {
        "status": "success",
        "score": 0.6,
        "explanation": "..."
      }
    }
  },
  {
    ...
  }
]
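
To use these results programmatically, you can read each choice's scores directly. The sketch below makes an assumption about the SDK's response model: it reads criteria_results via pydantic's model_extra, since the field is not part of the standard chat completion choice. Adjust the access path to match how your client surfaces extra fields.

# Hedged sketch: read per-choice criteria scores from the completion above.
# Assumes each choice carries the criteria_results shape shown in the example;
# accessing it via pydantic's model_extra is an assumption about the SDK model.
for choice in completion.choices:
    results = (choice.model_extra or {}).get("criteria_results", {})
    combined = sum(j["score"] for j in results.values() if j["status"] == "success")
    print(f"choice {choice.index}: combined score {combined}")
    for name, judgement in results.items():
        print(f"  {name}: {judgement.get('score')} - {judgement.get('explanation')}")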

Offline Testing

See the API Reference for more details.

To check the quality of a previously generated output against a specific criterion, use the /criteria/judge endpoint. You can request judgements using either the TypeScript or Python SDKs, or through a cURL request.

from openpipe.client import OpenPipe

op_client = OpenPipe()

# Example input and output to judge; in practice, reuse the request messages
# and the previously generated response you want to evaluate
messages = [{"role": "system", "content": "count to 10"}]
output = {"role": "assistant", "content": "1, 2, 3, 4, 5, 6, 7, 8, 9, 10"}

result = op_client.get_criterion_judgement(
    criterion_id="criterion-1@v1",  # if no version is specified, the latest version is used
    input={"messages": messages},
    output=output,
)
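
Once you have the judgement, you can thread it into your own evaluation or logging pipeline. The snippet below is a sketch under the assumption that the judge response mirrors the criteria_results entries shown earlier (status, score, explanation); check the API Reference for the exact response schema.

# Sketch with assumed fields: flag outputs that score below a threshold.
# status/score/explanation are assumed to mirror the criteria_results entries
# above; verify the actual field names against the API Reference.
if result.status == "success" and result.score < 0.5:
    print(f"criterion failed (score {result.score}): {result.explanation}")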