Before training, it’s useful to filter and refine your data in a pipeline. Pipelines are especially useful for monitoring deployed fine-tuned models and detecting any mistakes they make. This flow allows your models to get stronger over time, as you continually improve your data and train new models.

A pipeline includes up to 4 steps:

  • Request Log Import
  • LLM Filtering (optional)
  • LLM Relabeling (optional)
  • Connected Dataset

Let’s break down each step.

Request Log Import

Use SQL-based filters to import relevant request logs. In addition to filtering by model and searching for keywords in the input and output, you can also filter by tags you attached when reporting your request logs (we recommend starting with prompt_id).
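If you report logs through OpenPipe's drop-in OpenAI client, tags can be attached at request time and filtered on later. The sketch below is illustrative only; the exact parameter name and tag format are assumptions, so check the SDK docs for the current interface.

```python
# Illustrative sketch: attaching a prompt_id tag when reporting request logs.
# Assumes the OpenPipe Python SDK's drop-in OpenAI client; the `openpipe`
# kwarg and tag format shown here are assumptions. Check the SDK docs.
from openpipe import OpenAI

client = OpenAI()  # API keys read from the environment

completion = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Summarize this support ticket..."}],
    openpipe={"tags": {"prompt_id": "ticket_summarizer_v1"}},  # filterable during import
)
```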

After defining your high-level filters, you can specify the Sample Rate and Max Rows of your import. Sample rate is the percentage of filter-matching logs that will be processed. Max rows is the maximum number of logs that will be imported in this step. A single pipeline can process up to 20k rows.
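As a rough worked example (the exact order in which sampling and capping are applied is an assumption):

```python
# Rough sketch of how Sample Rate and Max Rows combine (order of operations
# is an assumption): sample the matching logs, then cap the result at
# Max Rows and the 20k-per-pipeline limit.
matching_logs = 100_000   # logs that pass the SQL-based filters
sample_rate = 0.10        # 10% sample rate
max_rows = 5_000

imported = min(int(matching_logs * sample_rate), max_rows, 20_000)
print(imported)  # 5000: 10% of 100k is 10k, capped at Max Rows = 5k
```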

Filtering

Sometimes SQL-based filters aren’t robust enough to find the data you want to process. For example, you may want to detect all logs that contain a response in a different language than the original input. In that case, you can use an LLM to look through your data and find the entries you want to process.

Only the entries that match your LLM filter will continue down the pipeline. Depending on the difficulty of your filtering task, you can use a cheap but weak model (like gpt-3.5-turbo) or a stronger but more expensive one (like gpt-4) to do the filtering.
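Conceptually, the filter step asks a model a yes/no question about each row and keeps only the rows that pass. Here is a minimal sketch of that idea for the language-mismatch example above, written against the OpenAI client; the prompt and helper are illustrative, not the pipeline's internal implementation.

```python
# Minimal sketch of an LLM filter: keep only rows where the response is in a
# different language than the input. Purely illustrative; the pipeline runs
# its own prompt internally. This just shows the shape of the check.
from openai import OpenAI

client = OpenAI()

def response_language_differs(user_input: str, model_output: str) -> bool:
    """Ask a cheap model a yes/no question about one request log."""
    answer = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": (
                "Is the RESPONSE written in a different language than the INPUT? "
                "Answer with exactly YES or NO.\n\n"
                f"INPUT:\n{user_input}\n\nRESPONSE:\n{model_output}"
            ),
        }],
        temperature=0,
    )
    return answer.choices[0].message.content.strip().upper().startswith("YES")
```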

Relabeling

If you want to improve the quality of your data, you can use an LLM to relabel it. This is useful if the model you’re using in production doesn’t handle some kinds of inputs well.

Again, depending on the difficulty of the task, you can select weaker or stronger models for relabeling. We highly recommend using the best model possible for the task at hand, since training data quality is crucial for model performance.
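In essence, relabeling re-runs the original prompt through a stronger model and uses the new response as the training label. The sketch below shows that shape; it is illustrative only, since the pipeline performs this step for you.

```python
# Minimal sketch of LLM relabeling: regenerate the assistant reply with a
# stronger model so the corrected output becomes the training label.
# Illustrative only; the pipeline handles this internally.
from openai import OpenAI

client = OpenAI()

def relabel(messages: list[dict]) -> str:
    """Re-run the original prompt (minus the old assistant reply) on a stronger model."""
    prompt_only = [m for m in messages if m["role"] != "assistant"]
    completion = client.chat.completions.create(
        model="gpt-4",   # use the strongest model you can afford for labeling
        messages=prompt_only,
        temperature=0,
    )
    return completion.choices[0].message.content
```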

Connected Dataset

Finally, choose the dataset your pipeline should feed into! If you later disconnect this dataset from your pipeline, or delete the pipeline altogether, the data it processed will automatically be removed from your dataset. If you'd like to forward data to multiple datasets, create new pipelines with the same settings. LLM filtering and relabeling steps are cached, so you won't be double-charged for duplicate processing!