Principle:Langfuse Evaluation Configuration

From Leeroopedia
Knowledge Sources
Domains LLM Evaluation, Configuration Management
Last Updated 2026-02-14 00:00 GMT

Overview

Evaluation Configuration is the practice of defining reusable LLM-as-a-judge evaluation criteria through templates and binding those templates to production traces via job configurations with filter, sampling, and delay rules.

Description

In any LLM engineering platform, the ability to systematically evaluate the quality of AI-generated outputs is essential. Evaluation Configuration addresses this need by decomposing the problem into two distinct artifacts:

  1. Eval Templates define what to evaluate. A template encapsulates a prompt (with variable placeholders), the LLM model and provider to use, model parameters (temperature, max tokens, etc.), and an output schema that constrains the LLM judge's response to a structured format containing a numeric score and textual reasoning. Templates are versioned by name within a project, allowing iterative refinement without disrupting running evaluations.
  2. Job Configurations define when and how to evaluate. A job configuration binds a template to a target object type (trace or dataset), applies optional filters to select only matching traces, maps template variables to specific columns of trace or observation data, sets a sampling rate (0 to 1) to control evaluation volume, and specifies a delay (in milliseconds) to allow trace data to settle before evaluation begins.

This separation of concerns enables teams to maintain a library of evaluation criteria independently from the operational rules that govern their application. A single template can be referenced by multiple job configurations with different filter and sampling settings.
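As a minimal sketch of this separation (plain Python dataclasses, not the Langfuse SDK; all names and values are illustrative), one template can back several job configurations that differ only in their operational rules:

```python
from dataclasses import dataclass

# Illustrative artifacts; field names mirror the pseudocode in this
# article, not the actual Langfuse API.

@dataclass
class EvalTemplate:
    name: str
    version: int
    prompt: str        # contains {{variable}} placeholders
    vars: list         # variable names referenced in the prompt

@dataclass
class JobConfig:
    eval_template_id: str   # binds to one specific template version
    score_name: str
    filter: list            # conditions selecting which traces qualify
    sampling: float         # fraction of matching traces to evaluate
    delay_ms: int           # settle time before evaluation runs

# One evaluation criterion...
relevance = EvalTemplate(
    name="relevance", version=1,
    prompt="Rate the relevance of {{output}} to {{input}}.",
    vars=["input", "output"],
)

# ...bound twice: sampled in production, exhaustive in staging.
prod_config = JobConfig("relevance-v1", "relevance",
                        [{"column": "environment", "value": "prod"}], 0.1, 30_000)
staging_config = JobConfig("relevance-v1", "relevance", [], 1.0, 0)
```

Both configurations reference the same template version, so refining the relevance prompt later affects every binding at once.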

Usage

Use Evaluation Configuration when:

  • You need to define a new quality metric for LLM outputs (e.g., relevance, factual accuracy, tone)
  • You want to apply the same evaluation logic across different trace populations with varying filter criteria
  • You need to control the cost of evaluations by applying sampling rates less than 1.0
  • You want to evaluate both new incoming traces and historical (existing) traces retroactively
  • You are iterating on an evaluation prompt and need to version templates while keeping job configurations stable

Theoretical Basis

The Evaluation Configuration principle follows a template-binding pattern that separates the evaluation definition from its application context:

Step 1 - Template Creation:

TEMPLATE = {
  name: unique identifier within project,
  version: auto-incremented integer,
  prompt: string with {{variable}} placeholders,
  model: LLM model identifier (or null for project default),
  provider: LLM provider adapter (or null for project default),
  modelParams: { temperature, maxTokens, topP, ... },
  outputSchema: {
    score: description string for the numeric score field,
    reasoning: description string for the reasoning field
  },
  vars: list of variable names referenced in the prompt
}

Step 2 - Template Validation:

Before persisting a template, the system validates the model configuration by:

  1. Resolving the model provider and API key (using explicit values or project-level defaults)
  2. Making a test structured-output call to the LLM to confirm the key and model support structured output
  3. Rejecting the template if validation fails, preventing broken evaluations from being created

Step 3 - Job Configuration Creation:

JOB_CONFIG = {
  evalTemplateId: reference to a specific template version,
  scoreName: name of the score to produce,
  target: "trace" or "dataset",
  filter: array of filter conditions on trace/dataset columns,
  variableMapping: array mapping template vars to data sources,
  sampling: float in (0, 1] controlling execution probability,
  delay: milliseconds to wait before executing evaluation,
  timeScope: ["NEW"] or ["EXISTING"] or ["NEW", "EXISTING"],
  status: "ACTIVE" | "INACTIVE"
}

Step 4 - Time Scope Handling:

When a job configuration includes "EXISTING" in its time scope, a batch action is enqueued to retroactively apply the evaluation to all matching historical traces. This allows teams to backfill scores for traces that were ingested before the evaluator was created.
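An illustrative backfill sketch, with `fetch_traces` and `enqueue` standing in for the platform's trace query and job queue (both hypothetical; a real system would batch this rather than enqueue one job at a time):

```python
def enqueue_backfill(job_config, fetch_traces, enqueue):
    """If the config targets EXISTING traces, enqueue one evaluation job
    per matching historical trace and return how many were enqueued."""
    if "EXISTING" not in job_config["timeScope"]:
        return 0
    count = 0
    for trace in fetch_traces(job_config["filter"]):
        enqueue({"traceId": trace["id"], "configId": job_config["id"]})
        count += 1
    return count
```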

Step 5 - Template Version Propagation:

When a new version of a template is created with the UPDATE referencing mode, all active job configurations that reference any prior version of the same template are automatically updated to reference the new version. This ensures evaluations always use the latest prompt without manual reconfiguration.
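The propagation rule amounts to repointing every active job configuration from any prior version of the template to the new one. A sketch under those assumptions (in-memory dicts in place of database rows):

```python
def propagate_new_version(job_configs, template_name, new_template_id, versions_by_name):
    """UPDATE referencing mode (sketch): repoint active job configs that
    reference any prior version of `template_name` to the new version.
    Returns the ids of the configs that were updated."""
    prior_ids = versions_by_name[template_name]
    updated = []
    for cfg in job_configs:
        if cfg["status"] == "ACTIVE" and cfg["evalTemplateId"] in prior_ids:
            cfg["evalTemplateId"] = new_template_id
            updated.append(cfg["id"])
    return updated
```

Note that inactive configurations are left untouched, so a paused evaluator keeps its pinned version until reactivated and updated.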

Related Pages

Implemented By
