

Workflow: Langfuse Evaluation Pipeline

From Leeroopedia
Knowledge Sources
Domains LLM_Ops, Evaluation, LLM_As_Judge
Last Updated 2026-02-14 05:00 GMT

Overview

End-to-end process for automatically evaluating LLM application traces and observations using configurable LLM-as-judge evaluation templates, from configuration setup through score creation and persistence.

Description

This workflow describes how Langfuse's evaluation system uses LLM-as-judge to automatically score traces and observations. Users configure evaluation templates (defining LLM prompts, output schemas, and model settings) and job configurations (defining filters, variable mappings, sampling rates, and delays). When new traces are ingested or historical traces are retroactively evaluated, the system creates evaluation jobs, extracts variables from trace data, compiles prompts, calls an LLM for structured scoring, and persists the resulting scores back into the system. The pipeline supports both trace-level and observation-level evaluations with comprehensive error handling and retry logic.

Usage

Execute this workflow when you need automated quality evaluation of LLM traces. This is used when teams configure evaluation templates with custom scoring criteria and want those evaluations applied automatically to new incoming traces, retroactively to historical traces, or to specific observations within traces.

Execution Steps

Step 1: Evaluation Configuration

Users create evaluation templates and job configurations through the tRPC API. Templates define the LLM prompt with variable placeholders, the expected output schema (score value and reasoning), and model configuration. Job configurations specify which traces to evaluate (via filters), how to map template variables to trace fields, sampling probability, execution delay, and time scope (new traces, existing traces, or both).

Key considerations:

  • Templates support project-level and Langfuse-managed variants
  • Job configurations can target TRACE, DATASET, or OBSERVATION objects
  • A test LLM call is made during template creation to validate the configuration
  • Time scope controls whether evaluations run on live data, historical data, or both
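The two configuration objects described above can be sketched as TypeScript shapes. All field names here are illustrative stand-ins, not Langfuse's actual schema:

```typescript
// Hypothetical shapes for an eval template and job configuration.
// Field names are illustrative, not Langfuse's real database schema.
interface EvalTemplate {
  id: string;
  prompt: string;                         // contains {{variable}} placeholders
  outputSchema: { score: string; reasoning: string }; // descriptions for structured output
  model: { provider: string; model: string; temperature: number };
}

interface JobConfiguration {
  templateId: string;
  targetObject: "TRACE" | "DATASET" | "OBSERVATION";
  filter: { column: string; operator: string; value: string }[];
  variableMapping: { templateVariable: string; jsonSelector: string }[];
  samplingRate: number;                   // 0..1 probability a matching trace is evaluated
  delaySeconds: number;                   // wait before executing, so the trace can finish
  timeScope: ("NEW" | "EXISTING")[];      // live traces, historical traces, or both
}

const config: JobConfiguration = {
  templateId: "tmpl-helpfulness",
  targetObject: "TRACE",
  filter: [{ column: "name", operator: "=", value: "chat-completion" }],
  variableMapping: [{ templateVariable: "input", jsonSelector: "$.input" }],
  samplingRate: 0.1,
  delaySeconds: 30,
  timeScope: ["NEW", "EXISTING"],
};
console.log(config.targetObject);
```

The separation mirrors the design: one template can be reused by many job configurations that differ only in filters, mappings, and sampling.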

Step 2: Evaluation Job Triggering

Evaluation jobs are triggered from three sources: trace upsert events (for live evaluations), dataset run item upsert events (for dataset evaluations), and batch action queue events (for retroactive historical evaluations). Each trigger dispatches a message to the CreateEvalQueue with the relevant trace or dataset context.

Key considerations:

  • Internal Langfuse traces (environment starting with "langfuse-") are excluded to prevent infinite evaluation loops
  • A Redis cache optimization marks projects with no active eval configs to skip processing
  • Multiple trigger sources converge into a single CreateEvalQueue for unified handling
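The gating logic in this step can be sketched as follows. The function names and the in-process `Set` standing in for the Redis cache are assumptions for illustration:

```typescript
// Illustrative trigger gate: skip internal traces and projects with no active
// eval configs. `noConfigCache` stands in for the Redis negative-lookup cache.
type TraceEvent = { projectId: string; traceId: string; environment: string };

const noConfigCache = new Set<string>(); // project IDs known to have no active eval configs

function shouldEnqueue(event: TraceEvent, hasActiveConfigs: (p: string) => boolean): boolean {
  if (event.environment.startsWith("langfuse-")) return false; // prevent eval-of-eval loops
  if (noConfigCache.has(event.projectId)) return false;        // cached negative lookup
  if (!hasActiveConfigs(event.projectId)) {
    noConfigCache.add(event.projectId);                        // remember the miss
    return false;
  }
  return true;
}

console.log(shouldEnqueue(
  { projectId: "p1", traceId: "t1", environment: "langfuse-llm-as-a-judge" },
  () => true,
)); // false — internal trace, excluded
console.log(shouldEnqueue(
  { projectId: "p1", traceId: "t2", environment: "production" },
  () => true,
)); // true — enqueued to CreateEvalQueue
```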

Step 3: Evaluation Job Creation

The worker processes CreateEvalQueue jobs by fetching all active evaluation configurations for the project, checking trace existence and filter matches, applying sampling, and deduplicating against existing job executions. For each matching configuration, a job_execution record is created with PENDING status and queued to the EvalExecutionQueue with the configured delay.

Key considerations:

  • Filter evaluation uses InMemoryFilterService for optimization where possible
  • Observation existence is validated with retry on ObservationNotFoundError
  • Jobs are cancelled if a trace no longer matches updated filters
  • Deduplication prevents duplicate evaluations for the same config+trace+observation combination
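Sampling and deduplication from this step can be sketched together. The dedup key combines config, trace, and observation as described above; the `Set` stands in for a lookup against existing job_execution records:

```typescript
// Sketch of sampling plus deduplication when creating job executions.
// `existingJobs` stands in for a query against prior job_execution records.
const existingJobs = new Set<string>();

function maybeCreateJob(
  configId: string,
  traceId: string,
  observationId: string | null,
  samplingRate: number,
  rand: () => number,          // injected for determinism; normally Math.random
): string | null {
  const key = `${configId}:${traceId}:${observationId ?? ""}`;
  if (existingJobs.has(key)) return null;  // already evaluated this combination
  if (rand() >= samplingRate) return null; // sampled out
  existingJobs.add(key);
  return key;                              // stand-in for a PENDING job_execution record
}

console.log(maybeCreateJob("c1", "t1", null, 1.0, () => 0.5)); // created
console.log(maybeCreateJob("c1", "t1", null, 1.0, () => 0.5)); // deduplicated → null
```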

Step 4: Variable Extraction

When the evaluation execution begins, variables are extracted from the trace or observation data. For trace-level evaluations, trace and observation data is fetched from ClickHouse and PostgreSQL, with JSONPath selectors applied according to the variable mapping. For observation-level evaluations, observation data is downloaded from S3 where it was stored during ingestion.

Key considerations:

  • Variable extraction supports JSONPath selectors for deep field access
  • Caching prevents duplicate lookups when multiple variables reference the same trace
  • Observation-level evals store observation snapshots in S3 for consistent evaluation
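A minimal extraction sketch follows. The worker applies real JSONPath selectors; a simple dotted-path lookup stands in here, and the cache keyed by trace ID shows how multiple variables referencing the same trace trigger only one fetch:

```typescript
// Variable-extraction sketch. A dotted-path resolver stands in for JSONPath,
// and `traceCache` models the dedup of trace lookups described above.
const traceCache = new Map<string, unknown>();

function fetchTrace(traceId: string, load: (id: string) => unknown): unknown {
  if (!traceCache.has(traceId)) traceCache.set(traceId, load(traceId)); // fetch once
  return traceCache.get(traceId);
}

function selectField(obj: unknown, path: string): unknown {
  return path.split(".").reduce<any>((acc, key) => (acc == null ? undefined : acc[key]), obj);
}

let loads = 0; // counts simulated ClickHouse/PostgreSQL fetches
const load = (_id: string) => {
  loads++;
  return { input: { question: "What is RAG?" }, output: "…" };
};

const input = selectField(fetchTrace("t1", load), "input.question");
selectField(fetchTrace("t1", load), "output"); // second variable, same trace
console.log(input, loads); // the single cached fetch served both variables
```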

Step 5: Prompt Compilation and LLM Execution

Extracted variables are substituted into the evaluation template prompt. The output schema is converted to a Zod schema for LLM structured output. The compiled prompt is sent to the configured LLM provider (OpenAI, Anthropic, Google, Azure, Bedrock) with structured output constraints requiring a numeric score and reasoning string.

Key considerations:

  • Falls back to the project's default evaluation model if no model is specified in the template
  • LLM calls create internal Langfuse traces for debugging (environment: langfuse-llm-as-a-judge)
  • Multiple LLM providers are supported through a unified fetchLLMCompletion interface
  • Structured output ensures consistent score format across different models
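The compile-and-validate flow can be sketched as below. Langfuse converts the template's output schema to a Zod schema; to keep this sketch dependency-free, a hand-rolled validator stands in, and no actual LLM call is made:

```typescript
// Prompt compilation: {{variable}} placeholders are substituted with extracted
// values. The validator enforces the score/reasoning contract that Langfuse
// expresses as a Zod schema (hand-rolled here to avoid the dependency).
function compilePrompt(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_, name) => vars[name] ?? "");
}

function validateStructuredOutput(raw: unknown): { score: number; reasoning: string } {
  const o = raw as { score?: unknown; reasoning?: unknown };
  if (typeof o?.score !== "number" || typeof o?.reasoning !== "string") {
    throw new Error("LLM output did not match the expected schema");
  }
  return { score: o.score, reasoning: o.reasoning };
}

const prompt = compilePrompt(
  "Rate the helpfulness of this answer: {{output}}",
  { output: "Paris is the capital of France." },
);
// `parsed` models a structured-output response from the configured provider.
const parsed = validateStructuredOutput({ score: 0.9, reasoning: "Accurate and concise." });
console.log(prompt);
console.log(parsed.score);
```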

Step 6: Response Validation and Score Creation

The LLM response is validated against the expected schema. A score event is constructed with the numeric score value, reasoning text, source type EVAL, and references to the evaluated trace and observation. The score is uploaded to S3 and enqueued for ingestion through the standard score ingestion pipeline. The job_execution record is updated to COMPLETED status.

Key considerations:

  • Invalid LLM responses throw UnrecoverableError to prevent futile retries
  • Score events include execution trace IDs for end-to-end debugging
  • The score is persisted through the same ingestion pipeline used for SDK-submitted scores
  • Job execution records track start time, end time, status, and output score ID
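A score event carrying the fields named above might be shaped like this; the exact field names and ID format are illustrative, not Langfuse's ingestion schema:

```typescript
// Illustrative score event for the EVAL source. Field names are assumptions;
// the real event is uploaded to S3 and ingested like an SDK-submitted score.
interface ScoreEvent {
  id: string;
  traceId: string;
  observationId: string | null; // set for observation-level evaluations
  name: string;
  value: number;
  comment: string;              // the judge's reasoning text
  source: "EVAL";
}

function buildScoreEvent(
  traceId: string,
  observationId: string | null,
  scoreName: string,
  value: number,
  reasoning: string,
): ScoreEvent {
  return {
    id: `score-${traceId}-${scoreName}`, // hypothetical ID scheme
    traceId,
    observationId,
    name: scoreName,
    value,
    comment: reasoning,
    source: "EVAL",
  };
}

const event = buildScoreEvent("t1", null, "helpfulness", 0.9, "Accurate and concise.");
console.log(event.source, event.value);
```

Reusing the standard score ingestion pipeline means eval scores get the same validation, storage, and UI treatment as scores submitted via the SDK.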

Step 7: Error Handling and Retry

Errors are classified as retryable or non-retryable. LLM rate limits (HTTP 429) and server errors (5xx) trigger delayed retries with exponential backoff up to a 24-hour job age limit. Client errors (4xx except 429), invalid schemas, and missing configurations are non-retryable and immediately mark the job as ERROR. Failed jobs record error messages and timestamps for debugging.

Key considerations:

  • Maximum job age for retries is 24 hours from creation
  • Retry delays range from 1 to 25 minutes with exponential backoff
  • Jobs in DELAYED status are rescheduled rather than creating new executions
  • BullMQ provides up to 5 additional retry attempts at the queue level
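The retry policy above can be sketched as a pure decision function. The exponential-backoff formula is an assumption; only the 429/5xx classification, the 1–25 minute delay range, and the 24-hour age limit come from the description:

```typescript
// Retry classification sketch: 429 and 5xx retry with exponential backoff
// clamped to 1–25 minutes; other errors, or jobs past the 24 h age limit,
// fail hard (marked ERROR, no further attempts).
const MAX_JOB_AGE_MS = 24 * 60 * 60 * 1000;

function retryDecision(
  status: number,
  attempt: number,  // zero-based retry count
  jobAgeMs: number, // time since the job_execution was created
): { retry: boolean; delayMs?: number } {
  const retryable = status === 429 || status >= 500;
  if (!retryable || jobAgeMs > MAX_JOB_AGE_MS) return { retry: false };
  const delayMs = Math.min(25 * 60_000, 60_000 * 2 ** attempt); // 1, 2, 4, … min, capped
  return { retry: true, delayMs };
}

console.log(retryDecision(429, 0, 0));                     // retry after 1 min
console.log(retryDecision(500, 3, 0));                     // retry after 8 min
console.log(retryDecision(400, 0, 0));                     // non-retryable client error
console.log(retryDecision(429, 0, MAX_JOB_AGE_MS + 1));    // too old → ERROR
```

This worker-level policy sits on top of BullMQ's own queue-level retries, which add up to five further attempts.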

Execution Diagram

GitHub URL

Workflow Repository