Principle: Langfuse Evaluation Job Creation
| Knowledge Sources | |
|---|---|
| Domains | LLM Evaluation, Workflow Orchestration |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Evaluation Job Creation is the principle of matching incoming trace or dataset events against active evaluation configurations, applying filters, deduplication, and sampling, then creating individual job execution records and enqueuing them for downstream LLM evaluation.
Description
Once an event (trace upsert, dataset run item upsert, or UI-triggered batch action) arrives from the triggering layer, the system must determine which evaluation configurations apply, verify that the target data exists, and create concrete execution records. Evaluation Job Creation encapsulates this decision logic.
The process involves multiple stages of filtering and validation:
- Configuration Retrieval -- All active job configurations for the event's project are fetched from the database. Configurations are filtered by type (EVAL), status (ACTIVE), and target object (TRACE or DATASET). If an enforced time scope is provided (e.g., "NEW" for live trace events), only configurations with a matching time scope are included.
- Infinite Loop Prevention -- Traces with environments prefixed by "langfuse-" are identified as internal evaluation traces and are excluded from job creation when the event source is trace-upsert. This prevents the infinite cycle: user trace produces eval trace, which triggers another eval, and so on.
- Trace Existence and Filter Validation -- The system verifies the trace exists in ClickHouse and applies the evaluation's filter conditions. An in-memory filter optimization is used when the trace data has already been fetched for other configurations, avoiding redundant ClickHouse queries.
- Dataset Item Resolution -- For dataset-targeted evaluations, the system resolves the dataset item linked to the trace. This involves looking up dataset items by trace ID with optional version matching and applying any dataset-level filter conditions.
- Observation Existence Check -- When the event references a specific observation (common in dataset run items linked at the observation level), the system verifies the observation exists. If not found, an ObservationNotFoundError is thrown to trigger a retry, accommodating data replication delays.
- Deduplication -- Existing job executions for the same configuration and trace are checked in a single batched query. If an execution already exists, the new event is skipped to prevent duplicate evaluations.
- Sampling -- The configuration's sampling rate (a float between 0 and 1) is applied probabilistically. A random number is generated, and if it exceeds the sampling rate, the job is skipped.
- Execution Creation and Enqueuing -- A new job execution record is created with PENDING status, and a message is enqueued to the EvalExecutionQueue with an optional delay (in milliseconds) to allow trace data to settle.
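Two of the stages above, infinite-loop prevention and sampling, reduce to small predicates. The sketch below illustrates them with hypothetical, simplified types; the names (`EvalConfig`, `TraceEvent`, `isInternalEvalTrace`, `passesSampling`) are illustrative, not the actual Langfuse implementation:

```typescript
// Hypothetical, simplified shapes for illustration only.
interface EvalConfig {
  id: string;
  status: "ACTIVE" | "INACTIVE";
  sampling: number; // float in (0, 1]
  delayMs: number;  // enqueue delay to let trace data settle
}

interface TraceEvent {
  traceId: string;
  environment: string; // e.g. "production" or "langfuse-prompt-experiment"
  source: "trace-upsert" | "dataset-run-item-upsert" | "batch";
}

// Infinite-loop prevention: internal eval traces (environment prefixed
// with "langfuse-") are excluded, but only for live trace-upsert events.
function isInternalEvalTrace(event: TraceEvent): boolean {
  return (
    event.source === "trace-upsert" &&
    event.environment.startsWith("langfuse-")
  );
}

// Probabilistic sampling: a uniform draw above the configured rate
// means the job is skipped. The rand parameter is injectable for tests.
function passesSampling(
  config: EvalConfig,
  rand: () => number = Math.random,
): boolean {
  return config.sampling === 1 || rand() <= config.sampling;
}
```

A configuration with `sampling: 0.1` would therefore create executions for roughly 10% of matching traces, with the decision made independently per trace.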
Usage
Use Evaluation Job Creation when:
- You need to understand how the system decides which evaluations to run for a given trace
- You want to trace why an evaluation was or was not created for a specific trace
- You are debugging evaluation deduplication or sampling behavior
- You need to understand the data flow between the triggering layer and the execution layer
Theoretical Basis
The Evaluation Job Creation principle implements a filter-sample-deduplicate-enqueue pipeline:
Step 1 - Configuration Matching:
```
configs = QUERY job_configurations
  WHERE job_type = "EVAL"
    AND project_id = event.projectId
    AND status = "ACTIVE"
    AND target_object IN ("trace", "dataset")
    AND (event.configId IS NULL OR id = event.configId)
    AND (enforcedTimeScope IS NULL OR time_scope CONTAINS enforcedTimeScope)
```
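The same matching logic can be expressed as an in-memory predicate, which may help when reasoning about why a configuration was or was not selected. This is an illustrative sketch with assumed field names, not the database query the system actually runs:

```typescript
// Hypothetical shape of a job configuration row (field names assumed).
interface JobConfiguration {
  id: string;
  jobType: string;
  projectId: string;
  status: string;
  targetObject: "trace" | "dataset";
  timeScope: string[]; // e.g. ["NEW"], ["EXISTING"], or both
}

// In-memory version of the Step 1 WHERE clause: a config matches when it
// is an active EVAL config for the event's project, targets traces or
// datasets, and satisfies the optional configId / time-scope constraints.
function matchesEvent(
  config: JobConfiguration,
  event: { projectId: string; configId?: string },
  enforcedTimeScope?: string,
): boolean {
  return (
    config.jobType === "EVAL" &&
    config.projectId === event.projectId &&
    config.status === "ACTIVE" &&
    (config.targetObject === "trace" || config.targetObject === "dataset") &&
    (event.configId === undefined || config.id === event.configId) &&
    (enforcedTimeScope === undefined ||
      config.timeScope.includes(enforcedTimeScope))
  );
}
```

Note how `enforcedTimeScope` narrows the result: a live trace event enforcing `"NEW"` only matches configurations whose time scope includes `"NEW"`.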
Step 2 - Caching Optimization:
```
IF configs.length > 1:
  cachedTrace = FETCH trace FROM ClickHouse (excluding input/output for performance)
  cachedDatasetItemIds = FETCH dataset item IDs FROM ClickHouse
  allExistingJobs = BATCH QUERY job_executions
    WHERE project_id = event.projectId
      AND trace_id = event.traceId
      AND config_id IN configs.map(id)
```
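One way to use the result of that single batched query is to index it by configuration ID, so each config's deduplication check in the loop is a map lookup rather than another round trip. A minimal sketch, assuming a simplified `JobExecution` shape:

```typescript
// Hypothetical, simplified job execution row.
interface JobExecution {
  configId: string;
  traceId: string;
  status: string;
}

// Index the batched job_executions result by config ID so the per-config
// dedup check becomes an O(1) map lookup instead of a per-config query.
function indexByConfig(jobs: JobExecution[]): Map<string, JobExecution[]> {
  const byConfig = new Map<string, JobExecution[]>();
  for (const job of jobs) {
    const bucket = byConfig.get(job.configId) ?? [];
    bucket.push(job);
    byConfig.set(job.configId, bucket);
  }
  return byConfig;
}
```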
Step 3 - Per-Config Processing Loop:
```
FOR EACH config IN configs:
  // Skip inactive configs
  IF config.status == INACTIVE: CONTINUE

  // Check trace existence with filter
  IF cachedTrace AND filter is in-memory evaluable:
    traceExists = IN_MEMORY_FILTER(cachedTrace, config.filter)
  ELSE:
    traceExists = DATABASE_LOOKUP(traceId, filter)

  // Resolve dataset item if applicable
  IF config.target == "dataset":
    datasetItem = RESOLVE_DATASET_ITEM(event, config.filter)
    // Skip observation-level dataset evals from trace-upsert source
    IF source == "trace-upsert" AND datasetItem.observationId EXISTS:
      CONTINUE

  // Check observation existence if referenced
  IF event.observationId:
    IF NOT observationExists(observationId):
      THROW ObservationNotFoundError  // triggers retry

  // Deduplication check
  existingJob = FIND_MATCHING_JOB(config.id, datasetItemId, observationId)

  IF traceExists AND (NOT datasetConfig OR datasetItem EXISTS):
    IF existingJob: CONTINUE  // Already evaluated

    // Sampling
    IF config.sampling != 1:
      IF random() > config.sampling: CONTINUE  // Sampled out

    // Create execution and enqueue
    CREATE jobExecution(status: PENDING)
    ENQUEUE to EvalExecutionQueue(delay: config.delay)
  ELSE:
    // Cancel stale execution if trace no longer matches
    IF existingJob AND existingJob.status != COMPLETED:
      UPDATE existingJob SET status = CANCELLED
```
Cancellation Semantics:
An important aspect of this pipeline is that trace updates can deselect a trace from an evaluation. If a trace previously matched an evaluation's filter but no longer matches after an update, any pending (non-completed) execution for that trace is cancelled. This ensures evaluations reflect the most current state of trace data.
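The cancellation rule is a small state transition. The sketch below captures it with assumed status names (only `COMPLETED` is quoted in the pseudocode above; the others are illustrative):

```typescript
// Status names other than COMPLETED are assumed for illustration.
type JobStatus = "PENDING" | "RUNNING" | "COMPLETED" | "CANCELLED";

// Deselection rule: when a trace update makes the trace stop matching an
// evaluation's filter, any not-yet-completed execution is cancelled;
// completed executions are left untouched.
function nextStatus(traceStillMatches: boolean, current: JobStatus): JobStatus {
  if (!traceStillMatches && current !== "COMPLETED") {
    return "CANCELLED";
  }
  return current;
}
```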
Event Loop Yielding:
The per-config processing loop yields to the Node.js event loop between iterations using setImmediate(). This prevents long-running evaluation job creation from blocking other tasks in the worker process.
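The yielding pattern described above can be sketched as follows; `processConfigs` is a hypothetical name, but the `setImmediate`-based yield is the standard Node.js idiom for breaking up long synchronous loops:

```typescript
// Process items one at a time, yielding to the Node.js event loop between
// iterations so other queued tasks in the worker process can run.
async function processConfigs<T>(
  configs: T[],
  handle: (config: T) => void,
): Promise<void> {
  for (const config of configs) {
    handle(config);
    // setImmediate schedules the continuation after pending I/O callbacks,
    // so a long config list cannot starve the rest of the worker.
    await new Promise<void>((resolve) => setImmediate(resolve));
  }
}
```

Without the yield, a project with many active configurations could hold the event loop for the entire duration of the matching loop.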