Principle:Langfuse Langfuse Observation Evaluation Scheduling
| Knowledge Sources | |
|---|---|
| Domains | LLM Evaluation, Job Orchestration |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Observation evaluation scheduling is the process of matching completed LLM generation observations against configured evaluation rules, applying filtering and sampling criteria, and dispatching qualifying observations to an asynchronous evaluation queue for LLM-as-judge scoring.
Description
After an experiment executes an LLM call and produces a generation observation, the platform needs to determine whether any automated evaluations should be applied to that observation. This is governed by evaluation configurations ("eval configs") that project administrators define in advance. Each eval config specifies:
- Target object: Whether the config applies to general events, experiments, or specific observation types.
- Filter conditions: Optional filter state that constrains which observations are eligible (e.g., only observations with a specific prompt name or environment).
- Sampling rate: A decimal between 0 and 1 that controls what fraction of matching observations are actually evaluated, allowing cost-controlled evaluation at scale.
- Eval template: The LLM-as-judge template to use for scoring.
The scheduling process must be efficient: it should avoid uploading observation data to S3 unless at least one eval config matches. When multiple configs match, the observation data should be uploaded only once and shared across all configs. Each matching config results in an independent job execution record and queue entry.
This same scheduling mechanism is used for both externally-ingested observations (via the OTEL ingestion pipeline) and internally-generated experiment observations, providing a unified evaluation framework.
Usage
Observation evaluation scheduling is used when:
- An experiment LLM call completes and produces generation details.
- An externally-ingested observation arrives via the OTEL pipeline and matches event-targeted eval configs.
- A project has configured LLM-as-judge evaluations that should be automatically applied to qualifying observations.
Theoretical Basis
The scheduling algorithm follows a filter-sample-upload-dispatch pattern:
Phase 1 -- Configuration Matching
For each eval config associated with the project, two checks are performed:
- Filter evaluation: The observation's properties are tested against the config's filter conditions using an in-memory filter service. The filter supports various column types (name, environment, prompt name, etc.) and comparison operators. An empty filter matches all observations. For experiment-targeted configs, an additional constraint requires that the observation be the root span of the experiment item (i.e., span_id == experiment_item_root_span_id).
- Sampling: A random sampling check is applied using the config's sampling rate. An observation that passes the filter may still be excluded by sampling, enabling cost-effective evaluation of high-volume workloads.
Only configs that pass both checks proceed.
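The sampling check can be illustrated with a simple Bernoulli draw. This is a minimal sketch, not the platform's implementation: the function name should_sample, the clamping of out-of-range rates, and the optional rng parameter are all assumptions introduced here for illustration.

```python
import random
from typing import Optional


def should_sample(sampling_rate: float, rng: Optional[random.Random] = None) -> bool:
    """Return True with probability `sampling_rate` (hypothetical helper)."""
    # Assumption: out-of-range rates are clamped into [0, 1] rather than rejected.
    rate = min(max(sampling_rate, 0.0), 1.0)
    # random() is uniform on [0.0, 1.0), so a rate of 0 never passes
    # and a rate of 1 always passes.
    draw = (rng or random).random()
    return draw < rate


if __name__ == "__main__":
    seeded = random.Random(7)  # seeded only to make the demo repeatable
    trials = 10_000
    hits = sum(should_sample(0.25, seeded) for _ in range(trials))
    print(f"sampled fraction: {hits / trials:.3f}")  # close to 0.25
```

Because the draw is independent per config, the same observation may be sampled in for one eval config and sampled out for another.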
Phase 2 -- Data Upload
If at least one config matches, the observation data is uploaded to S3 (or S3-compatible storage) exactly once. The upload returns an S3 path that is shared across all matching configs. This avoids redundant uploads when multiple eval configs target the same observation.
Phase 3 -- Job Dispatch
For each matching config, two operations are performed concurrently:
- A job execution record is created in the database with status PENDING. The record links the eval config, the observation, and the eval template. The job execution ID is generated deterministically from the config ID and observation ID, enabling idempotent re-scheduling.
- An eval job is enqueued to the LLMAsJudgeExecution queue with the job execution ID, project ID, and S3 path. The job has zero delay (it should execute as soon as a worker is available).
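One way to see why a deterministic ID makes re-scheduling idempotent is the sketch below. The source states only that the ID is derived from the config ID and observation ID; the use of UUIDv5, the namespace string, and the InMemoryJobStore class are assumptions made for illustration.

```python
import uuid
from typing import Dict

# Hypothetical namespace: any fixed UUID works, as long as it never changes.
JOB_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "example:eval-job-execution")


def deterministic_job_id(config_id: str, observation_id: str) -> str:
    """Derive a stable job execution ID from the (config, observation) pair."""
    # The NUL separator keeps ("ab", "c") distinct from ("a", "bc").
    return str(uuid.uuid5(JOB_NAMESPACE, f"{config_id}\x00{observation_id}"))


class InMemoryJobStore:
    """Stand-in for the job execution table, keyed by job ID."""

    def __init__(self) -> None:
        self.records: Dict[str, dict] = {}

    def create_pending(self, config_id: str, observation_id: str) -> bool:
        """Insert a PENDING record; return False if it already exists."""
        job_id = deterministic_job_id(config_id, observation_id)
        if job_id in self.records:
            return False  # re-scheduling the same pair is a no-op
        self.records[job_id] = {
            "config_id": config_id,
            "observation_id": observation_id,
            "status": "PENDING",
        }
        return True


if __name__ == "__main__":
    store = InMemoryJobStore()
    print(store.create_pending("cfg-1", "span-1"))  # True: first insert
    print(store.create_pending("cfg-1", "span-1"))  # False: same ID, no duplicate
    print(len(store.records))  # 1
```

With a random ID, a retried scheduling call would create a second record for the same pair; with a derived ID, the second attempt collides with the first and can be skipped or upserted.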
FUNCTION scheduleObservationEvals(observation, configs, deps):
    IF configs is empty:
        RETURN
    matchingConfigs = []
    FOR EACH config IN configs:
        IF NOT evaluateFilter(observation, config):
            CONTINUE
        IF NOT shouldSample(config.samplingRate):
            CONTINUE
        matchingConfigs.append(config)
    IF matchingConfigs is empty:
        RETURN
    s3Path = deps.uploadObservationToS3(observation)
    FOR EACH config IN matchingConfigs (parallel):
        jobId = deterministicId(config.id, observation.span_id)
        deps.createJobExecution(jobId, config, observation, status=PENDING)
        deps.enqueueEvalJob(jobId, observation.project_id, s3Path, delay=0)
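The pseudocode above can be exercised end to end with in-memory stand-ins for S3, the database, and the queue. The sketch below is illustrative only: EvalConfig, FakeDeps, the equality-only filter shape, the path format, and the string job ID are hypothetical, and dispatch runs sequentially rather than in parallel to keep the example short.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class EvalConfig:
    id: str
    sampling_rate: float = 1.0
    # Hypothetical filter shape: field name -> required value; empty matches all.
    filter: Dict[str, str] = field(default_factory=dict)


@dataclass
class FakeDeps:
    """In-memory stand-ins for object storage, the database, and the queue."""
    uploads: List[str] = field(default_factory=list)
    jobs: Dict[str, str] = field(default_factory=dict)
    queue: List[Tuple[str, str, str]] = field(default_factory=list)

    def upload_observation(self, obs: dict) -> str:
        path = f"observations/{obs['project_id']}/{obs['span_id']}.json"
        self.uploads.append(path)
        return path

    def create_job_execution(self, job_id: str) -> None:
        self.jobs.setdefault(job_id, "PENDING")

    def enqueue_eval_job(self, job_id: str, project_id: str, s3_path: str) -> None:
        self.queue.append((job_id, project_id, s3_path))


def schedule_observation_evals(obs: dict, configs: List[EvalConfig],
                               deps: FakeDeps, rng=random) -> int:
    """Filter, sample, upload once, then dispatch one job per matching config."""
    matching = [
        c for c in configs
        # Filter first; the sampling draw only happens if the filter passes.
        if all(obs.get(k) == v for k, v in c.filter.items())
        and rng.random() < c.sampling_rate
    ]
    if not matching:
        return 0  # nothing matched, so nothing is uploaded
    s3_path = deps.upload_observation(obs)  # exactly one upload, shared below
    for c in matching:
        job_id = f"{c.id}:{obs['span_id']}"  # deterministic per (config, observation)
        deps.create_job_execution(job_id)
        deps.enqueue_eval_job(job_id, obs["project_id"], s3_path)
    return len(matching)


if __name__ == "__main__":
    deps = FakeDeps()
    obs = {"project_id": "proj-1", "span_id": "span-1", "environment": "production"}
    configs = [
        EvalConfig(id="cfg-all"),
        EvalConfig(id="cfg-prod", filter={"environment": "production"}),
        EvalConfig(id="cfg-staging", filter={"environment": "staging"}),
        EvalConfig(id="cfg-never", sampling_rate=0.0),
    ]
    scheduled = schedule_observation_evals(obs, configs, deps)
    print(scheduled, len(deps.uploads), len(deps.queue))  # 2 1 2
```

Two configs match, yet only one upload is recorded, and both queue entries carry the same path. Running the function with only non-matching configs leaves the upload list empty.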
Filter Evaluation Detail
The filter evaluation uses an in-memory filter service that maps filter column identifiers to observation field values. For experiment configs, the filter result is ANDed with a root-span check:
FUNCTION evaluateFilter(observation, config):
    isExperiment = (config.targetObject == "experiment")
    isRootSpan = (observation.span_id == observation.experiment_item_root_span_id)
    isEmptyFilter = (config.filter is null OR empty)
    filterMatch = isEmptyFilter ? true : InMemoryFilterService.evaluate(observation, config.filter)
    IF isExperiment:
        RETURN filterMatch AND isRootSpan
    ELSE:
        RETURN filterMatch
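A small runnable version of this check is sketched below. The condition shape, the column-to-field mapping, the operator names, and the choice to AND conditions together are assumptions for illustration; they do not reproduce the actual InMemoryFilterService.

```python
from typing import Any, Dict, List, Optional

Observation = Dict[str, Any]
# Hypothetical condition shape: {"column": ..., "operator": ..., "value": ...}
Condition = Dict[str, Any]

# Hypothetical mapping from filter column identifiers to observation fields.
COLUMN_TO_FIELD = {
    "name": "name",
    "environment": "environment",
    "promptName": "prompt_name",
}

# Hypothetical operator set; each takes (actual, expected).
OPERATORS = {
    "=": lambda actual, expected: actual == expected,
    "<>": lambda actual, expected: actual != expected,
    "any of": lambda actual, expected: actual in expected,
}


def condition_matches(obs: Observation, cond: Condition) -> bool:
    field_name = COLUMN_TO_FIELD.get(cond["column"])
    op = OPERATORS.get(cond["operator"])
    if field_name is None or op is None:
        return False  # assumption: unknown columns or operators fail closed
    return op(obs.get(field_name), cond["value"])


def evaluate_filter(obs: Observation, target_object: str,
                    conditions: Optional[List[Condition]]) -> bool:
    # all() over an empty list is True, so a null or empty filter matches everything.
    filter_match = all(condition_matches(obs, c) for c in (conditions or []))
    if target_object != "experiment":
        return filter_match
    # Experiment configs additionally require the root span of the experiment item.
    span_id = obs.get("span_id")
    is_root = span_id is not None and span_id == obs.get("experiment_item_root_span_id")
    return filter_match and is_root


if __name__ == "__main__":
    root = {"span_id": "s1", "experiment_item_root_span_id": "s1", "environment": "prod"}
    child = {"span_id": "s2", "experiment_item_root_span_id": "s1", "environment": "prod"}
    prod_only = [{"column": "environment", "operator": "=", "value": "prod"}]
    print(evaluate_filter(root, "experiment", prod_only))   # True
    print(evaluate_filter(child, "experiment", prod_only))  # False: not the root span
    print(evaluate_filter(child, "event", prod_only))       # True: no root-span check
```

The guard on a missing span_id is a deliberate addition: without it, an observation lacking both IDs would compare None to None and pass the root-span check.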