Principle:Langfuse Langfuse Observation Evaluation Scheduling

Knowledge Sources	Langfuse
Domains	LLM Evaluation, Job Orchestration
Last Updated	2026-02-14 00:00 GMT

Overview

Observation evaluation scheduling is the process of matching completed LLM generation observations against configured evaluation rules, applying filtering and sampling criteria, and dispatching qualifying observations to an asynchronous evaluation queue for LLM-as-judge scoring.

Description

After an experiment executes an LLM call and produces a generation observation, the platform needs to determine whether any automated evaluations should be applied to that observation. This is governed by evaluation configurations ("eval configs") that project administrators define in advance. Each eval config specifies:

Target object: Whether the config applies to general events, experiments, or specific observation types.
Filter conditions: Optional filter state that constrains which observations are eligible (e.g., only observations with a specific prompt name or environment).
Sampling rate: A decimal between 0 and 1 that controls what fraction of matching observations are actually evaluated, allowing cost-controlled evaluation at scale.
Eval template: The LLM-as-judge template to use for scoring.
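
To make the four fields concrete, here is a minimal sketch of how an eval config might be represented in memory. The class names, field names, and example values (`EvalConfig`, `FilterCondition`, `"tpl-helpfulness"`) are assumptions for illustration only and are not taken from the Langfuse codebase.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class FilterCondition:
    """One filter clause; the field names here are illustrative."""
    column: str    # e.g. "environment" or "promptName"
    operator: str  # e.g. "="
    value: Any


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical in-memory shape of an eval config."""
    id: str
    target_object: str     # e.g. "event" or "experiment"
    eval_template_id: str  # which LLM-as-judge template scores the observation
    filter: Tuple[FilterCondition, ...] = ()  # empty means "match everything"
    sampling_rate: float = 1.0  # fraction of matching observations to evaluate

    def __post_init__(self) -> None:
        # Reject rates outside the documented 0..1 range early.
        if not 0.0 <= self.sampling_rate <= 1.0:
            raise ValueError("sampling_rate must be between 0 and 1")


config = EvalConfig(
    id="cfg-1",
    target_object="experiment",
    eval_template_id="tpl-helpfulness",
    filter=(FilterCondition("environment", "=", "production"),),
    sampling_rate=0.1,
)
```

Validating the sampling rate at construction time keeps the later sampling check free of range handling.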

Scheduling is designed to avoid unnecessary work: observation data is uploaded to S3 only if at least one eval config matches. When several configs match, the data is uploaded once and the resulting path is shared across all of them. Each matching config still produces its own job execution record and queue entry.

This same scheduling mechanism is used for both externally-ingested observations (via the OTEL ingestion pipeline) and internally-generated experiment observations, providing a unified evaluation framework.

Usage

Observation evaluation scheduling is used when:

An experiment LLM call completes and produces generation details.
An externally-ingested observation arrives via the OTEL pipeline and matches event-targeted eval configs.
A project has configured LLM-as-judge evaluations that should be automatically applied to qualifying observations.

Theoretical Basis

The scheduling algorithm follows a filter-sample-upload-dispatch pattern:

Phase 1 -- Configuration Matching

For each eval config associated with the project, two checks are performed:

Filter evaluation: The observation's properties are tested against the config's filter conditions using an in-memory filter service. The filter supports various column types (name, environment, prompt name, etc.) and comparison operators. An empty filter matches all observations. For experiment-targeted configs, an additional constraint requires that the observation be the root span of the experiment item (i.e., span_id == experiment_item_root_span_id).
Sampling: A random sampling check is applied using the config's sampling rate. An observation that passes the filter may still be excluded by sampling, enabling cost-effective evaluation of high-volume workloads.

Only configs that pass both checks proceed.
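
The sampling check can be sketched as a single uniform draw compared against the rate. The source does not say how the random draw is made, so the function below (`should_sample`, with an optional injectable generator for testing) is an assumption rather than the platform's actual implementation.

```python
import random
from typing import Optional


def should_sample(sampling_rate: float, rng: Optional[random.Random] = None) -> bool:
    """Return True for roughly `sampling_rate` of calls (illustrative sketch)."""
    if sampling_rate >= 1.0:
        return True   # evaluate every matching observation
    if sampling_rate <= 0.0:
        return False  # evaluate none
    rng = rng or random.Random()
    # random() is uniform on [0.0, 1.0), so this is True with probability sampling_rate.
    return rng.random() < sampling_rate


# With a seeded generator the outcome is reproducible, which helps in tests.
seeded = random.Random(42)
kept = sum(should_sample(0.3, seeded) for _ in range(10_000))
```

Over many observations the kept fraction approaches the configured rate, which is what makes the rate usable as a cost control.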

Phase 2 -- Data Upload

If at least one config matches, the observation data is uploaded to S3 (or S3-compatible storage) exactly once. The upload returns an S3 path that is shared across all matching configs. This avoids redundant uploads when multiple eval configs target the same observation.

Phase 3 -- Job Dispatch

For each matching config, two operations are performed concurrently:

A job execution record is created in the database with status PENDING. The record links the eval config, the observation, and the eval template. The job execution ID is generated deterministically from the config ID and the observation ID (the observation's span_id in the pseudocode below), so re-scheduling the same pair yields the same ID and is idempotent.
An eval job is enqueued to the LLMAsJudgeExecution queue with the job execution ID, project ID, and S3 path. The job has zero delay (it should execute as soon as a worker is available).
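
One way to obtain a deterministic job execution ID is a name-based UUID over the two identifiers. The source does not state which derivation is used, so the UUIDv5 approach, the namespace string, and the function name below are all assumptions.

```python
import json
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
JOB_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "eval-jobs.example")


def deterministic_job_id(config_id: str, observation_id: str) -> str:
    """Derive a stable ID from (config_id, observation_id)."""
    # JSON-encoding the pair avoids collisions such as ("a:b", "c") vs ("a", "b:c")
    # that a plain delimiter-joined string could produce.
    name = json.dumps([config_id, observation_id])
    return str(uuid.uuid5(JOB_NAMESPACE, name))


first = deterministic_job_id("cfg-1", "span-abc")
second = deterministic_job_id("cfg-1", "span-abc")
```

Because the same pair always maps to the same ID, a retried scheduling pass can detect or overwrite the existing record instead of creating a duplicate.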

FUNCTION scheduleObservationEvals(observation, configs, deps):
    IF configs is empty:
        RETURN

    matchingConfigs = []
    FOR EACH config IN configs:
        IF NOT evaluateFilter(observation, config):
            CONTINUE
        IF NOT shouldSample(config.samplingRate):
            CONTINUE
        matchingConfigs.append(config)

    IF matchingConfigs is empty:
        RETURN

    s3Path = deps.uploadObservationToS3(observation)

    FOR EACH config IN matchingConfigs (parallel):
        jobId = deterministicId(config.id, observation.span_id)
        deps.createJobExecution(jobId, config, observation, status=PENDING)
        deps.enqueueEvalJob(jobId, observation.project_id, s3Path, delay=0)
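
The pseudocode above can be exercised end to end with in-memory stand-ins for storage, the database, and the queue. Everything in this sketch is hypothetical scaffolding (`Config`, `FakeDeps`, the path layout, the UUIDv5 job ID); it only illustrates the filter-sample-upload-dispatch order and the upload-once behavior.

```python
import random
import uuid
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


def match_all(obs: dict) -> bool:
    return True


@dataclass
class Config:
    id: str
    sampling_rate: float = 1.0
    matches: Callable[[dict], bool] = match_all  # stand-in for filter evaluation


@dataclass
class FakeDeps:
    """In-memory replacements for S3, the database, and the queue."""
    uploads: List[str] = field(default_factory=list)
    job_executions: Dict[str, dict] = field(default_factory=dict)
    queue: List[dict] = field(default_factory=list)

    def upload_observation(self, obs: dict) -> str:
        path = f"observations/{obs['project_id']}/{obs['span_id']}.json"
        self.uploads.append(path)
        return path

    def create_job_execution(self, job_id: str, config: Config, obs: dict) -> None:
        # Keyed by job ID, so re-scheduling overwrites rather than duplicates.
        self.job_executions[job_id] = {
            "config_id": config.id, "span_id": obs["span_id"], "status": "PENDING"}

    def enqueue_eval_job(self, job_id: str, project_id: str, s3_path: str, delay: int = 0) -> None:
        self.queue.append(
            {"job_id": job_id, "project_id": project_id, "s3_path": s3_path, "delay": delay})


def schedule_observation_evals(obs: dict, configs: List[Config], deps: FakeDeps,
                               rng: Optional[random.Random] = None) -> List[str]:
    rng = rng or random.Random()
    # Phase 1: filter first, then sample (the draw happens only if the filter passes).
    matching = [c for c in configs if c.matches(obs) and rng.random() < c.sampling_rate]
    if not matching:
        return []  # nothing matched, so nothing is uploaded
    # Phase 2: a single upload shared by every matching config.
    s3_path = deps.upload_observation(obs)

    # Phase 3: one record and one queue entry per config, configs handled in parallel.
    def dispatch(config: Config) -> str:
        job_id = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{config.id}:{obs['span_id']}"))
        deps.create_job_execution(job_id, config, obs)
        deps.enqueue_eval_job(job_id, obs["project_id"], s3_path, delay=0)
        return job_id

    with ThreadPoolExecutor() as pool:
        return list(pool.map(dispatch, matching))


deps = FakeDeps()
observation = {"project_id": "p1", "span_id": "s1", "environment": "prod"}
configs = [
    Config("a"),
    Config("b", matches=lambda o: o["environment"] == "staging"),  # filtered out
    Config("c", sampling_rate=0.0),                                # sampled out
    Config("d"),
]
job_ids = schedule_observation_evals(observation, configs, deps)
```

In this run only configs `a` and `d` qualify, so the observation is uploaded once and two jobs are enqueued against the same path.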

Filter Evaluation Detail

The filter evaluation uses an in-memory filter service that maps filter column identifiers to observation field values. For experiment configs, the filter result is ANDed with a root-span check:

FUNCTION evaluateFilter(observation, config):
    isExperiment = (config.targetObject == "experiment")
    isRootSpan = (observation.span_id == observation.experiment_item_root_span_id)
    isEmptyFilter = (config.filter is null OR empty)

    filterMatch = isEmptyFilter ? true : InMemoryFilterService.evaluate(observation, config.filter)

    IF isExperiment:
        RETURN filterMatch AND isRootSpan
    ELSE:
        RETURN filterMatch
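
The same logic can be sketched in runnable form with a small stand-in for the in-memory filter service. The operator set, the dictionary-based condition shape, and the choice to AND all conditions together are assumptions made for this example; the real filter service supports more column types and operators than shown here.

```python
from typing import Any, Callable, Dict, List, Optional

# Illustrative operator table; not the real service's operator set.
OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "=": lambda actual, expected: actual == expected,
    "<>": lambda actual, expected: actual != expected,
    "any of": lambda actual, expected: actual in expected,
    "contains": lambda actual, expected: isinstance(actual, str) and expected in actual,
}


def evaluate_conditions(obs: dict, conditions: List[dict]) -> bool:
    """Assumption: every condition must hold (logical AND)."""
    return all(
        OPERATORS[c["operator"]](obs.get(c["column"]), c["value"])
        for c in conditions
    )


def evaluate_filter(obs: dict, target_object: str,
                    conditions: Optional[List[dict]]) -> bool:
    # A missing or empty filter matches every observation.
    filter_match = True if not conditions else evaluate_conditions(obs, conditions)
    if target_object == "experiment":
        span_id = obs.get("span_id")
        # Experiment configs only apply to the root span of the experiment item.
        is_root_span = (span_id is not None
                        and span_id == obs.get("experiment_item_root_span_id"))
        return filter_match and is_root_span
    return filter_match


root = {"span_id": "s1", "experiment_item_root_span_id": "s1", "environment": "prod"}
child = {"span_id": "s2", "experiment_item_root_span_id": "s1", "environment": "prod"}
prod_only = [{"column": "environment", "operator": "=", "value": "prod"}]
```

Here `root` passes an experiment-targeted config with the `prod_only` filter, while `child` fails the root-span check even though its environment matches.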

Related Pages

Implemented By

Implementation:Langfuse_Langfuse_ScheduleObservationEvals
