
Workflow:Langfuse Dataset experiment pipeline

From Leeroopedia
Knowledge Sources
Domains: LLM_Ops, Experimentation, Evaluation, Datasets
Last Updated: 2026-02-14 05:00 GMT

Overview

End-to-end process for running prompt experiments against datasets in Langfuse, from configuration validation through LLM execution on each dataset item to score aggregation and results visualization.

Description

This workflow describes how Langfuse enables users to systematically test prompt changes against curated datasets. Users configure an experiment by selecting a prompt version, a dataset, and a model configuration. The system validates the configuration, creates a dataset run record, and enqueues the experiment for asynchronous execution. The worker processes each dataset item by substituting item data into the prompt, calling the configured LLM, capturing the generation as a traced observation, scheduling automated evaluations, and linking results back to the dataset run. Results are queried and aggregated across ClickHouse and PostgreSQL for comparison across experiment runs.

Usage

Execute this workflow when you want to systematically evaluate how a prompt change affects output quality across a standardized set of test cases. This is used for A/B testing prompt versions, comparing model configurations, validating prompt changes before production deployment, and building regression test suites for LLM applications.

Execution Steps

Step 1: Experiment Configuration and Validation

The user selects a prompt, dataset, and model configuration through the experiment UI. The system validates the configuration by resolving the prompt to its production version, extracting prompt variables, loading dataset items, and checking that dataset item fields can satisfy the prompt's variable requirements. The validation returns the total item count and a map of resolvable variables.

Key considerations:

  • The prompt is resolved to a specific version (usually production label)
  • Dataset items can be filtered by version for reproducible experiments
  • Variable mapping validates that dataset item input fields match prompt variable names
  • The LLM API key configuration is verified for the selected provider and model
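The variable-mapping check in Step 1 can be sketched as follows. This is an illustrative reconstruction, not Langfuse's actual implementation; the `{{variable}}` placeholder syntax matches Langfuse prompts, but the function names and types are assumptions.

```typescript
type DatasetItem = { input: Record<string, unknown> };

/** Extract unique `{{variable}}` placeholders from a prompt template. */
function extractPromptVariables(template: string): string[] {
  const matches = template.match(/\{\{\s*([\w.-]+)\s*\}\}/g) ?? [];
  return [...new Set(matches.map((m) => m.replace(/[{}\s]/g, "")))];
}

/** Return the total item count and which prompt variables every item can satisfy. */
function validateVariableMapping(
  template: string,
  items: DatasetItem[],
): { totalItems: number; resolvable: Record<string, boolean> } {
  const variables = extractPromptVariables(template);
  const resolvable: Record<string, boolean> = {};
  for (const v of variables) {
    // A variable is resolvable only if every dataset item's input provides it.
    resolvable[v] = items.every((item) => v in item.input);
  }
  return { totalItems: items.length, resolvable };
}
```

A variable that some items cannot satisfy would surface here, before any run record is created or any LLM call is made.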

Step 2: Experiment Run Creation

A DatasetRuns record is created in PostgreSQL with metadata capturing the full experiment configuration: prompt ID, provider, model, model parameters, experiment name, run name, dataset version, and structured output schema. The experiment is then enqueued to the ExperimentCreateQueue for asynchronous processing.

Key considerations:

  • Run metadata preserves the exact configuration for reproducibility
  • Structured output schemas are stored for experiments requiring JSON-mode LLM output
  • The queue supports 10 retry attempts with exponential backoff (10-second initial delay)
  • Experiment names and run names enable organized comparison in the UI

Step 3: Dataset Item Processing

The worker iterates over all active dataset items (optionally filtered by version) that have valid variable formats and have not already been processed in this run. For each item, a dataset-run-item-create event is generated with a deterministic trace ID derived from the run ID and dataset item ID, ensuring consistent linking.

Key considerations:

  • Items already processed in this run are skipped (deduplication)
  • Deterministic trace IDs use W3C trace ID format for cross-system compatibility
  • Items with invalid variable formats are filtered out during the processing loop
  • The dataset item's input, expected output, and metadata are all available for variable substitution
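A deterministic W3C-format trace ID can be derived by hashing the run ID and dataset item ID together, as sketched below. The hash construction is illustrative; Langfuse's actual derivation may differ, but the key properties are the same: the same (run, item) pair always yields the same 32-hex-character ID.

```typescript
import { createHash } from "node:crypto";

/** Derive a stable W3C trace ID (16 bytes = 32 hex chars) from run + item IDs. */
function deterministicTraceId(runId: string, itemId: string): string {
  return createHash("sha256")
    .update(`${runId}:${itemId}`)
    .digest("hex")
    .slice(0, 32); // truncate SHA-256 hex (64 chars) to W3C trace ID length
}
```

Determinism is what makes the deduplication above possible: re-running the worker for the same run and item targets the same trace rather than creating a duplicate.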

Step 4: LLM Execution per Item

For each dataset item, the prompt variables are replaced with values from the dataset item's input fields. The compiled prompt is sent to the configured LLM via the unified fetchLLMCompletion interface, which supports OpenAI, Anthropic, Google (Vertex AI and AI Studio), Azure, and Bedrock. The LLM call is instrumented with an internal trace sink that captures the generation details (input, output, metadata) as a Langfuse trace in the project.

Key considerations:

  • Internal traces are created with environment "langfuse-prompt-experiment" to distinguish from user traces
  • Trace metadata includes dataset ID, experiment name, prompt reference, and item metadata
  • Structured output schemas (JSON mode) are passed to the LLM when configured
  • Generation details (observation ID, input, output) are captured via callback for linking
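The per-item prompt compilation can be sketched as a simple placeholder substitution, assuming Langfuse's `{{variable}}` prompt syntax. The `fetchLLMCompletion` call itself is not reproduced here; only the compilation step that precedes it is shown, and the function name below is an assumption.

```typescript
/** Substitute dataset item input fields into `{{variable}}` placeholders. */
function compilePrompt(
  template: string,
  input: Record<string, string>,
): string {
  return template.replace(/\{\{\s*([\w.-]+)\s*\}\}/g, (match, name) =>
    // Leave placeholders with no matching input field untouched.
    name in input ? input[name] : match,
  );
}
```

The compiled string is what gets sent to the provider; leaving unmatched placeholders intact (rather than substituting an empty string) makes missing variables visible in the captured trace input.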

Step 5: Observation Evaluation Scheduling

After a successful LLM call, the system checks for observation-level evaluation configurations and schedules evaluations for the generated observation. This allows automated scoring (e.g., LLM-as-judge) to run on each experiment output, providing quantitative quality metrics alongside the raw outputs.

Key considerations:

  • Observation evaluation configs are fetched once per experiment for efficiency
  • Evaluation scheduling reuses the same pipeline as production trace evaluations
  • Scores from evaluations are linked to both the observation and the dataset run item
  • Failed LLM calls skip evaluation scheduling
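The scheduling gate in Step 5 amounts to a cross product of the once-fetched evaluation configs with the successful item results, skipping failures. The types and function below are illustrative assumptions, not Langfuse's internal job schema.

```typescript
type EvalConfig = { id: string };
type ItemResult = { observationId: string; ok: boolean };

/** Pair each eval config with each successful observation; failed calls are skipped. */
function evaluationsToSchedule(
  configs: EvalConfig[], // fetched once per experiment, not per item
  results: ItemResult[],
): Array<{ configId: string; observationId: string }> {
  return results
    .filter((r) => r.ok)
    .flatMap((r) =>
      configs.map((c) => ({ configId: c.id, observationId: r.observationId })),
    );
}
```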

Step 6: Dataset Run Item Upsert

A dataset run item upsert job is queued with a 30-second delay to allow the LLM call trace to settle in ClickHouse before post-processing. This job links the trace and observation to the dataset run item record, enabling the results to be queried alongside the original dataset item data.

Key considerations:

  • The 30-second delay prevents race conditions with ClickHouse eventual consistency
  • Upsert jobs include trace ID, dataset item ID, observation ID, and version metadata
  • The queue supports 5 retry attempts with exponential backoff
  • Failed upserts are retried independently of the main experiment execution
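A minimal sketch of the delayed upsert enqueue, assuming a hypothetical job payload shape; the field names and 30-second constant mirror the description above but are not Langfuse's actual job schema.

```typescript
type UpsertJob = {
  traceId: string;
  datasetItemId: string;
  observationId: string;
  delayMs: number;
};

/** Build an upsert job delayed long enough for the trace to settle in ClickHouse. */
function buildUpsertJob(
  traceId: string,
  datasetItemId: string,
  observationId: string,
): UpsertJob {
  // 30-second delay guards against ClickHouse eventual-consistency races.
  return { traceId, datasetItemId, observationId, delayMs: 30_000 };
}
```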

Step 7: Results Aggregation and Visualization

Experiment results are queried through the dataset router's metrics and run items procedures. Metrics are aggregated from ClickHouse's dataset_run_items table, including run item counts, average cost, total cost, and average latency. Scores from both trace-level and observation-level evaluations are fetched and aggregated using statistical functions (mean, median, min, max for numeric; category counts for categorical). Results are presented for comparison across experiment runs.

Key considerations:

  • Two-stage score aggregation combines trace-level and run-item-level scores
  • Statistical aggregation (mean, median, min, max) enables quantitative comparison
  • Filtering and pagination support large datasets with many items and runs
  • Cross-run comparison enables side-by-side evaluation of prompt changes
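The two aggregation modes described above can be sketched as follows: mean/median/min/max for numeric scores, per-value counts for categorical ones. This is a standalone illustration of the statistics, not the ClickHouse queries Langfuse actually runs.

```typescript
/** Aggregate numeric scores: mean, median, min, max. */
function aggregateNumeric(values: number[]) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return {
    mean: values.reduce((s, v) => s + v, 0) / values.length,
    median:
      sorted.length % 2 === 1
        ? sorted[mid]
        : (sorted[mid - 1] + sorted[mid]) / 2,
    min: sorted[0],
    max: sorted[sorted.length - 1],
  };
}

/** Aggregate categorical scores as per-category counts. */
function aggregateCategorical(values: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const v of values) counts[v] = (counts[v] ?? 0) + 1;
  return counts;
}
```

Running both aggregations per run, then laying the results side by side across runs, is what enables the quantitative prompt-version comparison this workflow exists for.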

Execution Diagram

GitHub URL

Workflow Repository