Principle: Langfuse Experiment Results Aggregation
| Knowledge Sources | |
|---|---|
| Domains | LLM Evaluation, Data Analytics |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Results aggregation is the process of collecting, grouping, and summarizing evaluation scores and operational metrics across all items in a dataset run, producing per-run aggregate statistics that enable meaningful comparison between experiment configurations.
Description
After an experiment completes and evaluations have been scored, the platform needs to present results in a form that supports decision-making. Raw per-item scores are useful for debugging individual cases, but comparing two experiment runs (e.g., GPT-4 at temperature 0.3 vs. Claude at temperature 0.7) requires aggregate metrics that summarize performance across the entire dataset.
Results aggregation addresses this by computing summary statistics at two levels:
- Score aggregation: Individual scores are grouped by a composite key that combines the normalized score name, score source (API, annotation, or automated evaluator), and data type (numeric, categorical, boolean). For numeric scores, the aggregate is an arithmetic average. For categorical and boolean scores, the aggregate is a frequency distribution (value counts).
- Run-level metrics: Operational metrics such as count of run items, average total cost, total cost, and average latency are computed from the underlying trace and observation data, providing a cost-performance tradeoff view.
This dual-level aggregation enables users to answer questions like: "Which model configuration produces the highest average correctness score at the lowest cost?"
Usage
Results aggregation is used when:
- The experiment results page needs to display per-run summary metrics in a comparison table.
- A user wants to compare multiple runs within a dataset to identify the best-performing configuration.
- Aggregated scores need to be computed for display in dashboard widgets or exported for analysis.
Theoretical Basis
Score Grouping Key
Scores are grouped by a composite key with three components:
- Normalized name: The score name with hyphens and dots replaced by underscores, because hyphens and dots are reserved as delimiters in the key format.
- Source: The origin of the score (e.g., "API", "ANNOTATION", "EVAL").
- Data type: One of "NUMERIC", "CATEGORICAL", or "BOOLEAN".
The key format is: {normalizedName}-{source}-{dataType}
This key ensures that scores with the same name but different sources or data types are aggregated separately, preventing accidental mixing of human annotations with automated evaluator scores or numeric scores with categorical ones.
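The normalization and key composition can be sketched in TypeScript as follows. The helper names normalize and composeKey mirror the pseudocode later in this section; the exact implementation is illustrative, not the platform's actual export:

```typescript
// Hyphens and dots are reserved as delimiters in the key format,
// so they are replaced with underscores in the score name.
function normalize(name: string): string {
  return name.replace(/[-.]/g, "_");
}

// Compose the grouping key: {normalizedName}-{source}-{dataType}
function composeKey(name: string, source: string, dataType: string): string {
  return `${normalize(name)}-${source}-${dataType}`;
}

// Example: a score named "answer.correctness" from an automated evaluator
const key = composeKey("answer.correctness", "EVAL", "NUMERIC");
// key === "answer_correctness-EVAL-NUMERIC"
```

Because the normalized name, source, and data type all appear in the key, a human-annotated "correctness" score and an evaluator-produced "correctness" score land in separate groups.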
Numeric Aggregation
For scores with data type "NUMERIC", the aggregation computes:
- values: The array of all numeric score values (using 0 as a fallback for null values).
- average: The arithmetic mean of all values: sum(values) / count(values).
When there is exactly one value, the aggregate also preserves the individual score's metadata (comment, id, hasMetadata, timestamp) for direct access.
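A minimal TypeScript sketch of the numeric aggregation, assuming a simplified NumericScore shape (the real score records carry additional fields such as hasMetadata and timestamp):

```typescript
// Illustrative score shape; only the fields needed for this sketch.
interface NumericScore {
  value: number | null;
  id?: string;
  comment?: string;
}

function aggregateNumeric(scores: NumericScore[]) {
  // Null values fall back to 0 before averaging.
  const values = scores.map((s) => s.value ?? 0);
  const average = values.reduce((a, b) => a + b, 0) / values.length;
  if (scores.length === 1) {
    // A single-score group preserves that score's metadata for direct access.
    const { id, comment } = scores[0];
    return { type: "NUMERIC", values, average, id, comment };
  }
  return { type: "NUMERIC", values, average };
}

// Example: [1, null, 2] aggregates to values [1, 0, 2] with average 1
const agg = aggregateNumeric([{ value: 1 }, { value: null }, { value: 2 }]);
```

Note that the 0-fallback pulls the average down when null values are present, which is a deliberate tradeoff in favor of never dropping items from the run.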
Categorical and Boolean Aggregation
Boolean scores are treated as categorical for aggregation purposes, since both require value counting rather than averaging.
For categorical/boolean scores, the aggregation computes:
- values: The array of all string values (using "n/a" as a fallback for null stringValue).
- valueCounts: An array of { value, count } pairs representing the frequency distribution.
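The frequency-distribution step can be sketched as follows (aggregateCategorical is an illustrative name, not the platform's actual function):

```typescript
// Aggregate categorical/boolean string values into a frequency distribution.
function aggregateCategorical(stringValues: (string | null)[]) {
  // Null stringValue falls back to "n/a".
  const values = stringValues.map((v) => v ?? "n/a");
  // Count occurrences; Map preserves first-seen insertion order.
  const counts = new Map<string, number>();
  for (const v of values) counts.set(v, (counts.get(v) ?? 0) + 1);
  const valueCounts = [...counts].map(([value, count]) => ({ value, count }));
  return { type: "CATEGORICAL", values, valueCounts };
}

// Example: ["yes", "yes", null] -> yes: 2, n/a: 1
const catAgg = aggregateCategorical(["yes", "yes", null]);
```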
Run-Level Metrics
Beyond score aggregation, the results pipeline also computes per-run operational metrics by querying ClickHouse:
- countRunItems: Number of dataset run items in the run.
- avgTotalCost: Average cost per item (across all token usage).
- totalCost: Sum of all costs for the run.
- avgLatency: Average end-to-end latency per item.
These metrics are fetched via a dedicated ClickHouse query and merged with the score aggregates.
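An illustrative shape for one merged per-run record: the field names follow the metrics listed above, while the RunSummary type name and exact field types are assumptions made for this sketch:

```typescript
// Assumed shape of a merged per-run result: ClickHouse operational
// metrics plus the two levels of score aggregates.
interface RunSummary {
  id: string;
  name: string;
  countRunItems: number;
  avgTotalCost: number; // average cost per item, across all token usage
  totalCost: number;    // sum of all costs for the run
  avgLatency: number;   // average end-to-end latency per item
  scores: Record<string, unknown>;    // keyed by {name}-{source}-{dataType}
  runScores: Record<string, unknown>;
}

// Example record for a hypothetical run of 50 items costing $1.25 total
const example: RunSummary = {
  id: "run_1",
  name: "gpt-4-temp-0.3",
  countRunItems: 50,
  totalCost: 1.25,
  avgTotalCost: 1.25 / 50,
  avgLatency: 2.4,
  scores: {},
  runScores: {},
};
```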
In pseudocode, score aggregation proceeds as follows:

```
FUNCTION aggregateScores(scores):
  grouped = GROUP scores BY composeKey(normalize(score.name), score.source, score.dataType)
  result = {}
  FOR EACH (key, groupScores) IN grouped:
    aggregateType = resolveType(groupScores[0].dataType)
    IF aggregateType == "NUMERIC":
      values = groupScores.map(s => s.value ?? 0)
      result[key] = {
        type: "NUMERIC",
        values: values,
        average: sum(values) / len(values),
      }
    ELSE: // CATEGORICAL or BOOLEAN
      values = groupScores.map(s => s.stringValue ?? "n/a")
      valueCounts = countOccurrences(values)
      result[key] = {
        type: "CATEGORICAL",
        values: values,
        valueCounts: valueCounts,
      }
  RETURN result
```
Run-level metric assembly, also in pseudocode:

```
FUNCTION getRunMetrics(projectId, datasetId, runIds):
  runsMetrics = queryClickHouse(projectId, datasetId, runIds)
  traceScores = getTraceScoresForRuns(projectId, runIds)
  runScores = getScoresForRuns(projectId, runIds)
  RETURN runsMetrics.map(run => {
    id: run.id,
    name: run.name,
    countRunItems, avgTotalCost, totalCost, avgLatency, // from runsMetrics
    scores: aggregateScores(traceScores.filter(s => s.runId == run.id)),
    runScores: aggregateScores(runScores.filter(s => s.runId == run.id)),
  })
```