
Principle:Langfuse Experiment Results Aggregation

From Leeroopedia
Domains LLM Evaluation, Data Analytics
Last Updated 2026-02-14 00:00 GMT

Overview

Results aggregation is the process of collecting, grouping, and summarizing evaluation scores and operational metrics across all items in a dataset run, producing per-run aggregate statistics that enable meaningful comparison between experiment configurations.

Description

After an experiment completes and evaluations have been scored, the platform needs to present results in a form that supports decision-making. Raw per-item scores are useful for debugging individual cases, but comparing two experiment runs (e.g., GPT-4 at temperature 0.3 vs. Claude at temperature 0.7) requires aggregate metrics that summarize performance across the entire dataset.

Results aggregation addresses this by computing summary statistics at two levels:

  1. Score aggregation: Individual scores are grouped by a composite key that combines the normalized score name, score source (API, annotation, or automated evaluator), and data type (numeric, categorical, boolean). For numeric scores, the aggregate is an arithmetic average. For categorical and boolean scores, the aggregate is a frequency distribution (value counts).
  2. Run-level metrics: Operational metrics such as count of run items, average total cost, total cost, and average latency are computed from the underlying trace and observation data, providing a cost-performance tradeoff view.

This dual-level aggregation enables users to answer questions like: "Which model configuration produces the highest average correctness score at the lowest cost?"
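For instance, a minimal TypeScript sketch (with hypothetical run summaries and a hypothetical tie-breaking rule, not actual Langfuse API objects) shows how the two aggregate levels combine into a selection decision:

```typescript
// Hypothetical per-run aggregates, as produced by results aggregation.
interface RunSummary {
  name: string;
  avgCorrectness: number; // average of a numeric "correctness" score
  avgTotalCost: number;   // average cost per dataset item
}

const runs: RunSummary[] = [
  { name: "gpt-4-temp-0.3", avgCorrectness: 0.91, avgTotalCost: 0.012 },
  { name: "claude-temp-0.7", avgCorrectness: 0.89, avgTotalCost: 0.004 },
];

// Rank by score, breaking near-ties (within 3 points) in favor of lower cost.
const best = [...runs].sort((a, b) =>
  Math.abs(a.avgCorrectness - b.avgCorrectness) < 0.03
    ? a.avgTotalCost - b.avgTotalCost
    : b.avgCorrectness - a.avgCorrectness
)[0];

console.log(best.name); // "claude-temp-0.7": nearly as accurate, 3x cheaper
```

The tie-breaking threshold is arbitrary; the point is that both the quality aggregate and the cost aggregate are needed to make the call.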

Usage

Results aggregation is used when:

  • The experiment results page needs to display per-run summary metrics in a comparison table.
  • A user wants to compare multiple runs within a dataset to identify the best-performing configuration.
  • Aggregated scores need to be computed for display in dashboard widgets or exported for analysis.

Theoretical Basis

Score Grouping Key

Scores are grouped by a composite key with three components:

  1. Normalized name: The score name with hyphens and dots replaced by underscores, since both characters are reserved as delimiters in the key format.
  2. Source: The origin of the score (e.g., "API", "ANNOTATION", "EVAL").
  3. Data type: One of "NUMERIC", "CATEGORICAL", or "BOOLEAN".

The key format is: {normalizedName}-{source}-{dataType}

This key ensures that scores with the same name but different sources or data types are aggregated separately, preventing accidental mixing of human annotations with automated evaluator scores or numeric scores with categorical ones.
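A sketch of this key construction in TypeScript (function and type names here are illustrative, not Langfuse's actual internals):

```typescript
type ScoreSource = "API" | "ANNOTATION" | "EVAL";
type ScoreDataType = "NUMERIC" | "CATEGORICAL" | "BOOLEAN";

// Hyphens and dots are reserved as delimiters in the key, so replace them.
function normalizeName(name: string): string {
  return name.replace(/[-.]/g, "_");
}

function composeKey(
  name: string,
  source: ScoreSource,
  dataType: ScoreDataType
): string {
  return `${normalizeName(name)}-${source}-${dataType}`;
}

console.log(composeKey("answer-correctness.v2", "EVAL", "NUMERIC"));
// → "answer_correctness_v2-EVAL-NUMERIC"
```

Because the normalized name can no longer contain hyphens, the key can later be split unambiguously on its last two hyphens to recover the three components.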

Numeric Aggregation

For scores with data type "NUMERIC", the aggregation computes:

  • values: The array of all numeric score values (using 0 as a fallback for null values).
  • average: The arithmetic mean of all values: sum(values) / count(values).

When there is exactly one value, the aggregate also preserves the individual score's metadata (comment, id, hasMetadata, timestamp) for direct access.
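The numeric branch, including the single-score metadata passthrough, might look like this TypeScript sketch (field names and shapes are assumptions):

```typescript
// Illustrative score shape; real Langfuse scores carry more fields.
interface Score {
  id: string;
  value: number | null;
  comment?: string;
}

interface NumericAggregate {
  type: "NUMERIC";
  values: number[];
  average: number;
  // Present only when the group holds exactly one score.
  id?: string;
  comment?: string;
}

function aggregateNumeric(scores: Score[]): NumericAggregate {
  const values = scores.map((s) => s.value ?? 0); // 0 as the null fallback
  const average = values.reduce((sum, v) => sum + v, 0) / values.length;
  const single =
    scores.length === 1 ? { id: scores[0].id, comment: scores[0].comment } : {};
  return { type: "NUMERIC", values, average, ...single };
}

const agg = aggregateNumeric([
  { id: "a", value: 0.8 },
  { id: "b", value: null },
]);
console.log(agg.average); // → 0.4 (the null score counted as 0)
```

Note that the 0-fallback pulls the average down when scores are missing; whether that is desirable depends on how null scores should be interpreted.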

Categorical and Boolean Aggregation

Boolean scores are treated as categorical for aggregation purposes, since both require value counting rather than averaging.

For categorical/boolean scores, the aggregation computes:

  • values: The array of all string values (using "n/a" as a fallback for null stringValue).
  • valueCounts: An array of { value, count } pairs representing the frequency distribution.
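A TypeScript sketch of the categorical/boolean branch (the aggregate shape is an assumption based on the description above):

```typescript
interface CategoricalAggregate {
  type: "CATEGORICAL";
  values: string[];
  valueCounts: { value: string; count: number }[];
}

function aggregateCategorical(
  stringValues: (string | null)[]
): CategoricalAggregate {
  const values = stringValues.map((v) => v ?? "n/a"); // "n/a" as null fallback
  // Count occurrences, preserving first-seen order of each distinct value.
  const counts = new Map<string, number>();
  for (const v of values) counts.set(v, (counts.get(v) ?? 0) + 1);
  const valueCounts = [...counts].map(([value, count]) => ({ value, count }));
  return { type: "CATEGORICAL", values, valueCounts };
}

console.log(aggregateCategorical(["pass", "fail", "pass", null]).valueCounts);
// → [{ value: "pass", count: 2 }, { value: "fail", count: 1 }, { value: "n/a", count: 1 }]
```

Boolean scores would enter this path with stringValue set to something like "True"/"False", which is why counting rather than averaging applies to both types.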

Run-Level Metrics

Beyond score aggregation, the results pipeline also computes per-run operational metrics by querying ClickHouse:

  • countRunItems: Number of dataset run items in the run.
  • avgTotalCost: Average cost per item (across all token usage).
  • totalCost: Sum of all costs for the run.
  • avgLatency: Average end-to-end latency per item.

These metrics are fetched via a dedicated ClickHouse query and merged with the score aggregates.
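In production these aggregates come back from ClickHouse, but the arithmetic itself is simple. A TypeScript sketch computing the same four metrics from in-memory per-item rows (field names are illustrative):

```typescript
// Hypothetical per-item row, standing in for the ClickHouse query result.
interface RunItemRow {
  totalCost: number; // cost across all token usage for the item
  latencyMs: number; // end-to-end latency for the item
}

function computeRunMetrics(items: RunItemRow[]) {
  const totalCost = items.reduce((sum, i) => sum + i.totalCost, 0);
  const totalLatency = items.reduce((sum, i) => sum + i.latencyMs, 0);
  return {
    countRunItems: items.length,
    totalCost,
    avgTotalCost: totalCost / items.length,
    avgLatency: totalLatency / items.length,
  };
}

const metrics = computeRunMetrics([
  { totalCost: 0.01, latencyMs: 1200 },
  { totalCost: 0.03, latencyMs: 800 },
]);
console.log(metrics.countRunItems, metrics.avgLatency); // 2 1000
```

Pushing these reductions into ClickHouse rather than application code matters once runs span thousands of items, since only the four aggregates cross the wire.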

FUNCTION aggregateScores(scores):
    grouped = GROUP scores BY composeKey(normalize(score.name), score.source, score.dataType)

    result = {}
    FOR EACH (key, groupScores) IN grouped:
        aggregateType = resolveType(groupScores[0].dataType)

        IF aggregateType == "NUMERIC":
            values = groupScores.map(s => s.value ?? 0)
            result[key] = {
                type: "NUMERIC",
                values: values,
                average: sum(values) / len(values),
            }
        ELSE:  // CATEGORICAL or BOOLEAN
            values = groupScores.map(s => s.stringValue ?? "n/a")
            valueCounts = countOccurrences(values)
            result[key] = {
                type: "CATEGORICAL",
                values: values,
                valueCounts: valueCounts,
            }

    RETURN result

FUNCTION getRunMetrics(projectId, datasetId, runIds):
    runsMetrics = queryClickHouse(projectId, datasetId, runIds)
    traceScores = getTraceScoresForRuns(projectId, runIds)
    runScores = getScoresForRuns(projectId, runIds)

    RETURN runsMetrics.map(run => {
        id, name, countRunItems, avgTotalCost, totalCost, avgLatency,
        scores: aggregateScores(traceScores.filter(run.id)),
        runScores: aggregateScores(runScores.filter(run.id)),
    })
