
Implementation:Langfuse AggregateScores

From Leeroopedia
Knowledge Sources
Domains LLM Evaluation, Data Analytics
Last Updated 2026-02-14 00:00 GMT

Overview

A concrete tool provided by Langfuse for aggregating experiment scores by composite key and computing per-run metrics.

Description

The results aggregation system consists of two main components:

1. aggregateScores (pure function)

A generic function that takes an array of score objects and produces a ScoreAggregate -- a record keyed by composite score key with either numeric or categorical aggregate values.

The function operates in two steps:

  1. Grouping: Scores are grouped by a composite key generated by composeAggregateScoreKey, which combines the normalized name (hyphens and dots replaced with underscores via normalizeScoreName), the source type, and the data type into a string like correctness-EVAL-NUMERIC.
  2. Aggregation: For each group, resolveAggregateType maps the data type to either "NUMERIC" or "CATEGORICAL" (boolean scores are treated as categorical). Numeric groups produce an average; categorical groups produce value counts. When a group contains exactly one score, the aggregate preserves the individual score's comment, id, hasMetadata, and timestamp fields for direct access in the UI.
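The two steps above can be sketched in TypeScript. This is a simplified illustration, not the actual implementation: the types and helper names (`Score`, `Aggregate`, `normalizeName`, `aggregate`) are assumptions, and the single-score metadata passthrough is omitted for brevity.

```typescript
// Simplified score shape; the real ScoreToAggregate carries more fields.
type Score = {
  name: string;
  source: string;
  dataType: "NUMERIC" | "CATEGORICAL" | "BOOLEAN";
  value: number | null;
  stringValue: string | null;
};

type Aggregate =
  | { type: "NUMERIC"; values: number[]; average: number }
  | {
      type: "CATEGORICAL";
      values: string[];
      valueCounts: { value: string; count: number }[];
    };

// Hyphens and dots become underscores, as normalizeScoreName does.
const normalizeName = (name: string): string => name.replace(/[-.]/g, "_");

function aggregate(scores: Score[]): Record<string, Aggregate> {
  // Step 1: group by the composite key {normalizedName}-{source}-{dataType}.
  const groups = new Map<string, Score[]>();
  for (const s of scores) {
    const key = `${normalizeName(s.name)}-${s.source}-${s.dataType}`;
    const bucket = groups.get(key) ?? [];
    bucket.push(s);
    groups.set(key, bucket);
  }

  // Step 2: reduce each group. NUMERIC groups get an average; BOOLEAN
  // is treated as categorical, so those groups get value counts instead.
  const result: Record<string, Aggregate> = {};
  groups.forEach((group, key) => {
    if (group[0].dataType === "NUMERIC") {
      const values = group.map((s) => s.value ?? 0);
      const average = values.reduce((a, b) => a + b, 0) / values.length;
      result[key] = { type: "NUMERIC", values, average };
    } else {
      const values = group.map((s) => s.stringValue ?? "");
      const counts = new Map<string, number>();
      for (const v of values) counts.set(v, (counts.get(v) ?? 0) + 1);
      const valueCounts: { value: string; count: number }[] = [];
      counts.forEach((count, value) => valueCounts.push({ value, count }));
      result[key] = { type: "CATEGORICAL", values, valueCounts };
    }
  });
  return result;
}
```

Note how the composite key keeps scores with the same name but different sources or data types in separate groups, so a numeric eval score never averages together with a same-named annotation score.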

2. datasetRouter.runsByDatasetIdMetrics (tRPC query)

A tRPC query procedure that fetches per-run operational metrics from ClickHouse and combines them with aggregated trace scores and run scores. It:

  1. Calls getDatasetRunsTableMetricsCh to get countRunItems, avgTotalCost, totalCost, and avgLatency for each run.
  2. Fetches trace-level scores via getTraceScoresForDatasetRuns and run-level scores via getScoresForDatasetRuns.
  3. Maps each run to an object containing the metrics plus two separate ScoreAggregate objects: one for trace scores and one for run scores.
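Step 3 amounts to a merge of ClickHouse metric rows with two per-run score maps. The sketch below shows that shape only; the type and function names (`RunRow`, `combineRunMetrics`, the `Map` inputs) are illustrative assumptions, not the router's actual internals.

```typescript
// Placeholder for the real ScoreAggregate record type.
type ScoreAggregate = Record<string, unknown>;

// One metric row per run, as returned by the ClickHouse query.
type RunRow = {
  id: string;
  name: string;
  countRunItems: number;
  avgTotalCost: number | null;
  totalCost: number | null;
  avgLatency: number | null;
};

function combineRunMetrics(
  rows: RunRow[],
  traceScoresByRun: Map<string, ScoreAggregate>,
  runScoresByRun: Map<string, ScoreAggregate>,
) {
  return rows.map((row) => ({
    ...row,
    // Trace-level scores, aggregated per run (empty map if none exist).
    scores: traceScoresByRun.get(row.id) ?? {},
    // Run-level scores, kept separate from the trace-level aggregate.
    runScores: runScoresByRun.get(row.id) ?? {},
  }));
}
```

Keeping `scores` and `runScores` as two distinct aggregates lets the UI render trace-derived metrics and run-level evaluations in separate column groups.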

Usage

aggregateScores is used wherever scores need to be summarized: the dataset runs comparison table, experiment detail views, and dashboard widgets. runsByDatasetIdMetrics is the primary tRPC endpoint for the experiment results page, providing all the data needed to render the runs comparison table.

Code Reference

Source Location

  • Repository: langfuse
  • File (aggregateScores): web/src/features/scores/lib/aggregateScores.ts
  • Lines: 72-136
  • File (runsByDatasetIdMetrics): web/src/features/datasets/server/dataset-router.ts
  • Lines: 561-608

Signature

// Score aggregation function
export const aggregateScores = <T extends ScoreToAggregate>(
  scores: T[],
): ScoreAggregate => { ... }

// Supporting functions
export const composeAggregateScoreKey = ({
  name,
  source,
  dataType,
}: {
  name: string;
  source: ScoreSourceType;
  dataType: AggregatableScoreDataType;
}): string => { ... }

export const decomposeAggregateScoreKey = (
  key: string,
): {
  name: string;
  source: ScoreSourceType;
  dataType: AggregatableScoreDataType;
} => { ... }

export const normalizeScoreName = (name: string): string => { ... }

export const resolveAggregateType = (
  dataType: AggregatableScoreDataType,
): "NUMERIC" | "CATEGORICAL" => { ... }

// tRPC query for run metrics
datasetRouter.runsByDatasetIdMetrics: protectedProjectProcedure
  .input(datasetRunTableMetricsSchema)
  .query(async ({ input }) => { ... })

Import

import {
  aggregateScores,
  composeAggregateScoreKey,
  decomposeAggregateScoreKey,
  normalizeScoreName,
  resolveAggregateType,
} from "@/src/features/scores/lib/aggregateScores";

I/O Contract

Inputs (aggregateScores)

Name Type Required Description
scores T[] (extends ScoreToAggregate) Yes Array of score objects. Each must have name (string), source (ScoreSourceType), dataType (AggregatableScoreDataType), value (number or null), stringValue (string or null), comment (string or null), id (string), timestamp (Date or null), and optionally hasMetadata (boolean).

Outputs (aggregateScores)

Name Type Description
ScoreAggregate Record<string, NumericAggregate or CategoricalAggregate> A map from composite score key to aggregate data. Keys have format {normalizedName}-{source}-{dataType}.

NumericAggregate structure:

Field Type Description
type "NUMERIC" Discriminant field.
values number[] All individual score values.
average number Arithmetic mean of all values.
comment string or undefined Present only when there is exactly one score.
id string or undefined Present only when there is exactly one score.
hasMetadata boolean or undefined Present only when there is exactly one score.
timestamp Date or undefined Present only when there is exactly one score.

CategoricalAggregate structure:

Field Type Description
type "CATEGORICAL" Discriminant field.
values string[] All individual string values.
valueCounts Array<{ value: string; count: number }> Frequency distribution of values.
comment string or undefined Present only when there is exactly one score.
id string or undefined Present only when there is exactly one score.
hasMetadata boolean or undefined Present only when there is exactly one score.
timestamp Date or undefined Present only when there is exactly one score.
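The two aggregate shapes above form a discriminated union on the `type` field. A sketch under the assumption that the exported types mirror the tables (the actual declarations in `aggregateScores.ts` may differ in detail):

```typescript
// Metadata fields present only when a group contained exactly one score.
type SingleScoreMeta = {
  comment?: string;
  id?: string;
  hasMetadata?: boolean;
  timestamp?: Date;
};

type NumericAggregate = SingleScoreMeta & {
  type: "NUMERIC";
  values: number[];
  average: number;
};

type CategoricalAggregate = SingleScoreMeta & {
  type: "CATEGORICAL";
  values: string[];
  valueCounts: { value: string; count: number }[];
};

// Keys have the format {normalizedName}-{source}-{dataType}.
type ScoreAggregate = Record<string, NumericAggregate | CategoricalAggregate>;

// Checking the discriminant narrows to the right shape, so consumers
// can read `average` or `valueCounts` without casts.
function describe(agg: NumericAggregate | CategoricalAggregate): string {
  return agg.type === "NUMERIC"
    ? `avg=${agg.average}`
    : `categories=${agg.valueCounts.length}`;
}
```

This is the pattern UI code follows when rendering a cell: switch on `type`, then read the branch-specific fields.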

Inputs (runsByDatasetIdMetrics)

Name Type Required Description
projectId string Yes The project ID.
datasetId string Yes The dataset ID to fetch runs for.
runIds string[] No Optional list of specific run IDs to filter.
filter FilterState No Optional filter conditions for the runs query.

Outputs (runsByDatasetIdMetrics)

Name Type Description
runs Array<RunMetrics> Array of per-run metric objects.
runs[].id string The dataset run ID.
runs[].name string The dataset run name.
runs[].countRunItems number Number of dataset run items in this run.
runs[].avgTotalCost number or null Average cost per item.
runs[].totalCost number or null Total cost for the entire run.
runs[].avgLatency number or null Average latency per item in milliseconds.
runs[].scores ScoreAggregate Aggregated trace-level scores for this run.
runs[].runScores ScoreAggregate Aggregated run-level scores for this run.

Usage Examples

Aggregating Numeric Scores

import { aggregateScores } from "@/src/features/scores/lib/aggregateScores";

const scores = [
  { name: "correctness", source: "EVAL", dataType: "NUMERIC", value: 0.8, stringValue: null, comment: null, id: "s1", timestamp: new Date() },
  { name: "correctness", source: "EVAL", dataType: "NUMERIC", value: 0.9, stringValue: null, comment: null, id: "s2", timestamp: new Date() },
  { name: "correctness", source: "EVAL", dataType: "NUMERIC", value: 0.7, stringValue: null, comment: null, id: "s3", timestamp: new Date() },
];

const result = aggregateScores(scores);
// result = {
//   "correctness-EVAL-NUMERIC": {
//     type: "NUMERIC",
//     values: [0.8, 0.9, 0.7],
//     average: 0.8,
//     comment: undefined,
//     id: undefined,
//     hasMetadata: undefined,
//     timestamp: undefined,
//   }
// }

Aggregating Categorical Scores

const scores = [
  { name: "sentiment", source: "ANNOTATION", dataType: "CATEGORICAL", value: null, stringValue: "positive", comment: null, id: "s1", timestamp: new Date() },
  { name: "sentiment", source: "ANNOTATION", dataType: "CATEGORICAL", value: null, stringValue: "negative", comment: null, id: "s2", timestamp: new Date() },
  { name: "sentiment", source: "ANNOTATION", dataType: "CATEGORICAL", value: null, stringValue: "positive", comment: null, id: "s3", timestamp: new Date() },
];

const result = aggregateScores(scores);
// result = {
//   "sentiment-ANNOTATION-CATEGORICAL": {
//     type: "CATEGORICAL",
//     values: ["positive", "negative", "positive"],
//     valueCounts: [
//       { value: "positive", count: 2 },
//       { value: "negative", count: 1 },
//     ],
//   }
// }

Fetching Run Metrics via tRPC

const { data } = trpc.datasets.runsByDatasetIdMetrics.useQuery({
  projectId: "proj_abc123",
  datasetId: "ds_def456",
  runIds: ["run_001", "run_002"],
});

// data.runs = [
//   {
//     id: "run_001",
//     name: "gpt4-experiment",
//     countRunItems: 50,
//     avgTotalCost: 0.0042,
//     totalCost: 0.21,
//     avgLatency: 1250,
//     scores: { "correctness-EVAL-NUMERIC": { type: "NUMERIC", average: 0.85, ... } },
//     runScores: { ... },
//   },
//   {
//     id: "run_002",
//     name: "claude-experiment",
//     countRunItems: 50,
//     avgTotalCost: 0.0038,
//     totalCost: 0.19,
//     avgLatency: 980,
//     scores: { "correctness-EVAL-NUMERIC": { type: "NUMERIC", average: 0.88, ... } },
//     runScores: { ... },
//   },
// ]

Composite Key Utilities

import {
  composeAggregateScoreKey,
  decomposeAggregateScoreKey,
  normalizeScoreName,
  getScoreLabelFromKey,
} from "@/src/features/scores/lib/aggregateScores";

// Composing a key
const key = composeAggregateScoreKey({
  name: "answer-quality",
  source: "EVAL",
  dataType: "NUMERIC",
});
// key = "answer_quality-EVAL-NUMERIC"

// Decomposing a key
const { name, source, dataType } = decomposeAggregateScoreKey(key);
// name = "answer_quality", source = "EVAL", dataType = "NUMERIC"

// Getting a display label
const label = getScoreLabelFromKey(key);
// label = "# answer_quality (eval)"  (with numeric icon prefix)

Related Pages

Implements Principle
