
Implementation:Langfuse AggregateScores

From Leeroopedia
Knowledge Sources
Domains LLM Evaluation, Data Analytics
Last Updated 2026-02-14 00:00 GMT

Overview

A concrete tool provided by Langfuse for aggregating experiment scores by composite key and computing per-run metrics.

Description

The results aggregation system consists of two main components:

1. aggregateScores (pure function)

A generic function that takes an array of score objects and produces a ScoreAggregate -- a record keyed by composite score key with either numeric or categorical aggregate values.

The function operates in two steps:

  1. Grouping: Scores are grouped by a composite key generated by composeAggregateScoreKey, which combines the normalized name (hyphens and dots replaced with underscores via normalizeScoreName), the source type, and the data type into a string like correctness-EVAL-NUMERIC.
  2. Aggregation: For each group, resolveAggregateType maps the data type to either "NUMERIC" or "CATEGORICAL" (boolean scores are treated as categorical). Numeric groups produce an average; categorical groups produce value counts. When a group contains exactly one score, the aggregate preserves the individual score's comment, id, hasMetadata, and timestamp fields for direct access in the UI.
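The two steps above can be sketched in TypeScript. This is a simplified illustration, not the actual implementation: the types and helper names (`Score`, `Aggregate`, `normalizeName`, `aggregate`) are assumptions, and the single-score metadata passthrough is omitted for brevity.

```typescript
// Simplified score shape; the real ScoreToAggregate carries more fields.
type Score = {
  name: string;
  source: string;
  dataType: "NUMERIC" | "CATEGORICAL" | "BOOLEAN";
  value: number | null;
  stringValue: string | null;
};

type Aggregate =
  | { type: "NUMERIC"; values: number[]; average: number }
  | {
      type: "CATEGORICAL";
      values: string[];
      valueCounts: { value: string; count: number }[];
    };

// Hyphens and dots become underscores, as normalizeScoreName does.
const normalizeName = (name: string): string => name.replace(/[-.]/g, "_");

function aggregate(scores: Score[]): Record<string, Aggregate> {
  // Step 1: group by the composite key {normalizedName}-{source}-{dataType}.
  const groups = new Map<string, Score[]>();
  for (const s of scores) {
    const key = `${normalizeName(s.name)}-${s.source}-${s.dataType}`;
    const bucket = groups.get(key) ?? [];
    bucket.push(s);
    groups.set(key, bucket);
  }

  // Step 2: reduce each group. NUMERIC groups get an average; BOOLEAN
  // is treated as categorical, so those groups get value counts instead.
  const result: Record<string, Aggregate> = {};
  groups.forEach((group, key) => {
    if (group[0].dataType === "NUMERIC") {
      const values = group.map((s) => s.value ?? 0);
      const average = values.reduce((a, b) => a + b, 0) / values.length;
      result[key] = { type: "NUMERIC", values, average };
    } else {
      const values = group.map((s) => s.stringValue ?? "");
      const counts = new Map<string, number>();
      for (const v of values) counts.set(v, (counts.get(v) ?? 0) + 1);
      const valueCounts: { value: string; count: number }[] = [];
      counts.forEach((count, value) => valueCounts.push({ value, count }));
      result[key] = { type: "CATEGORICAL", values, valueCounts };
    }
  });
  return result;
}
```

Note how the composite key keeps scores with the same name but different sources or data types in separate groups, so a numeric eval score never averages together with a same-named annotation score.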

2. datasetRouter.runsByDatasetIdMetrics (tRPC query)

A tRPC query procedure that fetches per-run operational metrics from ClickHouse and combines them with aggregated trace scores and run scores. It:

  1. Calls getDatasetRunsTableMetricsCh to get countRunItems, avgTotalCost, totalCost, and avgLatency for each run.
  2. Fetches trace-level scores via getTraceScoresForDatasetRuns and run-level scores via getScoresForDatasetRuns.
  3. Maps each run to an object containing the metrics plus two separate ScoreAggregate objects: one for trace scores and one for run scores.
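Step 3 amounts to a merge of ClickHouse metric rows with two per-run score maps. The sketch below shows that shape only; the type and function names (`RunRow`, `combineRunMetrics`, the `Map` inputs) are illustrative assumptions, not the router's actual internals.

```typescript
// Placeholder for the real ScoreAggregate record type.
type ScoreAggregate = Record<string, unknown>;

// One metric row per run, as returned by the ClickHouse query.
type RunRow = {
  id: string;
  name: string;
  countRunItems: number;
  avgTotalCost: number | null;
  totalCost: number | null;
  avgLatency: number | null;
};

function combineRunMetrics(
  rows: RunRow[],
  traceScoresByRun: Map<string, ScoreAggregate>,
  runScoresByRun: Map<string, ScoreAggregate>,
) {
  return rows.map((row) => ({
    ...row,
    // Trace-level scores, aggregated per run (empty map if none exist).
    scores: traceScoresByRun.get(row.id) ?? {},
    // Run-level scores, kept separate from the trace-level aggregate.
    runScores: runScoresByRun.get(row.id) ?? {},
  }));
}
```

Keeping `scores` and `runScores` as two distinct aggregates lets the UI render trace-derived metrics and run-level evaluations in separate column groups.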

Usage

aggregateScores is used wherever scores need to be summarized: the dataset runs comparison table, experiment detail views, and dashboard widgets. runsByDatasetIdMetrics is the primary tRPC endpoint for the experiment results page, providing all the data needed to render the runs comparison table.

Code Reference

Source Location

  • Repository: langfuse
  • File (aggregateScores): web/src/features/scores/lib/aggregateScores.ts
  • Lines: 72-136
  • File (runsByDatasetIdMetrics): web/src/features/datasets/server/dataset-router.ts
  • Lines: 561-608

Signature

// Score aggregation function
export const aggregateScores = <T extends ScoreToAggregate>(
  scores: T[],
): ScoreAggregate => { ... }

// Supporting functions
export const composeAggregateScoreKey = ({
  name,
  source,
  dataType,
}: {
  name: string;
  source: ScoreSourceType;
  dataType: AggregatableScoreDataType;
}): string => { ... }

export const decomposeAggregateScoreKey = (
  key: string,
): {
  name: string;
  source: ScoreSourceType;
  dataType: AggregatableScoreDataType;
} => { ... }

export const normalizeScoreName = (name: string): string => { ... }

export const resolveAggregateType = (
  dataType: AggregatableScoreDataType,
): "NUMERIC" | "CATEGORICAL" => { ... }

// tRPC query for run metrics
datasetRouter.runsByDatasetIdMetrics: protectedProjectProcedure
  .input(datasetRunTableMetricsSchema)
  .query(async ({ input }) => { ... })

Import

import {
  aggregateScores,
  composeAggregateScoreKey,
  decomposeAggregateScoreKey,
  normalizeScoreName,
  resolveAggregateType,
} from "@/src/features/scores/lib/aggregateScores";

I/O Contract

Inputs (aggregateScores)

Name Type Required Description
scores T[] (extends ScoreToAggregate) Yes Array of score objects. Each must have name (string), source (ScoreSourceType), dataType (AggregatableScoreDataType), value (number or null), stringValue (string or null), comment (string or null), id (string), timestamp (Date or null), and optionally hasMetadata (boolean).

Outputs (aggregateScores)

Name Type Description
ScoreAggregate Record<string, NumericAggregate or CategoricalAggregate> A map from composite score key to aggregate data. Keys have format {normalizedName}-{source}-{dataType}.

NumericAggregate structure:

Field Type Description
type "NUMERIC" Discriminant field.
values number[] All individual score values.
average number Arithmetic mean of all values.
comment string or undefined Present only when there is exactly one score.
id string or undefined Present only when there is exactly one score.
hasMetadata boolean or undefined Present only when there is exactly one score.
timestamp Date or undefined Present only when there is exactly one score.

CategoricalAggregate structure:

Field Type Description
type "CATEGORICAL" Discriminant field.
values string[] All individual string values.
valueCounts Array<{ value: string; count: number }> Frequency distribution of values.
comment string or undefined Present only when there is exactly one score.
id string or undefined Present only when there is exactly one score.
hasMetadata boolean or undefined Present only when there is exactly one score.
timestamp Date or undefined Present only when there is exactly one score.
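The two aggregate shapes above form a discriminated union on the `type` field. A sketch under the assumption that the exported types mirror the tables (the actual declarations in `aggregateScores.ts` may differ in detail):

```typescript
// Metadata fields present only when a group contained exactly one score.
type SingleScoreMeta = {
  comment?: string;
  id?: string;
  hasMetadata?: boolean;
  timestamp?: Date;
};

type NumericAggregate = SingleScoreMeta & {
  type: "NUMERIC";
  values: number[];
  average: number;
};

type CategoricalAggregate = SingleScoreMeta & {
  type: "CATEGORICAL";
  values: string[];
  valueCounts: { value: string; count: number }[];
};

// Keys have the format {normalizedName}-{source}-{dataType}.
type ScoreAggregate = Record<string, NumericAggregate | CategoricalAggregate>;

// Checking the discriminant narrows to the right shape, so consumers
// can read `average` or `valueCounts` without casts.
function describe(agg: NumericAggregate | CategoricalAggregate): string {
  return agg.type === "NUMERIC"
    ? `avg=${agg.average}`
    : `categories=${agg.valueCounts.length}`;
}
```

This is the pattern UI code follows when rendering a cell: switch on `type`, then read the branch-specific fields.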

Inputs (runsByDatasetIdMetrics)

Name Type Required Description
projectId string Yes The project ID.
datasetId string Yes The dataset ID to fetch runs for.
runIds string[] No Optional list of specific run IDs to filter.
filter FilterState No Optional filter conditions for the runs query.

Outputs (runsByDatasetIdMetrics)

Name Type Description
runs Array<RunMetrics> Array of per-run metric objects.
runs[].id string The dataset run ID.
runs[].name string The dataset run name.
runs[].countRunItems number Number of dataset run items in this run.
runs[].avgTotalCost number or null Average cost per item.
runs[].totalCost number or null Total cost for the entire run.
runs[].avgLatency number or null Average latency per item in milliseconds.
runs[].scores ScoreAggregate Aggregated trace-level scores for this run.
runs[].runScores ScoreAggregate Aggregated run-level scores for this run.

Usage Examples

Aggregating Numeric Scores

import { aggregateScores } from "@/src/features/scores/lib/aggregateScores";

const scores = [
  { name: "correctness", source: "EVAL", dataType: "NUMERIC", value: 0.8, stringValue: null, comment: null, id: "s1", timestamp: new Date() },
  { name: "correctness", source: "EVAL", dataType: "NUMERIC", value: 0.9, stringValue: null, comment: null, id: "s2", timestamp: new Date() },
  { name: "correctness", source: "EVAL", dataType: "NUMERIC", value: 0.7, stringValue: null, comment: null, id: "s3", timestamp: new Date() },
];

const result = aggregateScores(scores);
// result = {
//   "correctness-EVAL-NUMERIC": {
//     type: "NUMERIC",
//     values: [0.8, 0.9, 0.7],
//     average: 0.8,
//     comment: undefined,
//     id: undefined,
//     hasMetadata: undefined,
//     timestamp: undefined,
//   }
// }

Aggregating Categorical Scores

const scores = [
  { name: "sentiment", source: "ANNOTATION", dataType: "CATEGORICAL", value: null, stringValue: "positive", comment: null, id: "s1", timestamp: new Date() },
  { name: "sentiment", source: "ANNOTATION", dataType: "CATEGORICAL", value: null, stringValue: "negative", comment: null, id: "s2", timestamp: new Date() },
  { name: "sentiment", source: "ANNOTATION", dataType: "CATEGORICAL", value: null, stringValue: "positive", comment: null, id: "s3", timestamp: new Date() },
];

const result = aggregateScores(scores);
// result = {
//   "sentiment-ANNOTATION-CATEGORICAL": {
//     type: "CATEGORICAL",
//     values: ["positive", "negative", "positive"],
//     valueCounts: [
//       { value: "positive", count: 2 },
//       { value: "negative", count: 1 },
//     ],
//   }
// }

Fetching Run Metrics via tRPC

const { data } = trpc.datasets.runsByDatasetIdMetrics.useQuery({
  projectId: "proj_abc123",
  datasetId: "ds_def456",
  runIds: ["run_001", "run_002"],
});

// data.runs = [
//   {
//     id: "run_001",
//     name: "gpt4-experiment",
//     countRunItems: 50,
//     avgTotalCost: 0.0042,
//     totalCost: 0.21,
//     avgLatency: 1250,
//     scores: { "correctness-EVAL-NUMERIC": { type: "NUMERIC", average: 0.85, ... } },
//     runScores: { ... },
//   },
//   {
//     id: "run_002",
//     name: "claude-experiment",
//     countRunItems: 50,
//     avgTotalCost: 0.0038,
//     totalCost: 0.19,
//     avgLatency: 980,
//     scores: { "correctness-EVAL-NUMERIC": { type: "NUMERIC", average: 0.88, ... } },
//     runScores: { ... },
//   },
// ]

Composite Key Utilities

import {
  composeAggregateScoreKey,
  decomposeAggregateScoreKey,
  normalizeScoreName,
  getScoreLabelFromKey,
} from "@/src/features/scores/lib/aggregateScores";

// Composing a key
const key = composeAggregateScoreKey({
  name: "answer-quality",
  source: "EVAL",
  dataType: "NUMERIC",
});
// key = "answer_quality-EVAL-NUMERIC"

// Decomposing a key
const { name, source, dataType } = decomposeAggregateScoreKey(key);
// name = "answer_quality", source = "EVAL", dataType = "NUMERIC"

// Getting a display label
const label = getScoreLabelFromKey(key);
// label = "# answer_quality (eval)"  (with numeric icon prefix)

Related Pages

Implements Principle
