Implementation: Langfuse aggregateScores
| Knowledge Sources | |
|---|---|
| Domains | LLM Evaluation, Data Analytics |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Concrete tool for aggregating experiment scores by composite key and computing per-run metrics, provided by Langfuse.
Description
The results aggregation system consists of two main components:
1. `aggregateScores` (pure function)
A generic function that takes an array of score objects and produces a `ScoreAggregate`: a record keyed by composite score key with either numeric or categorical aggregate values.
The function operates in two steps:
- Grouping: Scores are grouped by a composite key generated by `composeAggregateScoreKey`, which combines the normalized name (hyphens and dots replaced with underscores via `normalizeScoreName`), the source type, and the data type into a string like `correctness-EVAL-NUMERIC`.
- Aggregation: For each group, `resolveAggregateType` maps the data type to either `"NUMERIC"` or `"CATEGORICAL"` (boolean scores are treated as categorical). Numeric groups produce an average; categorical groups produce value counts. When a group contains exactly one score, the aggregate preserves the individual score's `comment`, `id`, `hasMetadata`, and `timestamp` fields for direct access in the UI.
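The two steps can be sketched in a few lines. This is a simplified illustration, not the actual Langfuse implementation: the `Score` shape is trimmed down, and the single-score passthrough of `comment`, `id`, `hasMetadata`, and `timestamp` is omitted.

```typescript
// Simplified two-step aggregation sketch (illustration only).
type Score = {
  name: string;
  source: string;
  dataType: "NUMERIC" | "CATEGORICAL" | "BOOLEAN";
  value: number | null;
  stringValue: string | null;
};

// Hyphens and dots in names become underscores, as normalizeScoreName does
const normalize = (name: string): string => name.replace(/[-.]/g, "_");

function aggregate(scores: Score[]): Record<string, unknown> {
  // Step 1: group by the composite key `{normalizedName}-{source}-{dataType}`
  const groups = new Map<string, Score[]>();
  for (const s of scores) {
    const key = `${normalize(s.name)}-${s.source}-${s.dataType}`;
    groups.set(key, [...(groups.get(key) ?? []), s]);
  }

  // Step 2: numeric groups average; categorical (and boolean) groups count
  const result: Record<string, unknown> = {};
  for (const [key, group] of groups) {
    if (group[0].dataType === "NUMERIC") {
      const values = group.map((s) => s.value ?? 0);
      const average = values.reduce((sum, v) => sum + v, 0) / values.length;
      result[key] = { type: "NUMERIC", values, average };
    } else {
      const values = group.map((s) => s.stringValue ?? "");
      const counts = new Map<string, number>();
      for (const v of values) counts.set(v, (counts.get(v) ?? 0) + 1);
      result[key] = {
        type: "CATEGORICAL",
        values,
        valueCounts: [...counts.entries()].map(([value, count]) => ({ value, count })),
      };
    }
  }
  return result;
}
```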
2. `datasetRouter.runsByDatasetIdMetrics` (tRPC query)
A tRPC query procedure that fetches per-run operational metrics from ClickHouse and combines them with aggregated trace scores and run scores. It:
- Calls `getDatasetRunsTableMetricsCh` to get `countRunItems`, `avgTotalCost`, `totalCost`, and `avgLatency` for each run.
- Fetches trace-level scores via `getTraceScoresForDatasetRuns` and run-level scores via `getScoresForDatasetRuns`.
- Maps each run to an object containing the metrics plus two separate `ScoreAggregate` objects: one for trace scores and one for run scores.
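The final mapping step might look roughly like the following. The shapes and the `combineRunMetrics` name are assumptions for illustration, not code from the router.

```typescript
// Hypothetical sketch of the router's final mapping step: join per-run
// ClickHouse metrics with two independent score aggregates per run.
type ScoreAggregate = Record<string, unknown>;

type RunMetrics = {
  id: string;
  name: string;
  countRunItems: number;
  avgTotalCost: number | null;
  totalCost: number | null;
  avgLatency: number | null;
};

function combineRunMetrics(
  metrics: RunMetrics[],
  traceScoresByRun: Map<string, ScoreAggregate>,
  runScoresByRun: Map<string, ScoreAggregate>,
) {
  return metrics.map((m) => ({
    ...m,
    scores: traceScoresByRun.get(m.id) ?? {}, // trace-level aggregates
    runScores: runScoresByRun.get(m.id) ?? {}, // run-level aggregates
  }));
}
```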
Usage
`aggregateScores` is used wherever scores need to be summarized: the dataset runs comparison table, experiment detail views, and dashboard widgets. `runsByDatasetIdMetrics` is the primary tRPC endpoint for the experiment results page, providing all the data needed to render the runs comparison table.
Code Reference
Source Location
- Repository: langfuse
- File (aggregateScores): web/src/features/scores/lib/aggregateScores.ts
- Lines: 72-136
- File (runsByDatasetIdMetrics): web/src/features/datasets/server/dataset-router.ts
- Lines: 561-608
Signature
// Score aggregation function
export const aggregateScores = <T extends ScoreToAggregate>(
scores: T[],
): ScoreAggregate => { ... }
// Supporting functions
export const composeAggregateScoreKey = ({
name,
source,
dataType,
}: {
name: string;
source: ScoreSourceType;
dataType: AggregatableScoreDataType;
}): string => { ... }
export const decomposeAggregateScoreKey = (
key: string,
): {
name: string;
source: ScoreSourceType;
dataType: AggregatableScoreDataType;
} => { ... }
export const normalizeScoreName = (name: string): string => { ... }
export const resolveAggregateType = (
dataType: AggregatableScoreDataType,
): "NUMERIC" | "CATEGORICAL" => { ... }
// tRPC query for run metrics
datasetRouter.runsByDatasetIdMetrics: protectedProjectProcedure
.input(datasetRunTableMetricsSchema)
.query(async ({ input }) => { ... })
Import
import {
aggregateScores,
composeAggregateScoreKey,
decomposeAggregateScoreKey,
normalizeScoreName,
resolveAggregateType,
} from "@/src/features/scores/lib/aggregateScores";
I/O Contract
Inputs (aggregateScores)
| Name | Type | Required | Description |
|---|---|---|---|
| scores | T[] (extends ScoreToAggregate) | Yes | Array of score objects. Each must have name (string), source (ScoreSourceType), dataType (AggregatableScoreDataType), value (number or null), stringValue (string or null), comment (string or null), id (string), timestamp (Date or null), and optionally hasMetadata (boolean). |
Outputs (aggregateScores)
| Name | Type | Description |
|---|---|---|
| ScoreAggregate | Record<string, NumericAggregate or CategoricalAggregate> | A map from composite score key to aggregate data. Keys have the format {normalizedName}-{source}-{dataType}. |
NumericAggregate structure:
| Field | Type | Description |
|---|---|---|
| type | "NUMERIC" | Discriminant field. |
| values | number[] | All individual score values. |
| average | number | Arithmetic mean of all values. |
| comment | string or undefined | Present only when there is exactly one score. |
| id | string or undefined | Present only when there is exactly one score. |
| hasMetadata | boolean or undefined | Present only when there is exactly one score. |
| timestamp | Date or undefined | Present only when there is exactly one score. |
CategoricalAggregate structure:
| Field | Type | Description |
|---|---|---|
| type | "CATEGORICAL" | Discriminant field. |
| values | string[] | All individual string values. |
| valueCounts | Array<{ value: string; count: number }> | Frequency distribution of values. |
| comment | string or undefined | Present only when there is exactly one score. |
| id | string or undefined | Present only when there is exactly one score. |
| hasMetadata | boolean or undefined | Present only when there is exactly one score. |
| timestamp | Date or undefined | Present only when there is exactly one score. |
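Because both aggregate shapes share a `type` discriminant, consumers can narrow them with a plain conditional. The types below mirror the tables above but are a sketch, not the exported Langfuse types:

```typescript
// Sketch of consumer-side narrowing on the shared `type` discriminant.
type NumericAggregate = {
  type: "NUMERIC";
  values: number[];
  average: number;
};

type CategoricalAggregate = {
  type: "CATEGORICAL";
  values: string[];
  valueCounts: { value: string; count: number }[];
};

type Aggregate = NumericAggregate | CategoricalAggregate;

// TypeScript narrows `agg` inside each branch, so `average` and `valueCounts`
// are only accessible where they actually exist.
function describeAggregate(agg: Aggregate): string {
  if (agg.type === "NUMERIC") {
    return `avg=${agg.average.toFixed(2)} (n=${agg.values.length})`;
  }
  return agg.valueCounts.map((v) => `${v.value}: ${v.count}`).join(", ");
}
```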
Inputs (runsByDatasetIdMetrics)
| Name | Type | Required | Description |
|---|---|---|---|
| projectId | string | Yes | The project ID. |
| datasetId | string | Yes | The dataset ID to fetch runs for. |
| runIds | string[] | No | Optional list of specific run IDs to filter. |
| filter | FilterState | No | Optional filter conditions for the runs query. |
Outputs (runsByDatasetIdMetrics)
| Name | Type | Description |
|---|---|---|
| runs | Array<RunMetrics> | Array of per-run metric objects. |
| runs[].id | string | The dataset run ID. |
| runs[].name | string | The dataset run name. |
| runs[].countRunItems | number | Number of dataset run items in this run. |
| runs[].avgTotalCost | number or null | Average cost per item. |
| runs[].totalCost | number or null | Total cost for the entire run. |
| runs[].avgLatency | number or null | Average latency per item in milliseconds. |
| runs[].scores | ScoreAggregate | Aggregated trace-level scores for this run. |
| runs[].runScores | ScoreAggregate | Aggregated run-level scores for this run. |
Usage Examples
Aggregating Numeric Scores
import { aggregateScores } from "@/src/features/scores/lib/aggregateScores";
const scores = [
{ name: "correctness", source: "EVAL", dataType: "NUMERIC", value: 0.8, stringValue: null, comment: null, id: "s1", timestamp: new Date() },
{ name: "correctness", source: "EVAL", dataType: "NUMERIC", value: 0.9, stringValue: null, comment: null, id: "s2", timestamp: new Date() },
{ name: "correctness", source: "EVAL", dataType: "NUMERIC", value: 0.7, stringValue: null, comment: null, id: "s3", timestamp: new Date() },
];
const result = aggregateScores(scores);
// result = {
// "correctness-EVAL-NUMERIC": {
// type: "NUMERIC",
// values: [0.8, 0.9, 0.7],
// average: 0.8,
// comment: undefined,
// id: undefined,
// hasMetadata: undefined,
// timestamp: undefined,
// }
// }
Aggregating Categorical Scores
const scores = [
{ name: "sentiment", source: "ANNOTATION", dataType: "CATEGORICAL", value: null, stringValue: "positive", comment: null, id: "s1", timestamp: new Date() },
{ name: "sentiment", source: "ANNOTATION", dataType: "CATEGORICAL", value: null, stringValue: "negative", comment: null, id: "s2", timestamp: new Date() },
{ name: "sentiment", source: "ANNOTATION", dataType: "CATEGORICAL", value: null, stringValue: "positive", comment: null, id: "s3", timestamp: new Date() },
];
const result = aggregateScores(scores);
// result = {
// "sentiment-ANNOTATION-CATEGORICAL": {
// type: "CATEGORICAL",
// values: ["positive", "negative", "positive"],
// valueCounts: [
// { value: "positive", count: 2 },
// { value: "negative", count: 1 },
// ],
// }
// }
Fetching Run Metrics via tRPC
const { data } = trpc.datasets.runsByDatasetIdMetrics.useQuery({
projectId: "proj_abc123",
datasetId: "ds_def456",
runIds: ["run_001", "run_002"],
});
// data.runs = [
// {
// id: "run_001",
// name: "gpt4-experiment",
// countRunItems: 50,
// avgTotalCost: 0.0042,
// totalCost: 0.21,
// avgLatency: 1250,
// scores: { "correctness-EVAL-NUMERIC": { type: "NUMERIC", average: 0.85, ... } },
// runScores: { ... },
// },
// {
// id: "run_002",
// name: "claude-experiment",
// countRunItems: 50,
// avgTotalCost: 0.0038,
// totalCost: 0.19,
// avgLatency: 980,
// scores: { "correctness-EVAL-NUMERIC": { type: "NUMERIC", average: 0.88, ... } },
// runScores: { ... },
// },
// ]
Composite Key Utilities
import {
composeAggregateScoreKey,
decomposeAggregateScoreKey,
normalizeScoreName,
getScoreLabelFromKey,
} from "@/src/features/scores/lib/aggregateScores";
// Composing a key
const key = composeAggregateScoreKey({
name: "answer-quality",
source: "EVAL",
dataType: "NUMERIC",
});
// key = "answer_quality-EVAL-NUMERIC"
// Decomposing a key
const { name, source, dataType } = decomposeAggregateScoreKey(key);
// name = "answer_quality", source = "EVAL", dataType = "NUMERIC"
// Getting a display label
const label = getScoreLabelFromKey(key);
// label = "# answer_quality (eval)" (with numeric icon prefix)
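The round trip above works because name normalization strips hyphens, leaving `-` unambiguous as the composite-key separator. A minimal sketch of that invariant (assumed behavior, not the actual implementation):

```typescript
// Normalization removes "-" and "." from names, so splitting the key on "-"
// cleanly recovers the three parts.
const normalizeScoreName = (name: string): string => name.replace(/[-.]/g, "_");

const composeKey = (name: string, source: string, dataType: string): string =>
  `${normalizeScoreName(name)}-${source}-${dataType}`;

const decomposeKey = (key: string) => {
  const [name, source, dataType] = key.split("-");
  return { name, source, dataType };
};
```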