Principle:Langfuse Langfuse Score Validation and Creation
| Knowledge Sources | |
|---|---|
| Domains | LLM Evaluation, Data Validation |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Score Validation and Creation is the principle of validating the raw LLM judge response against an expected schema and constructing a typed score event that is persisted to S3 and enqueued for ingestion back into the Langfuse data pipeline.
Description
After the LLM judge returns a response, the system must validate that the response conforms to the expected output format and transform it into a score event that can be persisted and displayed alongside the original trace data. Score Validation and Creation handles this critical transition from raw LLM output to a durable, queryable score record.
The process involves two distinct steps:
- Response Validation -- The raw LLM response (which should contain score and reasoning fields from structured output) is validated against a Zod v3 schema using safeParse. This ensures the score is a valid number and the reasoning is a valid string. If validation fails, the error message is captured and the evaluation is marked as an unrecoverable error (no retry, since repeating the same LLM call would likely produce the same invalid response).
- Score Event Construction -- A validated response is transformed into a ScoreEventType object conforming to Langfuse's event ingestion format. The score event includes:
  - A unique event ID and score ID
  - The target trace ID and optional observation ID (linking the score to its evaluated data)
  - The score name (from the job configuration)
  - The numeric score value and reasoning comment
  - The source set to "EVAL" (distinguishing automated evaluations from human annotations)
  - The environment inherited from the extracted variables
  - An execution trace ID linking back to the evaluation's own internal trace
  - Metadata including job execution and configuration IDs for debugging
  - The data type "NUMERIC" indicating this is a numeric score
The constructed score event is then uploaded to S3 as a JSON blob and enqueued to the ingestion queue for asynchronous processing by the standard Langfuse ingestion pipeline. This two-step persistence (S3 write + queue enqueue) provides durability: even if the ingestion queue processing is delayed, the score data is safely stored in S3.
Usage
Use Score Validation and Creation when:
- You need to understand how LLM evaluation results become queryable scores in Langfuse
- You are debugging score creation failures or missing scores
- You need to understand the score event format for integration purposes
- You want to understand the validation logic that determines whether an LLM response is acceptable
Theoretical Basis
The Score Validation and Creation principle implements a validate-transform-persist pipeline:
Step 1 - Schema Construction:
    FUNCTION buildEvalScoreSchema(outputSchema):
        // Uses Zod v3 because LLM completion service requires v3
        RETURN zodV3.object({
            reasoning: zodV3.string().describe(outputSchema.reasoning),
            score: zodV3.number().describe(outputSchema.score),
        })
The score field description from the eval template (e.g., "Relevance score from 0 to 10") is passed as the Zod .describe() annotation, which serves a dual purpose: it guides the LLM's structured output generation and documents the schema for human readers.
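One common way such descriptions reach a model is as the description fields of a JSON Schema attached to a structured-output request. The sketch below illustrates that idea without Zod; the OutputSchema and JudgeOutputJsonSchema types and the describeJudgeOutput helper are assumptions made for illustration, not Langfuse's code.

```typescript
// Hypothetical shape of the field descriptions stored on an eval template.
interface OutputSchema {
  reasoning: string;
  score: string;
}

// Minimal JSON-Schema-like description of the expected judge output.
interface JudgeOutputJsonSchema {
  type: "object";
  properties: {
    reasoning: { type: "string"; description: string };
    score: { type: "number"; description: string };
  };
  required: ["reasoning", "score"];
  additionalProperties: false;
}

// Copies the template's descriptions onto the schema fields so that both the
// model and a human reader see what each field is supposed to contain.
function describeJudgeOutput(outputSchema: OutputSchema): JudgeOutputJsonSchema {
  return {
    type: "object",
    properties: {
      reasoning: { type: "string", description: outputSchema.reasoning },
      score: { type: "number", description: outputSchema.score },
    },
    required: ["reasoning", "score"],
    additionalProperties: false,
  };
}

const judgeSchema = describeJudgeOutput({
  reasoning: "One sentence explaining the score",
  score: "Relevance score from 0 to 10",
});
```

Here judgeSchema.properties.score.description carries the template text unchanged, which is the role the .describe() annotation plays in the Zod schema.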
Step 2 - Response Validation:
    FUNCTION validateLLMResponse(response, schema):
        result = schema.safeParse(response)
        IF result.success:
            RETURN { success: true, data: { score: result.data.score, reasoning: result.data.reasoning } }
        ELSE:
            RETURN { success: false, error: result.error.message }
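safeParse returns a result object instead of throwing as parse does, so a validation failure is handled as an ordinary value rather than as an exception escaping from the schema call. The caller inspects the success flag and, on failure, throws an UnrecoverableError so that the job is not retried.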
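The result-object pattern can be sketched without any dependency. In the sketch below a hand-written check stands in for schema.safeParse, and UnrecoverableError is a local stand-in class; requireValidScore is a hypothetical caller, not a Langfuse function.

```typescript
interface EvalScore {
  score: number;
  reasoning: string;
}

// Same shape as the pseudocode's return value: a tagged success or failure.
type ValidationResult =
  | { success: true; data: EvalScore }
  | { success: false; error: string };

// Stand-in for schema.safeParse: never throws, reports failure as a value.
function validateLLMResponse(response: unknown): ValidationResult {
  if (typeof response !== "object" || response === null) {
    return { success: false, error: "expected an object" };
  }
  const { score, reasoning } = response as Record<string, unknown>;
  // NaN has type "number" but is not a usable score, so it is rejected too.
  if (typeof score !== "number" || Number.isNaN(score)) {
    return { success: false, error: "score: expected a number" };
  }
  if (typeof reasoning !== "string") {
    return { success: false, error: "reasoning: expected a string" };
  }
  return { success: true, data: { score, reasoning } };
}

// Local stand-in for the error type a job worker treats as "do not retry".
class UnrecoverableError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "UnrecoverableError";
  }
}

// Hypothetical caller: turns a failed validation into a non-retryable error.
function requireValidScore(response: unknown): EvalScore {
  const result = validateLLMResponse(response);
  if (result.success === false) {
    throw new UnrecoverableError(`Invalid LLM response: ${result.error}`);
  }
  return result.data;
}
```

Calling requireValidScore({ score: 7, reasoning: "Relevant" }) returns the typed score, while a response such as { score: "7", reasoning: "Relevant" } raises the non-retryable error with the captured message.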
Step 3 - Score Event Construction:
    FUNCTION buildScoreEvent(params):
        RETURN {
            id: params.eventId,
            timestamp: current ISO timestamp,
            type: "score-create",
            body: {
                id: params.scoreId,
                traceId: params.traceId,                   // Link to evaluated trace
                observationId: params.observationId,       // Optional: link to specific observation
                name: params.scoreName,                    // From job configuration
                value: params.value,                       // Numeric score from LLM
                comment: params.reasoning,                 // LLM's explanation
                source: "EVAL",                            // Distinguishes from manual/API scores
                environment: params.environment,           // Inherited from trace data
                executionTraceId: params.executionTraceId, // Link to eval's own trace
                metadata: params.metadata,                 // Job execution details
                dataType: "NUMERIC"                        // Score type classification
            }
        }
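A typed version of this builder might look like the sketch below. The field names follow the pseudocode, but the ScoreEventParams and ScoreEvent interfaces are assumptions made for illustration and are not Langfuse's actual ScoreEventType. Declaring source and dataType as literal types means the builder cannot emit any other value for them.

```typescript
interface ScoreEventParams {
  eventId: string;
  scoreId: string;
  traceId: string;
  observationId?: string; // only set when a single observation is scored
  scoreName: string;
  value: number;
  reasoning: string;
  environment: string;
  executionTraceId: string;
  metadata: Record<string, string>;
}

interface ScoreEvent {
  id: string;
  timestamp: string;
  type: "score-create";
  body: {
    id: string;
    traceId: string;
    observationId?: string;
    name: string;
    value: number;
    comment: string;
    source: "EVAL";
    environment: string;
    executionTraceId: string;
    metadata: Record<string, string>;
    dataType: "NUMERIC";
  };
}

// The clock is injectable so the timestamp is deterministic in tests.
function buildScoreEvent(params: ScoreEventParams, now: Date = new Date()): ScoreEvent {
  return {
    id: params.eventId,
    timestamp: now.toISOString(),
    type: "score-create",
    body: {
      id: params.scoreId,
      traceId: params.traceId,
      observationId: params.observationId,
      name: params.scoreName,
      value: params.value,
      comment: params.reasoning,
      source: "EVAL",
      environment: params.environment,
      executionTraceId: params.executionTraceId,
      metadata: params.metadata,
      dataType: "NUMERIC",
    },
  };
}

// Example call with made-up identifiers and no observation ID.
const exampleEvent = buildScoreEvent(
  {
    eventId: "evt-1",
    scoreId: "score-1",
    traceId: "trace-789",
    scoreName: "relevance",
    value: 7,
    reasoning: "The answer addresses the question directly.",
    environment: "default",
    executionTraceId: "exec-trace-1",
    metadata: { job_execution_id: "exec-456" },
  },
  new Date("2026-02-14T00:00:00.000Z"),
);
```

Because observationId is left undefined in this call, JSON.stringify(exampleEvent) omits the key entirely, so a trace-level score serializes without an observation link.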
Step 4 - Two-Phase Persistence:
    // Phase 1: Durable storage
    UPLOAD score event JSON to S3
        bucket: ingestion bucket
        key: projectId/scoreId/eventId

    // Phase 2: Async processing
    ENQUEUE to IngestionQueue {
        projectId, scoreId, eventId
    }

    // The ingestion worker will read from S3 and write to PostgreSQL + ClickHouse
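The ordering of the two phases matters: the queue message should only be sent once the blob exists, so a worker never picks up a job whose payload is missing. The sketch below shows that ordering with in-memory stand-ins. BlobStore, IngestionQueue, and persistScoreEvent are hypothetical names, and the key layout simply mirrors the pseudocode; a real deployment would use an S3 client and a queue library instead.

```typescript
interface IngestionJob {
  projectId: string;
  scoreId: string;
  eventId: string;
}

// Hypothetical minimal interfaces for the two persistence targets.
interface BlobStore {
  put(key: string, body: string): Promise<void>;
}
interface IngestionQueue {
  add(job: IngestionJob): Promise<void>;
}

// Only the two IDs are needed here; the rest of the event is opaque payload.
interface PersistableEvent {
  id: string;
  body: { id: string };
}

async function persistScoreEvent(
  store: BlobStore,
  queue: IngestionQueue,
  projectId: string,
  event: PersistableEvent,
): Promise<string> {
  const key = `${projectId}/${event.body.id}/${event.id}`;
  // Phase 1: durable write. If this rejects, nothing is enqueued.
  await store.put(key, JSON.stringify(event));
  // Phase 2: notify the ingestion worker, which reads the blob by its IDs.
  await queue.add({ projectId, scoreId: event.body.id, eventId: event.id });
  return key;
}

// In-memory stand-ins, useful for unit tests.
class InMemoryBlobStore implements BlobStore {
  readonly blobs = new Map<string, string>();
  async put(key: string, body: string): Promise<void> {
    this.blobs.set(key, body);
  }
}
class InMemoryQueue implements IngestionQueue {
  readonly jobs: IngestionJob[] = [];
  async add(job: IngestionJob): Promise<void> {
    this.jobs.push(job);
  }
}
```

Note that the queue message carries only identifiers. The score payload itself travels through the blob store, which keeps queue messages small and leaves a durable copy available if processing is delayed.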
Metadata for Debugging:
The execution metadata included in every score event provides full traceability:
    metadata = {
        job_execution_id: "exec-456",        // The specific execution run
        job_configuration_id: "config-123",  // The evaluator configuration
        target_trace_id: "trace-789",        // The trace being evaluated
        target_observation_id: "obs-101",    // Optional observation being evaluated
        target_dataset_item_id: "item-202",  // Optional dataset item reference
    }
This metadata allows operators to trace any score back through the entire evaluation pipeline: from the job configuration that triggered it, through the specific execution, to the original trace data.
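A small helper can assemble this metadata from the execution context. The sketch below assumes that the two optional IDs are simply left out when absent; that behavior, the EvalExecutionContext type, and the buildExecutionMetadata name are all assumptions made for illustration.

```typescript
// Hypothetical description of one evaluation run.
interface EvalExecutionContext {
  jobExecutionId: string;
  jobConfigurationId: string;
  targetTraceId: string;
  targetObservationId?: string;
  targetDatasetItemId?: string;
}

function buildExecutionMetadata(ctx: EvalExecutionContext): Record<string, string> {
  // The three required IDs are always present.
  const metadata: Record<string, string> = {
    job_execution_id: ctx.jobExecutionId,
    job_configuration_id: ctx.jobConfigurationId,
    target_trace_id: ctx.targetTraceId,
  };
  // Optional IDs are added only when the evaluation actually has them.
  if (ctx.targetObservationId !== undefined) {
    metadata["target_observation_id"] = ctx.targetObservationId;
  }
  if (ctx.targetDatasetItemId !== undefined) {
    metadata["target_dataset_item_id"] = ctx.targetDatasetItemId;
  }
  return metadata;
}

// Trace-level evaluation: no observation or dataset item involved.
const traceLevelMetadata = buildExecutionMetadata({
  jobExecutionId: "exec-456",
  jobConfigurationId: "config-123",
  targetTraceId: "trace-789",
});
```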