Principle:Langfuse Langfuse Score Validation and Creation

Knowledge Sources	Langfuse
Domains	LLM Evaluation, Data Validation
Last Updated	2026-02-14 00:00 GMT

Overview

Score Validation and Creation is the principle of validating the raw LLM judge response against an expected schema and constructing a typed score event that is persisted to S3 and enqueued for ingestion back into the Langfuse data pipeline.

Description

After the LLM judge returns a response, the system must validate that the response conforms to the expected output format and transform it into a score event that can be persisted and displayed alongside the original trace data. Score Validation and Creation handles this critical transition from raw LLM output to a durable, queryable score record.

The process involves two distinct steps:

Response Validation -- The raw LLM response (which should contain score and reasoning fields from structured output) is validated against a Zod v3 schema using safeParse. This ensures the score is a valid number and the reasoning is a valid string. If validation fails, the error message is captured and the evaluation is marked as an unrecoverable error (no retry, since repeating the same LLM call would likely produce the same invalid response).

Score Event Construction -- A validated response is transformed into a ScoreEventType object conforming to Langfuse's event ingestion format. The score event includes:

- A unique event ID and score ID
- The target trace ID and optional observation ID (linking the score to its evaluated data)
- The score name (from the job configuration)
- The numeric score value and reasoning comment
- The source set to "EVAL" (distinguishing automated evaluations from human annotations)
- The environment inherited from the extracted variables
- An execution trace ID linking back to the evaluation's own internal trace
- Metadata including job execution and configuration IDs for debugging
- The data type "NUMERIC" indicating this is a numeric score

The constructed score event is then uploaded to S3 as a JSON blob and enqueued to the ingestion queue for asynchronous processing by the standard Langfuse ingestion pipeline. This two-step persistence (S3 write + queue enqueue) provides durability: even if the ingestion queue processing is delayed, the score data is safely stored in S3.

Usage

Use Score Validation and Creation when:

- You need to understand how LLM evaluation results become queryable scores in Langfuse
- You are debugging score creation failures or missing scores
- You need to understand the score event format for integration purposes
- You want to understand the validation logic that determines whether an LLM response is acceptable

Theoretical Basis

The Score Validation and Creation principle implements a validate-transform-persist pipeline:

Step 1 - Schema Construction:

FUNCTION buildEvalScoreSchema(outputSchema):
  // Uses Zod v3 because LLM completion service requires v3
  RETURN zodV3.object({
    reasoning: zodV3.string().describe(outputSchema.reasoning),
    score: zodV3.number().describe(outputSchema.score),
  })

The score field description from the eval template (e.g., "Relevance score from 0 to 10") is passed as the Zod .describe() annotation, which serves a dual purpose: it guides the LLM's structured output generation and documents the schema for human readers.
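
To make the role of the description concrete, the following dependency-free sketch hand-builds the JSON Schema that a { reasoning: string, score: number } schema roughly corresponds to. The EvalOutputSchema type and the toJsonSchema function are hypothetical names introduced for illustration, and the sketch assumes that descriptions surface as JSON Schema "description" fields; how the LLM completion service actually converts the Zod schema is not shown on this page.

```typescript
// Hypothetical shape of the eval template's output definition:
// each property holds the description text for that field.
interface EvalOutputSchema {
  reasoning: string;
  score: string;
}

// Hand-built JSON Schema equivalent of { reasoning: string, score: number }.
// Assumption: field descriptions appear as "description" entries that the
// LLM sees when generating structured output.
function toJsonSchema(outputSchema: EvalOutputSchema) {
  return {
    type: "object",
    properties: {
      reasoning: { type: "string", description: outputSchema.reasoning },
      score: { type: "number", description: outputSchema.score },
    },
    required: ["reasoning", "score"],
  };
}

const jsonSchema = toJsonSchema({
  reasoning: "One paragraph explaining the score",
  score: "Relevance score from 0 to 10",
});
```

The same description text therefore does double duty: it is the only instruction the model receives about the meaning of each field, and it documents the field for anyone reading the schema.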

Step 2 - Response Validation:

FUNCTION validateLLMResponse(response, schema):
  result = schema.safeParse(response)
  IF result.success:
    RETURN { success: true, data: { score: result.data.score, reasoning: result.data.reasoning } }
  ELSE:
    RETURN { success: false, error: result.error.message }

safeParse returns a result object instead of throwing, so a malformed response cannot surface as an unhandled exception from inside the validator. The caller inspects the success flag and, on failure, throws an UnrecoverableError so that the job is not retried.
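
The result-object pattern can be sketched in TypeScript as follows. To keep the example runnable without dependencies, a hand-written check stands in for the Zod schema's safeParse; the UnrecoverableError class and the parseOrFail helper are hypothetical local definitions, not the actual Langfuse code.

```typescript
// Result shape mirroring the pseudocode above.
type ValidationResult =
  | { success: true; data: { score: number; reasoning: string } }
  | { success: false; error: string };

// Stand-in for schema.safeParse: always returns a result, never throws.
function validateLLMResponse(response: unknown): ValidationResult {
  if (typeof response !== "object" || response === null) {
    return { success: false, error: "expected an object" };
  }
  const { score, reasoning } = response as Record<string, unknown>;
  if (typeof score !== "number" || score !== score) {
    // score !== score is true only for NaN
    return { success: false, error: "score: expected a number" };
  }
  if (typeof reasoning !== "string") {
    return { success: false, error: "reasoning: expected a string" };
  }
  return { success: true, data: { score, reasoning } };
}

// Hypothetical error class marking failures that must not be retried.
class UnrecoverableError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "UnrecoverableError";
  }
}

// Caller: branch on the success flag and convert a validation failure
// into a non-retryable error.
function parseOrFail(response: unknown): { score: number; reasoning: string } {
  const result = validateLLMResponse(response);
  if (result.success === false) {
    throw new UnrecoverableError("Invalid LLM response: " + result.error);
  }
  return result.data;
}
```

Keeping the throw in the caller rather than in the validator means the decision about retry behavior lives in one place, next to the job-handling logic.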

Step 3 - Score Event Construction:

FUNCTION buildScoreEvent(params):
  RETURN {
    id: params.eventId,
    timestamp: current ISO timestamp,
    type: "score-create",
    body: {
      id: params.scoreId,
      traceId: params.traceId,           // Link to evaluated trace
      observationId: params.observationId, // Optional: link to specific observation
      name: params.scoreName,             // From job configuration
      value: params.value,                // Numeric score from LLM
      comment: params.reasoning,          // LLM's explanation
      source: "EVAL",                     // Distinguishes from manual/API scores
      environment: params.environment,     // Inherited from trace data
      executionTraceId: params.executionTraceId, // Link to eval's own trace
      metadata: params.metadata,           // Job execution details
      dataType: "NUMERIC"                 // Score type classification
    }
  }
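
A typed TypeScript version of this construction might look like the sketch below. The interfaces BuildScoreEventParams and ScoreEventSketch are simplified, hypothetical approximations of the fields listed in the pseudocode; they are not the real ScoreEventType definition, and the example values are invented.

```typescript
// Hypothetical parameter type covering the fields used in the pseudocode.
interface BuildScoreEventParams {
  eventId: string;
  scoreId: string;
  traceId: string;
  observationId?: string; // optional: only set for observation-level evals
  scoreName: string;
  value: number;
  reasoning: string;
  environment: string;
  executionTraceId: string;
  metadata: Record<string, string>;
}

// Simplified approximation of the score event shape (not the real type).
interface ScoreEventSketch {
  id: string;
  timestamp: string;
  type: "score-create";
  body: {
    id: string;
    traceId: string;
    observationId?: string;
    name: string;
    value: number;
    comment: string;
    source: "EVAL";
    environment: string;
    executionTraceId: string;
    metadata: Record<string, string>;
    dataType: "NUMERIC";
  };
}

function buildScoreEvent(params: BuildScoreEventParams): ScoreEventSketch {
  return {
    id: params.eventId,
    timestamp: new Date().toISOString(),
    type: "score-create",
    body: {
      id: params.scoreId,
      traceId: params.traceId,
      observationId: params.observationId,
      name: params.scoreName,
      value: params.value,
      comment: params.reasoning, // the LLM's explanation becomes the comment
      source: "EVAL",
      environment: params.environment,
      executionTraceId: params.executionTraceId,
      metadata: params.metadata,
      dataType: "NUMERIC",
    },
  };
}

const exampleEvent = buildScoreEvent({
  eventId: "evt-1",
  scoreId: "score-1",
  traceId: "trace-789",
  scoreName: "relevance",
  value: 8,
  reasoning: "The answer addresses the question directly.",
  environment: "production",
  executionTraceId: "exec-trace-1",
  metadata: { job_execution_id: "exec-456", job_configuration_id: "config-123" },
});
```

Typing source and dataType as string literals lets the compiler reject any event that is not marked as an automated numeric evaluation.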

Step 4 - Two-Phase Persistence:

// Phase 1: Durable storage
UPLOAD score event JSON to S3
  bucket: ingestion bucket
  key: projectId/scoreId/eventId

// Phase 2: Async processing
ENQUEUE to IngestionQueue {
  projectId, scoreId, eventId
}
// The ingestion worker will read from S3 and write to PostgreSQL + ClickHouse
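
The ordering of the two phases can be sketched as below. BlobStore and IngestionQueue are hypothetical minimal interfaces, and the in-memory objects are stand-ins so the example runs without an S3 client or queue library; the key layout follows the pseudocode above, and serializing the bare event with JSON.stringify is an assumption about the payload format.

```typescript
// Hypothetical minimal interfaces; real S3 and queue clients are richer.
interface BlobStore {
  upload(key: string, body: string): Promise<void>;
}

interface IngestionJob {
  projectId: string;
  scoreId: string;
  eventId: string;
}

interface IngestionQueue {
  add(job: IngestionJob): Promise<void>;
}

// Only the ids are needed here, so the event type is kept minimal.
interface PersistableEvent {
  id: string;
  body: { id: string };
}

async function persistScoreEvent(
  store: BlobStore,
  queue: IngestionQueue,
  projectId: string,
  event: PersistableEvent,
): Promise<string> {
  const key = projectId + "/" + event.body.id + "/" + event.id;
  // Phase 1: write the payload first, so it exists before any job refers to it.
  await store.upload(key, JSON.stringify(event));
  // Phase 2: enqueue a small pointer job; the worker reads the payload later.
  await queue.add({ projectId, scoreId: event.body.id, eventId: event.id });
  return key;
}

// In-memory stand-ins for illustration and testing.
const blobs: Record<string, string> = {};
const jobs: IngestionJob[] = [];
const memoryStore: BlobStore = {
  async upload(key, body) {
    blobs[key] = body;
  },
};
const memoryQueue: IngestionQueue = {
  async add(job) {
    jobs.push(job);
  },
};
```

Uploading before enqueueing means a worker should never pick up a job whose payload is missing. If the upload fails, nothing is enqueued; if the enqueue fails, the payload is already stored and the enqueue can be attempted again.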

Metadata for Debugging:

The execution metadata included in every score event provides full traceability:

metadata = {
  job_execution_id: "exec-456",        // The specific execution run
  job_configuration_id: "config-123",  // The evaluator configuration
  target_trace_id: "trace-789",        // The trace being evaluated
  target_observation_id: "obs-101",    // Optional observation being evaluated
  target_dataset_item_id: "item-202",  // Optional dataset item reference
}

This metadata allows operators to trace any score back through the entire evaluation pipeline: from the job configuration that triggered it, through the specific execution, to the original trace data.

Related Pages

Implemented By

Implementation:Langfuse_Langfuse_BuildScoreEvent
