Principle: Langfuse LLM Completion for Evaluation
| Knowledge Sources | |
|---|---|
| Domains | LLM Integration, LLM Evaluation |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
LLM Completion for Evaluation is the principle of invoking a language model through a unified multi-provider interface, with structured output enforcement, to obtain a score and reasoning from an LLM judge in a deterministic response format.
Description
Once an evaluation prompt has been compiled with extracted variables, the system must invoke an LLM to perform the actual evaluation judgment. LLM Completion for Evaluation defines how this invocation is performed across a diverse set of LLM providers through a single unified interface.
The design addresses several challenges:
- Multi-Provider Abstraction -- The evaluation system must work with LLMs from multiple providers (OpenAI, Anthropic, Azure OpenAI, AWS Bedrock, Google Vertex AI, Google AI Studio). Rather than implementing provider-specific calling conventions throughout the evaluation pipeline, the system wraps all providers behind a single function that accepts a common parameter format and delegates to provider-specific LangChain adapters.
- Structured Output Enforcement -- Evaluation requires a deterministic response format containing a numeric score and textual reasoning. The system passes a Zod schema to the LLM call, which leverages the provider's native structured output capabilities (e.g., OpenAI's function calling, Anthropic's tool use) to ensure the response conforms to the expected shape. This eliminates fragile prompt-based output parsing (see the sketch after this list).
- Internal Trace Creation -- Each LLM evaluation call is itself traced within Langfuse, creating an execution trace that allows users to inspect the evaluation's LLM call, see the exact prompt sent, and debug any issues. These internal traces use a special "langfuse-" prefixed environment to distinguish them from user traces and prevent infinite evaluation loops.
- Provider-Specific Message Handling -- Different providers have different requirements for message formatting. Some providers (Anthropic, Vertex AI, Google AI Studio, Bedrock) require at least one user message. The system automatically transforms system-only messages into user messages for these providers.
- Secure Credential Management -- API keys are stored encrypted and decrypted only at the point of use. Optional extra headers and custom base URLs support enterprise configurations and API proxies.
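To make the structured output mechanism concrete, the sketch below passes a Zod schema for the score/reasoning pair to a LangChain chat model via `withStructuredOutput`. The model choice, field descriptions, and environment variable are illustrative assumptions; only the overall pattern mirrors the description above.

```typescript
import { z } from "zod";
import { ChatOpenAI } from "@langchain/openai";
import { HumanMessage, SystemMessage } from "@langchain/core/messages";

// Assumed shape of the judge's answer: a numeric score plus textual reasoning.
const evalOutputSchema = z.object({
  score: z.number().describe("Numeric evaluation score"),
  reasoning: z.string().describe("Why the score was assigned"),
});

async function runEvaluation() {
  // Hypothetical client; in the evaluation pipeline the API key comes from the
  // decrypted LLM connection rather than an environment variable.
  const client = new ChatOpenAI({
    model: "gpt-4o",
    temperature: 0,
    apiKey: process.env.OPENAI_API_KEY,
  });

  // withStructuredOutput leans on the provider's native structured output support
  // (function calling for OpenAI, tool use for Anthropic), so no prompt-based parsing is needed.
  const judge = client.withStructuredOutput(evalOutputSchema);

  const result = await judge.invoke([
    new SystemMessage("You are an evaluation judge. Score the answer from 0 to 1."),
    new HumanMessage("Question: ...\nAnswer to evaluate: ..."),
  ]);

  // result is already parsed: { score: number, reasoning: string }
  return result;
}
```

Because the schema is enforced at the API level, a malformed judge response surfaces as a provider error rather than a downstream parsing failure.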
Usage
Use LLM Completion for Evaluation when:
- You need to understand how evaluation LLM calls are made across different providers
- You are adding support for a new LLM provider to the evaluation system
- You are debugging evaluation failures related to LLM API errors
- You need to understand the structured output schema enforcement mechanism
- You want to understand how evaluation calls are traced internally
Theoretical Basis
The LLM Completion for Evaluation principle implements a provider-adapter pattern with structured output enforcement:
Step 1 - Parameter Preparation:
INPUT:
messages: array of ChatMessage with role and content
modelParams: { adapter/provider, model, temperature, maxTokens, topP }
llmConnection: { encryptedApiKey, extraHeaders, baseURL, config }
structuredOutputSchema: Zod schema for { score: number, reasoning: string }
traceSinkParams: { targetProjectId, traceId, traceName, environment }
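Expressed as TypeScript types, the inputs above might look roughly like this; the field and adapter names mirror the pseudocode and should be read as an illustrative assumption rather than the exact internal signatures.

```typescript
import { z } from "zod";

// Illustrative parameter shapes mirroring Step 1 above.
interface ChatMessage {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

interface ModelParams {
  adapter: "openai" | "anthropic" | "azure" | "bedrock" | "vertex-ai" | "google-ai-studio";
  model: string;
  temperature?: number;
  maxTokens?: number;
  topP?: number;
}

interface LLMConnection {
  encryptedApiKey: string;          // decrypted only at the point of use
  extraHeaders?: string;            // encrypted JSON string of extra headers
  baseURL?: string;                 // custom endpoint / proxy support
  config?: Record<string, unknown>; // provider-specific extras (region, project, ...)
}

interface TraceSinkParams {
  targetProjectId: string;
  traceId: string;
  traceName: string;
  environment: string;              // must start with "langfuse-"
}

// Structured output contract enforced on the judge's response.
const structuredOutputSchema = z.object({
  score: z.number(),
  reasoning: z.string(),
});
```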
Step 2 - Credential Resolution:
apiKey = DECRYPT(llmConnection.encryptedApiKey)
extraHeaders = DECRYPT_AND_PARSE(llmConnection.extraHeaders)
Step 3 - Internal Tracing Setup:
IF traceSinkParams PROVIDED:
VALIDATE environment starts with "langfuse-"
// This is a safety invariant: all internal traces must use
// the langfuse- prefix to prevent infinite eval loops
IF NOT valid prefix:
LOG warning and SKIP trace creation
ELSE:
CREATE internal tracing handler
ADD to callback chain
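A minimal sketch of the prefix invariant, with a declared (hypothetical) handler factory standing in for the real tracing integration:

```typescript
interface TraceSinkParams {
  targetProjectId: string;
  traceId: string;
  traceName: string;
  environment: string;
}

// Stand-in for the real internal tracing handler factory (assumed, not the actual API).
declare function createTracingHandler(params: TraceSinkParams): unknown;

function maybeAttachInternalTracing(
  traceSinkParams: TraceSinkParams | undefined,
  callbacks: unknown[],
): void {
  if (!traceSinkParams) return;

  // Safety invariant: internal evaluation traces must live in a "langfuse-" prefixed
  // environment so they are never picked up for evaluation themselves (infinite loop guard).
  if (!traceSinkParams.environment.startsWith("langfuse-")) {
    console.warn(
      `Skipping internal trace creation: environment "${traceSinkParams.environment}" is missing the langfuse- prefix`,
    );
    return;
  }

  callbacks.push(createTracingHandler(traceSinkParams));
}
```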
Step 4 - Message Transformation:
IF messages has only 1 message AND provider requires user message:
// Anthropic, VertexAI, GoogleAI, Bedrock require user messages
TRANSFORM system message to HumanMessage
ELSE:
MAP each message to LangChain message type:
User -> HumanMessage
System -> SystemMessage (first position) or HumanMessage (other positions)
Assistant -> AIMessage (with optional tool_calls)
ToolResult -> ToolMessage
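A sketch of this mapping with LangChain's message classes; tool-result messages are left out for brevity, and the adapter identifiers are assumptions:

```typescript
import {
  AIMessage,
  HumanMessage,
  SystemMessage,
  type BaseMessage,
} from "@langchain/core/messages";

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Providers that reject conversations without at least one user message (per the comment above).
const REQUIRES_USER_MESSAGE = new Set(["anthropic", "vertex-ai", "google-ai-studio", "bedrock"]);

function toLangChainMessages(messages: ChatMessage[], adapter: string): BaseMessage[] {
  // A lone system prompt is re-cast as a user message for providers that require one.
  if (messages.length === 1 && REQUIRES_USER_MESSAGE.has(adapter)) {
    return [new HumanMessage(messages[0].content)];
  }

  return messages.map((m, i) => {
    switch (m.role) {
      case "user":
        return new HumanMessage(m.content);
      case "system":
        // Only a leading system message stays a SystemMessage; later ones become user turns.
        return i === 0 ? new SystemMessage(m.content) : new HumanMessage(m.content);
      case "assistant":
        return new AIMessage(m.content);
    }
  });
}
```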
Step 5 - Provider-Specific Client Instantiation:
SWITCH modelParams.adapter:
CASE OpenAI:
client = new ChatOpenAI({ model, temperature, apiKey, ... })
CASE Anthropic:
client = new ChatAnthropic({ model, temperature, apiKey, ... })
CASE Azure:
client = new AzureChatOpenAI({ model, temperature, apiKey, baseURL, ... })
CASE Bedrock:
client = new ChatBedrockConverse({ model, region, credentials, ... })
CASE VertexAI:
client = new ChatVertexAI({ model, credentials, ... })
CASE GoogleAIStudio:
client = new ChatGoogleGenerativeAI({ model, apiKey, ... })
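A condensed sketch of the adapter switch using the LangChain chat model classes named above. The options shown are real constructor options, but each provider needs more configuration (deployment name for Azure, service-account credentials for Vertex AI, and so on) than this illustration carries:

```typescript
import { ChatOpenAI, AzureChatOpenAI } from "@langchain/openai";
import { ChatAnthropic } from "@langchain/anthropic";
import { ChatBedrockConverse } from "@langchain/aws";
import { ChatVertexAI } from "@langchain/google-vertexai";
import { ChatGoogleGenerativeAI } from "@langchain/google-genai";

// Illustrative adapter switch; the adapter string values are assumptions.
function createClient(adapter: string, model: string, temperature: number, apiKey: string) {
  switch (adapter) {
    case "openai":
      return new ChatOpenAI({ model, temperature, apiKey });
    case "anthropic":
      return new ChatAnthropic({ model, temperature, apiKey });
    case "azure":
      // Azure additionally needs instance/deployment/version configuration.
      return new AzureChatOpenAI({ model, temperature, azureOpenAIApiKey: apiKey });
    case "bedrock":
      // Credentials and region come from the connection config or AWS SDK defaults.
      return new ChatBedrockConverse({ model, temperature, region: "us-east-1" });
    case "vertex-ai":
      // Authenticates via a service account key (project/location from the connection config).
      return new ChatVertexAI({ model, temperature });
    case "google-ai-studio":
      return new ChatGoogleGenerativeAI({ model, temperature, apiKey });
    default:
      throw new Error(`Unsupported adapter: ${adapter}`);
  }
}
```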
Step 6 - Structured Output Invocation:
IF structuredOutputSchema PROVIDED AND NOT streaming:
structuredClient = client.withStructuredOutput(structuredOutputSchema)
response = structuredClient.invoke(messages, { callbacks })
// Response is already parsed into { score: number, reasoning: string }
RETURN response
Step 7 - Post-Processing:
AWAIT processTracedEvents()
// Ensures internal Langfuse trace data is flushed
RETURN structured response
Provider Support Matrix:
| Provider | Adapter Constant | Structured Output | Special Configuration |
|---|---|---|---|
| OpenAI | LLMAdapter.OpenAI | Yes (function calling) | Custom base URL, extra headers, proxy support |
| Anthropic | LLMAdapter.Anthropic | Yes (tool use) | Custom base URL, extra headers |
| Azure OpenAI | LLMAdapter.Azure | Yes (function calling) | Azure-specific base URL with deployment name |
| AWS Bedrock | LLMAdapter.Bedrock | Yes | Region, credentials, optional default credentials |
| Google Vertex AI | LLMAdapter.VertexAI | Yes | Service account key, project/location config |
| Google AI Studio | LLMAdapter.GoogleAIStudio | Yes | API key based |
Error Propagation:
LLM call failures are wrapped in LLMCompletionError with a retryable flag. HTTP 429 (rate limit) and 5xx (server error) responses are marked as retryable, while other 4xx (client error) responses are marked as non-retryable. This classification drives the retry behavior in the downstream error handling layer.
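A sketch of this classification, assuming a minimal shape for the LLMCompletionError wrapper (the class name comes from the text above; its exact fields are an assumption):

```typescript
// Hypothetical wrapper mirroring the retryable classification described above.
class LLMCompletionError extends Error {
  constructor(
    message: string,
    public readonly retryable: boolean,
  ) {
    super(message);
    this.name = "LLMCompletionError";
  }
}

function wrapLLMError(err: unknown, httpStatus?: number): LLMCompletionError {
  // 429 (rate limited) and 5xx (server errors) are transient and worth retrying;
  // remaining 4xx responses indicate a caller problem and are not retried.
  const retryable =
    httpStatus === 429 || (httpStatus !== undefined && httpStatus >= 500);
  const message = err instanceof Error ? err.message : String(err);
  return new LLMCompletionError(message, retryable);
}
```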