Principle: Langfuse LLM Completion for Evaluation
| Knowledge Sources | |
|---|---|
| Domains | LLM Integration, LLM Evaluation |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
LLM Completion for Evaluation is the principle of invoking a language model through a unified multi-provider interface, with structured output enforcement, to obtain a score and reasoning from an LLM judge in a deterministic response format.
Description
Once an evaluation prompt has been compiled with extracted variables, the system must invoke an LLM to perform the actual evaluation judgment. LLM Completion for Evaluation defines how this invocation is performed across a diverse set of LLM providers through a single unified interface.
The design addresses several challenges:
- Multi-Provider Abstraction -- The evaluation system must work with LLMs from multiple providers (OpenAI, Anthropic, Azure OpenAI, AWS Bedrock, Google Vertex AI, Google AI Studio). Rather than implementing provider-specific calling conventions throughout the evaluation pipeline, the system wraps all providers behind a single function that accepts a common parameter format and delegates to provider-specific LangChain adapters.
- Structured Output Enforcement -- Evaluation requires a deterministic response format containing a numeric score and textual reasoning. The system passes a Zod schema to the LLM call, which leverages the provider's native structured output capabilities (e.g., OpenAI's function calling, Anthropic's tool use) to ensure the response conforms to the expected shape. This eliminates fragile prompt-based output parsing (see the sketch after this list).
- Internal Trace Creation -- Each LLM evaluation call is itself traced within Langfuse, creating an execution trace that allows users to inspect the evaluation's LLM call, see the exact prompt sent, and debug any issues. These internal traces use a special "langfuse-" prefixed environment to distinguish them from user traces and prevent infinite evaluation loops.
- Provider-Specific Message Handling -- Different providers have different requirements for message formatting. Some providers (Anthropic, Vertex AI, Google AI Studio, Bedrock) require at least one user message. The system automatically transforms system-only messages into user messages for these providers.
- Secure Credential Management -- API keys are stored encrypted and decrypted only at the point of use. Optional extra headers and custom base URLs support enterprise configurations and API proxies.
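To make the structured output mechanism concrete, the sketch below passes a Zod schema for the score/reasoning pair to a LangChain chat model via `withStructuredOutput`. The model choice, field descriptions, and environment variable are illustrative assumptions; only the overall pattern mirrors the description above.

```typescript
import { z } from "zod";
import { ChatOpenAI } from "@langchain/openai";
import { HumanMessage, SystemMessage } from "@langchain/core/messages";

// Assumed shape of the judge's answer: a numeric score plus textual reasoning.
const evalOutputSchema = z.object({
  score: z.number().describe("Numeric evaluation score"),
  reasoning: z.string().describe("Why the score was assigned"),
});

async function runEvaluation() {
  // Hypothetical client; in the evaluation pipeline the API key comes from the
  // decrypted LLM connection rather than an environment variable.
  const client = new ChatOpenAI({
    model: "gpt-4o",
    temperature: 0,
    apiKey: process.env.OPENAI_API_KEY,
  });

  // withStructuredOutput leans on the provider's native structured output support
  // (function calling for OpenAI, tool use for Anthropic), so no prompt-based parsing is needed.
  const judge = client.withStructuredOutput(evalOutputSchema);

  const result = await judge.invoke([
    new SystemMessage("You are an evaluation judge. Score the answer from 0 to 1."),
    new HumanMessage("Question: ...\nAnswer to evaluate: ..."),
  ]);

  // result is already parsed: { score: number, reasoning: string }
  return result;
}
```

Because the schema is enforced at the API level, a malformed judge response surfaces as a provider error rather than a downstream parsing failure.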
Usage
Use LLM Completion for Evaluation when:
- You need to understand how evaluation LLM calls are made across different providers
- You are adding support for a new LLM provider to the evaluation system
- You are debugging evaluation failures related to LLM API errors
- You need to understand the structured output schema enforcement mechanism
- You want to understand how evaluation calls are traced internally
Theoretical Basis
The LLM Completion for Evaluation principle implements a provider-adapter pattern with structured output enforcement:
Step 1 - Parameter Preparation:
INPUT:
messages: array of ChatMessage with role and content
modelParams: { adapter/provider, model, temperature, maxTokens, topP }
llmConnection: { encryptedApiKey, extraHeaders, baseURL, config }
structuredOutputSchema: Zod schema for { score: number, reasoning: string }
traceSinkParams: { targetProjectId, traceId, traceName, environment }
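Expressed as TypeScript types, the inputs above might look roughly like this; the field and adapter names mirror the pseudocode and should be read as an illustrative assumption rather than the exact internal signatures.

```typescript
import { z } from "zod";

// Illustrative parameter shapes mirroring Step 1 above.
interface ChatMessage {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

interface ModelParams {
  adapter: "openai" | "anthropic" | "azure" | "bedrock" | "vertex-ai" | "google-ai-studio";
  model: string;
  temperature?: number;
  maxTokens?: number;
  topP?: number;
}

interface LLMConnection {
  encryptedApiKey: string;          // decrypted only at the point of use
  extraHeaders?: string;            // encrypted JSON string of extra headers
  baseURL?: string;                 // custom endpoint / proxy support
  config?: Record<string, unknown>; // provider-specific extras (region, project, ...)
}

interface TraceSinkParams {
  targetProjectId: string;
  traceId: string;
  traceName: string;
  environment: string;              // must start with "langfuse-"
}

// Structured output contract enforced on the judge's response.
const structuredOutputSchema = z.object({
  score: z.number(),
  reasoning: z.string(),
});
```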
Step 2 - Credential Resolution:
apiKey = DECRYPT(llmConnection.encryptedApiKey)
extraHeaders = DECRYPT_AND_PARSE(llmConnection.extraHeaders)
Step 3 - Internal Tracing Setup:
IF traceSinkParams PROVIDED:
VALIDATE environment starts with "langfuse-"
// This is a safety invariant: all internal traces must use
// the langfuse- prefix to prevent infinite eval loops
IF NOT valid prefix:
LOG warning and SKIP trace creation
ELSE:
CREATE internal tracing handler
ADD to callback chain
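A minimal sketch of the prefix invariant, with a declared (hypothetical) handler factory standing in for the real tracing integration:

```typescript
interface TraceSinkParams {
  targetProjectId: string;
  traceId: string;
  traceName: string;
  environment: string;
}

// Stand-in for the real internal tracing handler factory (assumed, not the actual API).
declare function createTracingHandler(params: TraceSinkParams): unknown;

function maybeAttachInternalTracing(
  traceSinkParams: TraceSinkParams | undefined,
  callbacks: unknown[],
): void {
  if (!traceSinkParams) return;

  // Safety invariant: internal evaluation traces must live in a "langfuse-" prefixed
  // environment so they are never picked up for evaluation themselves (infinite loop guard).
  if (!traceSinkParams.environment.startsWith("langfuse-")) {
    console.warn(
      `Skipping internal trace creation: environment "${traceSinkParams.environment}" is missing the langfuse- prefix`,
    );
    return;
  }

  callbacks.push(createTracingHandler(traceSinkParams));
}
```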
Step 4 - Message Transformation:
IF messages has only 1 message AND provider requires user message:
// Anthropic, VertexAI, GoogleAI, Bedrock require user messages
TRANSFORM system message to HumanMessage
ELSE:
MAP each message to LangChain message type:
User -> HumanMessage
System -> SystemMessage (first position) or HumanMessage (other positions)
Assistant -> AIMessage (with optional tool_calls)
ToolResult -> ToolMessage
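A sketch of this mapping with LangChain's message classes; tool-result messages are left out for brevity, and the adapter identifiers are assumptions:

```typescript
import {
  AIMessage,
  HumanMessage,
  SystemMessage,
  type BaseMessage,
} from "@langchain/core/messages";

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Providers that reject conversations without at least one user message (per the comment above).
const REQUIRES_USER_MESSAGE = new Set(["anthropic", "vertex-ai", "google-ai-studio", "bedrock"]);

function toLangChainMessages(messages: ChatMessage[], adapter: string): BaseMessage[] {
  // A lone system prompt is re-cast as a user message for providers that require one.
  if (messages.length === 1 && REQUIRES_USER_MESSAGE.has(adapter)) {
    return [new HumanMessage(messages[0].content)];
  }

  return messages.map((m, i) => {
    switch (m.role) {
      case "user":
        return new HumanMessage(m.content);
      case "system":
        // Only a leading system message stays a SystemMessage; later ones become user turns.
        return i === 0 ? new SystemMessage(m.content) : new HumanMessage(m.content);
      case "assistant":
        return new AIMessage(m.content);
    }
  });
}
```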
Step 5 - Provider-Specific Client Instantiation:
SWITCH modelParams.adapter:
CASE OpenAI:
client = new ChatOpenAI({ model, temperature, apiKey, ... })
CASE Anthropic:
client = new ChatAnthropic({ model, temperature, apiKey, ... })
CASE Azure:
client = new AzureChatOpenAI({ model, temperature, apiKey, baseURL, ... })
CASE Bedrock:
client = new ChatBedrockConverse({ model, region, credentials, ... })
CASE VertexAI:
client = new ChatVertexAI({ model, credentials, ... })
CASE GoogleAIStudio:
client = new ChatGoogleGenerativeAI({ model, apiKey, ... })
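A condensed sketch of the adapter switch using the LangChain chat model classes named above. The options shown are real constructor options, but each provider needs more configuration (deployment name for Azure, service-account credentials for Vertex AI, and so on) than this illustration carries:

```typescript
import { ChatOpenAI, AzureChatOpenAI } from "@langchain/openai";
import { ChatAnthropic } from "@langchain/anthropic";
import { ChatBedrockConverse } from "@langchain/aws";
import { ChatVertexAI } from "@langchain/google-vertexai";
import { ChatGoogleGenerativeAI } from "@langchain/google-genai";

// Illustrative adapter switch; the adapter string values are assumptions.
function createClient(adapter: string, model: string, temperature: number, apiKey: string) {
  switch (adapter) {
    case "openai":
      return new ChatOpenAI({ model, temperature, apiKey });
    case "anthropic":
      return new ChatAnthropic({ model, temperature, apiKey });
    case "azure":
      // Azure additionally needs instance/deployment/version configuration.
      return new AzureChatOpenAI({ model, temperature, azureOpenAIApiKey: apiKey });
    case "bedrock":
      // Credentials and region come from the connection config or AWS SDK defaults.
      return new ChatBedrockConverse({ model, temperature, region: "us-east-1" });
    case "vertex-ai":
      // Authenticates via a service account key (project/location from the connection config).
      return new ChatVertexAI({ model, temperature });
    case "google-ai-studio":
      return new ChatGoogleGenerativeAI({ model, temperature, apiKey });
    default:
      throw new Error(`Unsupported adapter: ${adapter}`);
  }
}
```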
Step 6 - Structured Output Invocation:
IF structuredOutputSchema PROVIDED AND NOT streaming:
structuredClient = client.withStructuredOutput(structuredOutputSchema)
response = structuredClient.invoke(messages, { callbacks })
// Response is already parsed into { score: number, reasoning: string }
RETURN response
Step 7 - Post-Processing:
AWAIT processTracedEvents()
// Ensures internal Langfuse trace data is flushed
RETURN structured response
Provider Support Matrix:
| Provider | Adapter Constant | Structured Output | Special Configuration |
|---|---|---|---|
| OpenAI | LLMAdapter.OpenAI | Yes (function calling) | Custom base URL, extra headers, proxy support |
| Anthropic | LLMAdapter.Anthropic | Yes (tool use) | Custom base URL, extra headers |
| Azure OpenAI | LLMAdapter.Azure | Yes (function calling) | Azure-specific base URL with deployment name |
| AWS Bedrock | LLMAdapter.Bedrock | Yes | Region, credentials, optional default credentials |
| Google Vertex AI | LLMAdapter.VertexAI | Yes | Service account key, project/location config |
| Google AI Studio | LLMAdapter.GoogleAIStudio | Yes | API key based |
Error Propagation:
LLM call failures are wrapped in LLMCompletionError with a retryable flag. HTTP 429 (rate limit) and 5xx (server error) responses are marked as retryable, while other 4xx (client error) responses are marked as non-retryable. This classification drives the retry behavior in the downstream error handling layer.
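A sketch of this classification, assuming a minimal shape for the LLMCompletionError wrapper (the class name comes from the text above; its exact fields are an assumption):

```typescript
// Hypothetical wrapper mirroring the retryable classification described above.
class LLMCompletionError extends Error {
  constructor(
    message: string,
    public readonly retryable: boolean,
  ) {
    super(message);
    this.name = "LLMCompletionError";
  }
}

function wrapLLMError(err: unknown, httpStatus?: number): LLMCompletionError {
  // 429 (rate limited) and 5xx (server errors) are transient and worth retrying;
  // remaining 4xx responses indicate a caller problem and are not retried.
  const retryable =
    httpStatus === 429 || (httpStatus !== undefined && httpStatus >= 500);
  const message = err instanceof Error ? err.message : String(err);
  return new LLMCompletionError(message, retryable);
}
```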