
Principle:Langfuse LLM Completion for Evaluation

From Leeroopedia
Knowledge Sources
Domains LLM Integration, LLM Evaluation
Last Updated 2026-02-14 00:00 GMT

Overview

LLM Completion for Evaluation is the principle of invoking a language model through a unified multi-provider interface, with structured output enforcement, to obtain a deterministic score-and-reasoning response from an LLM judge.

Description

Once an evaluation prompt has been compiled with extracted variables, the system must invoke an LLM to perform the actual evaluation judgment. LLM Completion for Evaluation defines how this invocation is performed across a diverse set of LLM providers through a single unified interface.

The design addresses several challenges:

  1. Multi-Provider Abstraction -- The evaluation system must work with LLMs from multiple providers (OpenAI, Anthropic, Azure OpenAI, AWS Bedrock, Google Vertex AI, Google AI Studio). Rather than implementing provider-specific calling conventions throughout the evaluation pipeline, the system wraps all providers behind a single function that accepts a common parameter format and delegates to provider-specific LangChain adapters.
  2. Structured Output Enforcement -- Evaluation requires a deterministic response format containing a numeric score and textual reasoning. The system passes a Zod schema to the LLM call, which leverages the provider's native structured output capabilities (e.g., OpenAI's function calling, Anthropic's tool use) to ensure the response conforms to the expected shape. This eliminates fragile prompt-based output parsing.
  3. Internal Trace Creation -- Each LLM evaluation call is itself traced within Langfuse, creating an execution trace that allows users to inspect the evaluation's LLM call, see the exact prompt sent, and debug any issues. These internal traces use a special "langfuse-" prefixed environment to distinguish them from user traces and prevent infinite evaluation loops.
  4. Provider-Specific Message Handling -- Different providers have different requirements for message formatting. Some providers (Anthropic, Vertex AI, Google AI Studio, Bedrock) require at least one user message. The system automatically transforms system-only messages into user messages for these providers.
  5. Secure Credential Management -- API keys are stored encrypted and decrypted only at the point of use. Optional extra headers and custom base URLs support enterprise configurations and API proxies.
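The article states that the judge's response must conform to a `{ score, reasoning }` shape enforced via a Zod schema. As a minimal sketch, the equivalent shape and a manual validity check can be written in plain TypeScript (the interface and guard names here are illustrative, not Langfuse's actual identifiers):

```typescript
// Illustrative sketch of the evaluator's expected response shape.
// In the real system a Zod schema enforces this via the provider's
// native structured output support; this plain-TypeScript type guard
// shows the same contract.
interface EvalResponse {
  score: number;      // numeric judgment from the LLM judge
  reasoning: string;  // textual justification for the score
}

function isEvalResponse(value: unknown): value is EvalResponse {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v.score === "number" && typeof v.reasoning === "string";
}
```

Because the schema is enforced at the API level rather than by parsing free text, a response that fails this shape indicates a provider-side error rather than a formatting slip by the model.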

Usage

Use LLM Completion for Evaluation when:

  • You need to understand how evaluation LLM calls are made across different providers
  • You are adding support for a new LLM provider to the evaluation system
  • You are debugging evaluation failures related to LLM API errors
  • You need to understand the structured output schema enforcement mechanism
  • You want to understand how evaluation calls are traced internally

Theoretical Basis

The LLM Completion for Evaluation principle implements a provider-adapter pattern with structured output enforcement:

Step 1 - Parameter Preparation:

INPUT:
  messages: array of ChatMessage with role and content
  modelParams: { adapter/provider, model, temperature, maxTokens, topP }
  llmConnection: { encryptedApiKey, extraHeaders, baseURL, config }
  structuredOutputSchema: Zod schema for { score: number, reasoning: string }
  traceSinkParams: { targetProjectId, traceId, traceName, environment }

Step 2 - Credential Resolution:

apiKey = DECRYPT(llmConnection.encryptedApiKey)
extraHeaders = DECRYPT_AND_PARSE(llmConnection.extraHeaders)

Step 3 - Internal Tracing Setup:

IF traceSinkParams PROVIDED:
  VALIDATE environment starts with "langfuse-"
  // This is a safety invariant: all internal traces must use
  // the langfuse- prefix to prevent infinite eval loops
  IF NOT valid prefix:
    LOG warning and SKIP trace creation
  ELSE:
    CREATE internal tracing handler
    ADD to callback chain
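The safety invariant in Step 3 can be sketched as a single predicate (function and constant names are illustrative, not the actual Langfuse internals):

```typescript
// Internal eval traces must live in a "langfuse-" prefixed environment.
// Without this guard, an evaluator's own traces could be picked up for
// evaluation, triggering an infinite loop of eval-on-eval calls.
const INTERNAL_ENV_PREFIX = "langfuse-";

function shouldCreateInternalTrace(environment: string): boolean {
  return environment.startsWith(INTERNAL_ENV_PREFIX);
}
```

When the predicate fails, the system logs a warning and skips trace creation rather than raising, so a misconfigured environment degrades observability but never blocks the evaluation itself.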

Step 4 - Message Transformation:

IF messages has only 1 message AND it is a System message AND provider requires user message:
  // Anthropic, VertexAI, GoogleAI, Bedrock require user messages
  TRANSFORM system message to HumanMessage

ELSE:
  MAP each message to LangChain message type:
    User     -> HumanMessage
    System   -> SystemMessage (first position) or HumanMessage (other positions)
    Assistant -> AIMessage (with optional tool_calls)
    ToolResult -> ToolMessage
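Step 4 can be sketched as a pure function. The real code maps to LangChain message classes (HumanMessage, SystemMessage, AIMessage, ToolMessage); here plain role-tagged objects stand in, and the provider identifiers are illustrative:

```typescript
// Sketch of the message transformation in Step 4.
type Role = "user" | "system" | "assistant" | "tool";
interface ChatMessage { role: Role; content: string }

// Providers that require at least one user message (per the article).
const REQUIRES_USER_MESSAGE = new Set([
  "anthropic", "vertex-ai", "google-ai-studio", "bedrock",
]);

function transformMessages(messages: ChatMessage[], provider: string): ChatMessage[] {
  // A system-only prompt is rewritten as a user message for providers
  // that reject conversations without one.
  if (
    messages.length === 1 &&
    messages[0].role === "system" &&
    REQUIRES_USER_MESSAGE.has(provider)
  ) {
    return [{ role: "user", content: messages[0].content }];
  }
  // Otherwise keep roles, demoting system messages that are not in the
  // first position to user messages.
  return messages.map((m, i) =>
    m.role === "system" && i > 0 ? { role: "user", content: m.content } : m
  );
}
```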

Step 5 - Provider-Specific Client Instantiation:

SWITCH modelParams.adapter:
  CASE OpenAI:
    client = new ChatOpenAI({ model, temperature, apiKey, ... })
  CASE Anthropic:
    client = new ChatAnthropic({ model, temperature, apiKey, ... })
  CASE Azure:
    client = new AzureChatOpenAI({ model, temperature, apiKey, baseURL, ... })
  CASE Bedrock:
    client = new ChatBedrockConverse({ model, region, credentials, ... })
  CASE VertexAI:
    client = new ChatVertexAI({ model, credentials, ... })
  CASE GoogleAIStudio:
    client = new ChatGoogleGenerativeAI({ model, apiKey, ... })
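The switch in Step 5 amounts to a factory that pairs each adapter with a LangChain chat-model class and the credentials it needs. A hypothetical sketch of that mapping, without pulling in the LangChain packages themselves (the `ClientSpec` shape is an assumption for illustration; the class names match those in the pseudocode above):

```typescript
// Hypothetical factory sketch of Step 5: adapter -> client class and
// required configuration. Field names are illustrative.
type Adapter =
  | "openai" | "anthropic" | "azure"
  | "bedrock" | "vertex-ai" | "google-ai-studio";

interface ClientSpec {
  className: string;   // LangChain chat-model class to instantiate
  requires: string[];  // configuration the adapter cannot run without
}

function clientSpecFor(adapter: Adapter): ClientSpec {
  switch (adapter) {
    case "openai":
      return { className: "ChatOpenAI", requires: ["apiKey"] };
    case "anthropic":
      return { className: "ChatAnthropic", requires: ["apiKey"] };
    case "azure":
      return { className: "AzureChatOpenAI", requires: ["apiKey", "baseURL"] };
    case "bedrock":
      return { className: "ChatBedrockConverse", requires: ["region", "credentials"] };
    case "vertex-ai":
      return { className: "ChatVertexAI", requires: ["credentials"] };
    case "google-ai-studio":
      return { className: "ChatGoogleGenerativeAI", requires: ["apiKey"] };
  }
}
```

Keeping the provider-specific details behind one exhaustive switch means the rest of the evaluation pipeline never branches on the provider.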

Step 6 - Structured Output Invocation:

IF structuredOutputSchema PROVIDED AND NOT streaming:
  structuredClient = client.withStructuredOutput(structuredOutputSchema)
  response = structuredClient.invoke(messages, { callbacks })
  // Response is already parsed into { score: number, reasoning: string }
  RETURN response

Step 7 - Post-Processing:

AWAIT processTracedEvents()
// Ensures internal Langfuse trace data is flushed
RETURN structured response

Provider Support Matrix:

Provider         | Adapter Constant          | Structured Output      | Special Configuration
OpenAI           | LLMAdapter.OpenAI         | Yes (function calling) | Custom base URL, extra headers, proxy support
Anthropic        | LLMAdapter.Anthropic      | Yes (tool use)         | Custom base URL, extra headers
Azure OpenAI     | LLMAdapter.Azure          | Yes (function calling) | Azure-specific base URL with deployment name
AWS Bedrock      | LLMAdapter.Bedrock        | Yes                    | Region, credentials, optional default credentials
Google Vertex AI | LLMAdapter.VertexAI       | Yes                    | Service account key, project/location config
Google AI Studio | LLMAdapter.GoogleAIStudio | Yes                    | API key based

Error Propagation:

LLM call failures are wrapped in LLMCompletionError with a retryable flag. HTTP 429 (rate limit) and 5xx (server error) responses are marked as retryable, while 4xx (client error) responses are marked as non-retryable. This classification drives the retry behavior in the downstream error handling layer.
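The retryable classification described above can be sketched as follows (the class name mirrors the article's `LLMCompletionError`; the constructor shape and helper name are assumptions):

```typescript
// Sketch of error wrapping with a retryable flag, per the
// classification in the article.
class LLMCompletionError extends Error {
  constructor(message: string, public readonly retryable: boolean) {
    super(message);
    this.name = "LLMCompletionError";
  }
}

function wrapHttpError(status: number, message: string): LLMCompletionError {
  // 429 (rate limit) and 5xx (server errors) are transient, so a retry
  // may succeed; other 4xx client errors will fail identically on retry.
  const retryable = status === 429 || status >= 500;
  return new LLMCompletionError(message, retryable);
}
```

Classifying at the point of failure keeps the downstream retry layer policy-free: it only inspects the flag, never the raw status code.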

Related Pages

Implemented By
