

Principle:Langfuse LLM Execution for Experiments

From Leeroopedia
Domains LLM Experimentation, LLM Integration
Last Updated 2026-02-14 00:00 GMT

Overview

LLM execution for experiments is the process of taking a single dataset item, substituting its values into a prompt template, calling an LLM provider, and capturing the full generation lifecycle (input, output, usage, cost) as an internally traced observation.

Description

At the heart of every experiment is the LLM call: given a prompt template with variables and a dataset item with values, produce an LLM response and record the entire interaction as a first-class trace in the observability platform. This process must handle several concerns simultaneously:

  • Variable substitution: The prompt may contain template variables ({{question}}) that need to be replaced with values from the dataset item's input, and message placeholders that accept arrays of chat messages for insertion into chat-type prompts.
  • Provider abstraction: The LLM call must work across multiple providers (OpenAI, Anthropic, etc.) using a unified interface, with provider-specific adapter logic handled transparently.
  • Trace creation: Every LLM call must produce a trace and a generation observation in Langfuse's own tracing system, enabling the user to inspect the exact input, output, latency, token usage, and cost for each dataset item.
  • Structured output: If a structured output schema is configured, it must be passed to the LLM provider to constrain the response format.
  • Error isolation: A failed LLM call for one dataset item must not prevent other items from being processed. Errors are caught, and the function still returns successfully.
  • Generation detail extraction: After the LLM call completes, the function must extract generation details (observation ID, input, output, metadata) for use by downstream evaluation scheduling.
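The error-isolation concern above can be sketched as follows. This is a minimal illustration with invented names (`runAllItems`, `callLLM`), not Langfuse's actual API: one item's failure is caught and recorded, and the remaining items are still processed.

```typescript
type ItemResult = { item: string; output?: string; error?: string };

// One failing LLM call must not abort the whole experiment run:
// catch the error, record it, and move on to the next item.
function runAllItems(
  items: string[],
  callLLM: (item: string) => string,
): ItemResult[] {
  const results: ItemResult[] = [];
  for (const item of items) {
    try {
      results.push({ item, output: callLLM(item) });
    } catch (err) {
      // Swallow the error so later items are unaffected.
      results.push({ item, error: String(err) });
    }
  }
  return results;
}
```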

Usage

LLM execution for experiments is used when:

  • A dataset item is being processed as part of an experiment run.
  • The system needs to produce a traced LLM generation for a specific prompt-input combination.
  • Downstream evaluation scheduling requires access to the generation's observation ID, input, and output.

Theoretical Basis

The LLM execution process for experiments follows a template-substitute-call-trace pattern:

Step 1 -- Trace Identity Generation

A deterministic trace ID is generated from the run ID and dataset item ID using a W3C-compatible trace ID function. This ensures that retries for the same item produce the same trace ID, preventing duplicate traces. A separate random UUID is generated for the run item ID.
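One way to sketch such a deterministic trace ID, assuming a hash-based derivation (the function name and the SHA-256 choice are assumptions, not necessarily Langfuse's implementation): hash the run ID and dataset item ID together and keep 16 bytes as lowercase hex, the W3C trace-id format.

```typescript
import { createHash } from "node:crypto";

// Derive a stable 32-hex-char (16-byte) W3C-compatible trace ID from the
// run ID and dataset item ID. The same inputs always yield the same ID,
// so a retried item reuses its original trace instead of creating a duplicate.
function deterministicTraceId(runId: string, datasetItemId: string): string {
  return createHash("sha256")
    .update(`${runId}:${datasetItemId}`)
    .digest("hex")
    .slice(0, 32); // W3C trace-id: 16 bytes, lowercase hex
}
```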

Step 2 -- Run Item Registration

Before the LLM call, a dataset run item ingestion event is created and processed. This registers the association between the trace and the dataset item, even if the LLM call subsequently fails.
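A hypothetical shape for this ingestion event (the field and type names are illustrative assumptions): the key point is that the trace-to-dataset-item association is built before the LLM call, so it persists even if the call fails.

```typescript
// Illustrative run-item ingestion event linking a trace to a dataset item.
function buildRunItemEvent(args: {
  runItemId: string;
  traceId: string;
  datasetItemId: string;
  runId: string;
}) {
  return {
    type: "dataset-run-item-create",
    id: args.runItemId,
    body: {
      traceId: args.traceId,
      datasetItemId: args.datasetItemId,
      runId: args.runId,
    },
  };
}
```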

Step 3 -- Variable Substitution

The prompt template is processed to replace variables with dataset item values:

  • For text prompts, the content is wrapped in a single system message. Template variables ({{var}}) are replaced using a template compiler.
  • For chat prompts, each message's content is individually processed. Additionally, message placeholders are compiled by inserting arrays of messages at designated placeholder positions.
  • Only variables that are in the template variable set (and not in the placeholder set) are used for string interpolation.
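The substitution step can be sketched like this. It is a simplified illustration (the function and type names are assumptions): `{{var}}` variables are interpolated per message, and placeholder entries are expanded into whole message arrays.

```typescript
type ChatMessage = { role: string; content: string };

// Compile a chat prompt template: interpolate {{var}} variables into each
// message, and splice message arrays in at placeholder positions.
function compileMessages(
  template: Array<ChatMessage | { placeholderName: string }>,
  variables: Record<string, string>,
  placeholders: Record<string, ChatMessage[]>,
): ChatMessage[] {
  const out: ChatMessage[] = [];
  for (const entry of template) {
    if ("placeholderName" in entry) {
      // Message placeholder: insert a whole array of chat messages.
      out.push(...(placeholders[entry.placeholderName] ?? []));
    } else {
      // Regular message: replace {{var}} with the dataset item's value;
      // unknown variables are left untouched.
      const content = entry.content.replace(
        /\{\{(\w+)\}\}/g,
        (match: string, name: string) => variables[name] ?? match,
      );
      out.push({ role: entry.role, content });
    }
  }
  return out;
}
```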

Step 4 -- LLM Call

The assembled messages are sent to the LLM provider via a unified completion function with the following properties:

  • Streaming is disabled (streaming: false) because the experiment needs the complete response in one shot.
  • Retry count is set to 1 (single retry) because the outer processing loop already handles item-level retries.
  • Model parameters (temperature, max tokens, top-p, etc.) are passed through from the experiment configuration.
  • A trace sink is configured that writes the trace, generation observation, and all associated metadata to the user's project within the langfuse-prompt-experiment environment.
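The fixed-versus-configurable split above can be sketched as follows (the function and parameter names are assumptions, not the actual unified completion interface): streaming is always off and the retry count is pinned, while model parameters flow through from the experiment configuration.

```typescript
type ModelParams = {
  provider: string; // e.g. "openai" or "anthropic"
  model: string;
  temperature?: number;
  max_tokens?: number;
  top_p?: number;
};

// Assemble arguments for a unified completion call: streaming is disabled
// and retries are pinned to 1 regardless of the experiment configuration;
// only the model parameters are passed through.
function buildCompletionArgs(modelParams: ModelParams) {
  return {
    modelParams,
    streaming: false as const, // experiments need the full response in one shot
    maxRetries: 1,             // the outer loop owns item-level retries
  };
}
```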

Step 5 -- Generation Detail Capture

A callback function (onGenerationComplete) attached to the trace sink captures the generation details (observation ID, input, output, metadata) when the traced generation event is processed. These details are returned to the caller for use in evaluation scheduling.
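This capture pattern can be sketched as a closure over the trace sink; only the `environment` value and the `onGenerationComplete` callback name come from the text above, while the other field names are assumptions.

```typescript
type GenerationDetails = {
  observationId: string;
  input: unknown;
  output: unknown;
  metadata?: Record<string, unknown>;
};

// Build a trace sink whose onGenerationComplete callback captures the
// generation details in a closure, so the caller can read them after the
// traced generation event has been processed.
function makeTraceSink(traceId: string, projectId: string) {
  let captured: GenerationDetails | null = null;
  const sink = {
    environment: "langfuse-prompt-experiment",
    traceId,
    targetProject: projectId,
    onGenerationComplete: (details: GenerationDetails) => {
      captured = details;
    },
  };
  return { sink, getDetails: () => captured };
}
```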

FUNCTION processItem(projectId, datasetItem, config):
    traceId = deterministicTraceId(config.runId, datasetItem.id)
    runItemId = randomUUID()

    registerRunItem(runItemId, traceId, datasetItem, config)

    messages = substituteVariables(config.prompt, datasetItem.input, config.variables)

    generationDetails = null
    traceSink = {
        environment: "langfuse-prompt-experiment",
        traceId: traceId,
        targetProject: projectId,
        onGenerationComplete: (details) => generationDetails = details,
    }

    TRY:
        callLLM(messages, config.modelParams, traceSink, streaming=false)
    CATCH:
        // Swallow error, do not retry at this level

    IF generationDetails:
        scheduleEvals(generationDetails)

    enqueueDelayedUpsert(datasetItem, traceId)
    RETURN success

Related Pages

Implemented By
