

Principle:Langfuse Langfuse ChatML Normalization

From Leeroopedia
Domains: Data Normalization, AI Frameworks, LLM Messaging
Last Updated: 2026-02-14 00:00 GMT

Overview

ChatML Normalization is the principle of converting diverse LLM provider message formats into a unified ChatML (Chat Markup Language) representation through an adapter registry that auto-detects the source format and applies provider-specific preprocessing.

Description

After raw input/output data has been extracted from OTel span attributes (via the Input/Output Extraction stage), the data exists in whatever format the originating LLM provider or framework uses. Different providers encode messages differently:

  • OpenAI uses { role, content } with optional tool_calls arrays.
  • Vercel AI SDK (v5) wraps messages in its own format with metadata keys.
  • Gemini / VertexAI uses { parts: [...] } with structured content parts.
  • LangGraph / LangChain nests messages under a messages key with class-based type indicators.
  • Microsoft Agent Framework and Semantic Kernel use their own message schemas.
  • Pydantic AI uses yet another format with framework-specific conventions.

The ChatML Normalization principle resolves this heterogeneity with an ordered adapter registry in which:

  1. Each adapter declares a detect function that inspects context (metadata, scope name, data shape) to determine if the data originates from its provider.
  2. Each adapter provides a preprocess function that transforms the provider-specific format into the standard ChatML array format before validation.
  3. Adapters are evaluated in a fixed priority order, with the first matching adapter winning.
  4. An optional framework override in the context bypasses detection and uses the specified adapter directly.
  5. A generic adapter serves as the fallback when no provider-specific adapter matches.
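
The registry rules above can be sketched in a few lines of TypeScript. This is a minimal illustration, not the actual Langfuse code: the `Adapter` and `NormalizerContext` types, the two toy adapters, and the `example.scope` prefix are all hypothetical, and falling back to detection when the framework override names an unknown adapter is an assumption.

```typescript
// Hypothetical types for illustration; not the real Langfuse definitions.
type Direction = "input" | "output";

interface NormalizerContext {
  framework?: string;                 // optional explicit override
  metadata?: Record<string, unknown>; // observation/trace metadata
  data?: unknown;                     // raw input or output
  scopeName?: string;                 // instrumentation scope name
}

interface Adapter {
  id: string;
  detect(ctx: NormalizerContext): boolean;
  preprocess(data: unknown, direction: Direction, ctx: NormalizerContext): unknown;
}

const isRecord = (v: unknown): v is Record<string, unknown> =>
  typeof v === "object" && v !== null && !Array.isArray(v);

// Toy adapter: detects by data shape and unwraps { messages: [...] }.
const wrappedAdapter: Adapter = {
  id: "wrapped",
  detect: (ctx) => {
    const data = ctx.data;
    return isRecord(data) && Array.isArray(data["messages"]);
  },
  preprocess: (data) => (isRecord(data) ? data["messages"] : data),
};

// Toy adapter: detects by instrumentation scope name prefix (made-up prefix).
const scopedAdapter: Adapter = {
  id: "scoped",
  detect: (ctx) => (ctx.scopeName ?? "").startsWith("example.scope"),
  preprocess: (data) => data,
};

// Fallback: always matches and leaves the data untouched.
const genericAdapter: Adapter = {
  id: "generic",
  detect: () => true,
  preprocess: (data) => data,
};

// Order matters: the first adapter whose detect() returns true wins.
const adapters: Adapter[] = [wrappedAdapter, scopedAdapter, genericAdapter];

function selectAdapter(ctx: NormalizerContext): Adapter {
  if (ctx.framework) {
    const explicit = adapters.find((a) => a.id === ctx.framework);
    if (explicit) return explicit; // override bypasses detection
  }
  return adapters.find((a) => a.detect(ctx)) ?? genericAdapter;
}
```

Because selection is a single ordered scan, supporting another provider in this sketch means adding one adapter object at the right position in the array, without changing `selectAdapter`.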

The normalized output is validated against a Zod schema (ChatMlArraySchema) ensuring structural consistency. The schema defines a ChatMlMessage type with role, content, and optional provider-specific fields.
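
To make the validation step concrete, here is a hand-written stand-in for the kind of check the Zod schema performs. It is only a sketch: the real `ChatMlArraySchema` is defined with Zod and its exact rules are not reproduced here. The names `parseChatMlArraySketch` and `ParseResult` are hypothetical, and requiring a string `role` plus a `content` key is an assumption based on the description above.

```typescript
// Assumed message shape: role, content, and any provider-specific extras.
interface ChatMlMessage {
  role: string;
  content: unknown;
  [extra: string]: unknown;
}

// Result object loosely modelled on a safeParse-style API (hypothetical type).
type ParseResult =
  | { success: true; data: ChatMlMessage[] }
  | { success: false; error: string };

function isChatMlMessage(value: unknown): value is ChatMlMessage {
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return false;
  }
  const record = value as Record<string, unknown>;
  // Assumption: a message needs a string role and a content key.
  return typeof record["role"] === "string" && "content" in record;
}

function parseChatMlArraySketch(value: unknown): ParseResult {
  if (!Array.isArray(value)) {
    return { success: false, error: "expected an array of messages" };
  }
  const badIndex = value.findIndex((m) => !isChatMlMessage(m));
  if (badIndex !== -1) {
    return { success: false, error: `message at index ${badIndex} is not ChatML-shaped` };
  }
  return { success: true, data: value as ChatMlMessage[] };
}
```

Returning a result object instead of throwing lets the caller try alternative shapes of the same payload before giving up.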

Usage

Apply this principle when:

  • Displaying LLM conversations in a unified UI regardless of which provider generated them.
  • Running evaluations or analytics that operate on a normalized message format.
  • Combining input and output messages into a single conversation thread for visualization.
  • Supporting new LLM providers by adding an adapter without modifying existing normalization logic.

Theoretical Basis

The normalization system follows the Adapter Pattern with ordered detection:

NORMALIZE INPUT or OUTPUT
    |
    v
BUILD CONTEXT
    ctx = {
        framework?: string,     // Optional explicit override
        metadata: object,       // Metadata from observation/trace
        data: unknown,          // The raw input or output data
        scopeName?: string,     // Instrumentation scope name
        ...other context
    }
    |
    v
SELECT ADAPTER (ordered evaluation):
    1. If ctx.framework is set -> find adapter by ID, use if found
    2. Otherwise, evaluate adapters in order:
       a. langgraph   -- Detects LangGraph/LangChain message structures
       b. aisdk       -- Detects Vercel AI SDK v5 format
       c. openai      -- Detects OpenAI Chat Completions and Responses API format
       d. gemini      -- Detects Gemini/VertexAI message format
       e. microsoftAgent -- Detects Microsoft Agent Framework format
       f. pydanticAI  -- Detects Pydantic AI framework format
       g. semanticKernel -- Detects Microsoft Semantic Kernel (by scope.name prefix)
       h. generic     -- Always matches (fallback)
    3. First adapter where detect(ctx) returns true wins
    |
    v
PREPROCESS
    transformedData = adapter.preprocess(rawData, direction, ctx)
    - direction: "input" or "output"
    - Adapter applies provider-specific transformations:
      * Unwrap nested message structures
      * Map provider-specific role names to standard roles
      * Restructure content parts into ChatML format
      * Handle tool call formatting differences
    |
    v
VALIDATE (for input):
    result = ChatMlArraySchema.safeParse(transformedData)
    - Also tries: unwrapping [[messages]] -> [messages]
    - Also tries: extracting { messages: [...] } -> [...]
    |
    v
VALIDATE (for output):
    result = ChatMlArraySchema.safeParse(
        Array.isArray(transformedData) ? transformedData : [transformedData]
    )
    - Also handles: { messages: [...] } -> [...]
    |
    v
RETURN { success: boolean, data?: ChatMlMessage[], error?: ZodError }
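
The two VALIDATE steps differ only in which alternative shapes they try. The sketch below illustrates that idea with a pluggable validator standing in for the schema check. The function names are hypothetical, and the order in which the alternatives are attempted is an assumption, since the flow above only says they are "also tried".

```typescript
// A validator stands in for the schema check; any type guard will do.
type Validator<T> = (value: unknown) => value is T;

// If data is a plain object with a messages array, return that array.
function messagesOf(data: unknown): unknown[] | undefined {
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    return undefined;
  }
  const messages = (data as Record<string, unknown>)["messages"];
  return Array.isArray(messages) ? messages : undefined;
}

// Input: try the data as-is, then [[messages]] -> [messages],
// then { messages: [...] } -> [...].
function inputCandidates(data: unknown): unknown[] {
  const candidates: unknown[] = [data];
  if (Array.isArray(data) && data.length === 1 && Array.isArray(data[0])) {
    candidates.push(data[0]);
  }
  const nested = messagesOf(data);
  if (nested) candidates.push(nested);
  return candidates;
}

// Output: wrap a single message in an array, then try { messages: [...] }.
function outputCandidates(data: unknown): unknown[] {
  const candidates: unknown[] = [Array.isArray(data) ? data : [data]];
  const nested = messagesOf(data);
  if (nested) candidates.push(nested);
  return candidates;
}

// Return the first candidate that validates, or undefined if none does.
function firstValid<T>(candidates: unknown[], isValid: Validator<T>): T | undefined {
  for (const candidate of candidates) {
    if (isValid(candidate)) return candidate;
  }
  return undefined;
}
```

A caller would map a defined result to `{ success: true, data }` and an undefined result to a failure carrying the validation error.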

Adapter priority rationale:

  • langgraph first: LangGraph messages share an OpenAI-like format but carry additional class-based metadata that would be lost if they were processed by the openai adapter, so langgraph must be checked before openai.
  • aisdk before openai: Vercel AI SDK wraps OpenAI-compatible messages with additional telemetry metadata. Detecting it early prevents misidentification.
  • generic last: The generic adapter performs no preprocessing and relies entirely on the data already being in ChatML-compatible format. It is the catch-all.

Additional normalization utilities:

  • cleanLegacyOutput() handles legacy { completion: "..." } output format.
  • extractAdditionalInput() extracts non-message keys from input objects (e.g., tools, system prompts passed alongside messages).
  • combineInputOutputMessages() merges normalized input and output into a single conversation thread, defaulting output messages to the "assistant" role.
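
The behaviour described for these utilities can be approximated as follows. These are simplified stand-ins written from the one-line descriptions above, not the real implementations: the `Sketch` function names and the `LooseMessage` type are hypothetical, and details such as returning the bare completion string or returning `undefined` when there are no extra keys are assumptions.

```typescript
// Loose message shape: output messages may arrive without a role.
interface LooseMessage {
  role?: string;
  content?: unknown;
  [extra: string]: unknown;
}

const isPlainObject = (v: unknown): v is Record<string, unknown> =>
  typeof v === "object" && v !== null && !Array.isArray(v);

// Legacy outputs shaped like { completion: "..." } are reduced to the string.
function cleanLegacyOutputSketch(output: unknown): unknown {
  if (isPlainObject(output) && typeof output["completion"] === "string") {
    return output["completion"];
  }
  return output;
}

// Collect every key other than "messages" (e.g. tools) from an input object.
function extractAdditionalInputSketch(input: unknown): Record<string, unknown> | undefined {
  if (!isPlainObject(input)) return undefined;
  const rest: Record<string, unknown> = {};
  for (const key of Object.keys(input)) {
    if (key !== "messages") rest[key] = input[key];
  }
  return Object.keys(rest).length > 0 ? rest : undefined;
}

// Append output to input; output messages without a role become "assistant".
function combineMessagesSketch(input: LooseMessage[], output: LooseMessage[]): LooseMessage[] {
  const withRole = output.map((m) => ({ ...m, role: m.role ?? "assistant" }));
  return [...input, ...withRole];
}
```

Defaulting only a missing role keeps explicit roles intact, so an output message that already declares a role such as "tool" is not relabelled in this sketch.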

