
Workflow: MLflow LLM Tracing

From Leeroopedia
Knowledge Sources
Domains LLM_Ops, Observability, GenAI
Last Updated 2026-02-13 20:00 GMT

Overview

End-to-end process for instrumenting LLM and agentic applications with MLflow tracing to capture detailed execution traces with nested spans for debugging and performance monitoring.

Description

This workflow outlines the procedure for adding observability to LLM-powered applications using MLflow's tracing system. It captures the complete execution flow of GenAI applications as hierarchical traces composed of spans — each representing an operation such as an LLM call, retrieval step, tool invocation, or embedding computation. Traces record inputs, outputs, latency, token usage, and error states. The system supports both automatic instrumentation via autologging integrations (OpenAI, LangChain, Anthropic, etc.) and manual instrumentation via decorators and context managers.

Key capabilities:

  • Automatic tracing for 15+ LLM frameworks via autolog
  • Manual tracing with decorator and context manager APIs
  • Nested span hierarchies reflecting call structure
  • Token usage, latency, and error tracking per span
  • Trace search and filtering via the MLflow UI

Usage

Execute this workflow when you are building or debugging an LLM application, AI agent, or RAG pipeline and need visibility into the internal execution flow — what prompts were sent, what responses were received, how long each step took, and where errors occurred. This applies to applications using OpenAI, LangChain, LlamaIndex, Anthropic, DSPy, PydanticAI, and other supported frameworks.

Execution Steps

Step 1: Enable Autologging or Configure Manual Tracing

Choose between automatic and manual instrumentation. For supported frameworks, enable autologging with a single API call, which patches framework methods to emit traces automatically. For custom code, use the decorator or context manager API instead.

Key considerations:

  • Autologging supports OpenAI, LangChain, Anthropic, Bedrock, DSPy, Mistral, LiteLLM, PydanticAI, and more
  • Autolog is called once and patches all subsequent API calls globally
  • Manual tracing can be mixed with autologging for custom spans

Step 2: Set Trace Destination

Configure where traces are stored. By default, traces are logged to the active MLflow experiment. Traces can also be directed to Unity Catalog inference tables or other backends depending on deployment environment.

Key considerations:

  • Set the experiment with set_experiment to group related traces
  • Configure the tracking URI to point to local or remote storage
  • Async logging is available for high-throughput applications

Step 3: Instrument Application Code

For autologged frameworks, simply call the framework APIs normally — traces are created automatically. For custom logic, wrap functions with the trace decorator or use the start_span context manager to create manual spans with typed categories (LLM, RETRIEVER, EMBEDDING, TOOL, AGENT, etc.).

Key considerations:

  • The trace decorator captures function inputs and outputs automatically
  • Span types categorize operations for visualization (LLM, RETRIEVER, TOOL, etc.)
  • Custom attributes can be attached to spans for additional metadata

Step 4: Execute Application

Run the LLM application normally. Each invocation of traced functions creates a trace tree with parent-child span relationships. The root span corresponds to the top-level entry point, and child spans represent sub-operations like LLM calls, retrieval steps, and tool invocations.

Key considerations:

  • Traces are created per top-level invocation
  • Nested function calls produce nested spans automatically
  • Streaming responses are supported with span finalization on stream completion

Step 5: Add Assessments and Feedback

Optionally attach human or automated assessments to traces. Feedback (thumbs up/down, ratings), expectations (ground truth), and custom assessments can be logged against specific traces for quality evaluation.

Key considerations:

  • Assessments link human judgment to specific traces
  • Expectations provide ground truth for automated evaluation
  • Feedback supports structured ratings and free-text comments

Step 6: Search and Analyze Traces

Query stored traces using filter expressions to find specific executions. The MLflow UI provides a trace explorer with span-level detail views showing inputs, outputs, attributes, and timing. Programmatic search supports filtering by attributes, status, and timestamps.

Key considerations:

  • Search traces by experiment, model ID, status, or custom attributes
  • The trace UI shows the full span tree with timing waterfall
  • Export traces for offline analysis or evaluation pipelines

Execution Diagram

GitHub URL

Workflow Repository