Workflow: MLflow LLM Tracing
| Knowledge Sources | |
|---|---|
| Domains | LLM_Ops, Observability, GenAI |
| Last Updated | 2026-02-13 20:00 GMT |
Overview
End-to-end process for instrumenting LLM and agentic applications with MLflow tracing to capture detailed execution traces with nested spans for debugging and performance monitoring.
Description
This workflow outlines the procedure for adding observability to LLM-powered applications using MLflow's tracing system. It captures the complete execution flow of GenAI applications as hierarchical traces composed of spans — each representing an operation such as an LLM call, retrieval step, tool invocation, or embedding computation. Traces record inputs, outputs, latency, token usage, and error states. The system supports both automatic instrumentation via autologging integrations (OpenAI, LangChain, Anthropic, etc.) and manual instrumentation via decorators and context managers.
Key capabilities:
- Automatic tracing for 15+ LLM frameworks via autolog
- Manual tracing with decorator and context manager APIs
- Nested span hierarchies reflecting call structure
- Token usage, latency, and error tracking per span
- Trace search and filtering via the MLflow UI
Usage
Execute this workflow when you are building or debugging an LLM application, AI agent, or RAG pipeline and need visibility into the internal execution flow — what prompts were sent, what responses were received, how long each step took, and where errors occurred. This applies to applications using OpenAI, LangChain, LlamaIndex, Anthropic, DSPy, PydanticAI, and other supported frameworks.
Execution Steps
Step 1: Enable Autologging or Configure Manual Tracing
Choose between automatic and manual instrumentation. For supported frameworks, enable autologging with a single API call that patches framework methods to emit traces. For custom code, prepare to use the decorator or context manager API.
Key considerations:
- Autologging supports OpenAI, LangChain, Anthropic, Bedrock, DSPy, Mistral, LiteLLM, PydanticAI, and more
- Autolog is called once and patches all subsequent API calls globally
- Manual tracing can be mixed with autologging for custom spans
Step 2: Set Trace Destination
Configure where traces are stored. By default, traces are logged to the active MLflow experiment. Traces can also be directed to Unity Catalog inference tables or other backends depending on deployment environment.
Key considerations:
- Set the experiment with set_experiment to group related traces
- Configure the tracking URI to point to local or remote storage
- Async logging is available for high-throughput applications
Step 3: Instrument Application Code
For autologged frameworks, simply call the framework APIs normally — traces are created automatically. For custom logic, wrap functions with the trace decorator or use the start_span context manager to create manual spans with typed categories (LLM, RETRIEVER, EMBEDDING, TOOL, AGENT, etc.).
Key considerations:
- The trace decorator captures function inputs and outputs automatically
- Span types categorize operations for visualization (LLM, RETRIEVER, TOOL, etc.)
- Custom attributes can be attached to spans for additional metadata
Step 4: Execute Application
Run the LLM application normally. Each invocation of traced functions creates a trace tree with parent-child span relationships. The root span corresponds to the top-level entry point, and child spans represent sub-operations like LLM calls, retrieval steps, and tool invocations.
Key considerations:
- Traces are created per top-level invocation
- Nested function calls produce nested spans automatically
- Streaming responses are supported with span finalization on stream completion
Step 5: Add Assessments and Feedback
Optionally attach human or automated assessments to traces. Feedback (thumbs up/down, ratings), expectations (ground truth), and custom assessments can be logged against specific traces for quality evaluation.
Key considerations:
- Assessments link human judgment to specific traces
- Expectations provide ground truth for automated evaluation
- Feedback supports structured ratings and free-text comments
Step 6: Search and Analyze Traces
Query stored traces using filter expressions to find specific executions. The MLflow UI provides a trace explorer with span-level detail views showing inputs, outputs, attributes, and timing. Programmatic search supports filtering by attributes, status, and timestamps.
Key considerations:
- Search traces by experiment, model ID, status, or custom attributes
- The trace UI shows the full span tree with timing waterfall
- Export traces for offline analysis or evaluation pipelines