Workflow: TruEra TruLens RAG Evaluation With LangChain
| Knowledge Sources | |
|---|---|
| Domains | LLM_Ops, Evaluation, RAG |
| Last Updated | 2026-02-14 08:00 GMT |
Overview
End-to-end process for instrumenting, evaluating, and iterating on a Retrieval-Augmented Generation (RAG) application built with LangChain using TruLens feedback functions and the RAG Triad metrics.
Description
This workflow covers the standard procedure for evaluating a LangChain-based RAG pipeline. It begins with initializing a TruLens session and defining the three core RAG evaluation metrics (context relevance, groundedness, and answer relevance). The LangChain chain is then wrapped with TruChain for automatic instrumentation. Each query produces OTEL trace spans that are evaluated asynchronously by configured feedback functions. Results are stored in the database and surfaced through the TruLens dashboard for comparison across app versions.
Usage
Execute this workflow when you have a LangChain chain or agent that performs retrieval-augmented generation and you need to systematically evaluate its quality. This applies when you want to measure how well retrieved contexts match the query, whether the generated answer is grounded in retrieved evidence, and whether the final answer is relevant to the original question. It is the recommended starting point for any LangChain-based LLM application evaluation.
Execution Steps
Step 1: Initialize TruLens Session
Create a TruSession instance which manages the database connection, trace collection, and feedback evaluation lifecycle. By default, TruSession uses a local SQLite database, but can be configured with PostgreSQL or Snowflake connectors for production use.
Key considerations:
- TruSession is a singleton; only one instance exists per process
- The session must be initialized before wrapping any app
- Configure the database connector if using a non-default backend
Step 2: Configure Feedback Provider
Instantiate a feedback provider (e.g., OpenAI, Cortex, LiteLLM) that powers the LLM-as-judge evaluation. The provider wraps an LLM endpoint and exposes evaluation methods like groundedness, relevance, and context relevance.
Key considerations:
- Set the model engine (e.g., gpt-4o) and API key
- The same provider can be used across multiple feedback functions
- Different providers may produce different evaluation characteristics
Step 3: Define RAG Triad Feedback Functions
Configure the three core feedback functions that form the RAG Triad: context relevance (how relevant is each retrieved chunk to the query), groundedness (is the answer supported by the retrieved evidence), and answer relevance (does the answer address the original question). Each function uses Selectors to extract data from specific OTEL span attributes.
Key considerations:
- Use convenience shortcuts like on_input(), on_output(), and on_context() for common patterns
- For custom extraction, use explicit Selector objects pointing to span types and attributes
- Set aggregation methods (e.g., np.mean) for metrics computed across multiple context chunks
- Chain-of-thought reasoning variants (e.g., relevance_with_cot_reasons) provide explanations alongside scores
Step 4: Wrap LangChain App With TruChain
Wrap the LangChain chain with TruChain, passing app_name, app_version, and the list of feedback functions. TruChain automatically instruments all LangChain components (retrievers, LLMs, tools) to produce OTEL spans without requiring manual decoration.
Key considerations:
- app_name groups related experiments together
- app_version enables side-by-side comparison of different configurations
- The feedbacks parameter attaches evaluation functions that run on each recorded trace
Step 5: Record Application Traces
Execute the application within a TruChain recording context. Each invocation is captured as a complete trace with nested spans for retrieval, generation, and other operations. The trace data is stored in the database and queued for feedback evaluation.
Key considerations:
- Use the context manager pattern: with tru_app as recording
- Multiple queries can be recorded within a single session
- Traces are collected via OpenTelemetry BatchSpanProcessor
Step 6: Retrieve Evaluation Results
Wait for feedback evaluation to complete and retrieve the results. The evaluation runs asynchronously in a background thread by default. Use retrieve_feedback_results() to block until results are available rather than using arbitrary sleep timers.
Key considerations:
- Set a reasonable timeout when calling retrieve_feedback_results()
- Results include scores, reasons (if using CoT variants), and metadata
- Results are automatically persisted to the configured database
Step 7: View Results in Dashboard
Launch the TruLens Streamlit dashboard to visualize traces, compare app versions on the leaderboard, and drill into individual records. The dashboard provides a leaderboard view, records view with trace inspection, and comparison view for version-to-version analysis.
Key considerations:
- The dashboard runs as a separate Streamlit process
- Use session.get_leaderboard() for programmatic access to results
- Compare different app versions to identify improvements or regressions
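The dashboard launch and its programmatic alternative might look like this (assuming `trulens.dashboard.run_dashboard`; run_dashboard spawns the Streamlit process and returns without blocking):

```python
from trulens.core import TruSession
from trulens.dashboard import run_dashboard

session = TruSession()

# Spawns the Streamlit dashboard as a separate process serving on
# localhost; leaderboard, records, and comparison views live there.
run_dashboard(session)

# Programmatic alternative to the leaderboard page: aggregate feedback
# scores per app_name/app_version, e.g. v1_baseline vs. a later v2.
leaderboard = session.get_leaderboard()
print(leaderboard)
</code-omitted>```

Comparing rows of the leaderboard across app_version values is the quickest way to spot whether a configuration change improved or regressed any of the three Triad metrics.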