Workflow: TruEra TruLens RAG Evaluation With LangChain
| Knowledge Sources | |
|---|---|
| Domains | LLM_Ops, Evaluation, RAG |
| Last Updated | 2026-02-14 08:00 GMT |
Overview
End-to-end process for instrumenting, evaluating, and iterating on a Retrieval-Augmented Generation (RAG) application built with LangChain using TruLens feedback functions and the RAG Triad metrics.
Description
This workflow covers the standard procedure for evaluating a LangChain-based RAG pipeline. It begins with initializing a TruLens session and defining the three core RAG evaluation metrics (context relevance, groundedness, and answer relevance). The LangChain chain is then wrapped with TruChain for automatic instrumentation. Each query produces OTEL trace spans that are evaluated asynchronously by configured feedback functions. Results are stored in the database and surfaced through the TruLens dashboard for comparison across app versions.
Usage
Execute this workflow when you have a LangChain chain or agent that performs retrieval-augmented generation and you need to systematically evaluate its quality. This applies when you want to measure how well retrieved contexts match the query, whether the generated answer is grounded in retrieved evidence, and whether the final answer is relevant to the original question. It is the recommended starting point for any LangChain-based LLM application evaluation.
Execution Steps
Step 1: Initialize TruLens Session
Create a TruSession instance which manages the database connection, trace collection, and feedback evaluation lifecycle. By default, TruSession uses a local SQLite database, but can be configured with PostgreSQL or Snowflake connectors for production use.
Key considerations:
- TruSession is a singleton; only one instance exists per process
- The session must be initialized before wrapping any app
- Configure the database connector if using a non-default backend
Step 2: Configure Feedback Provider
Instantiate a feedback provider (e.g., OpenAI, Cortex, LiteLLM) that powers the LLM-as-judge evaluation. The provider wraps an LLM endpoint and exposes evaluation methods like groundedness, relevance, and context relevance.
Key considerations:
- Set the model engine (e.g., gpt-4o) and API key
- The same provider can be used across multiple feedback functions
- Different providers may produce different evaluation characteristics
Step 3: Define RAG Triad Feedback Functions
Configure the three core feedback functions that form the RAG Triad: context relevance (how relevant is each retrieved chunk to the query), groundedness (is the answer supported by the retrieved evidence), and answer relevance (does the answer address the original question). Each function uses Selectors to extract data from specific OTEL span attributes.
Key considerations:
- Use convenience shortcuts like on_input(), on_output(), and on_context() for common patterns
- For custom extraction, use explicit Selector objects pointing to span types and attributes
- Set aggregation methods (e.g., np.mean) for metrics computed across multiple context chunks
- Chain-of-thought reasoning variants (e.g., relevance_with_cot_reasons) provide explanations alongside scores
Step 4: Wrap LangChain App With TruChain
Wrap the LangChain chain with TruChain, passing app_name, app_version, and the list of feedback functions. TruChain automatically instruments all LangChain components (retrievers, LLMs, tools) to produce OTEL spans without requiring manual decoration.
Key considerations:
- app_name groups related experiments together
- app_version enables side-by-side comparison of different configurations
- The feedbacks parameter attaches evaluation functions that run on each recorded trace
Step 5: Record Application Traces
Execute the application within a TruChain recording context. Each invocation is captured as a complete trace with nested spans for retrieval, generation, and other operations. The trace data is stored in the database and queued for feedback evaluation.
Key considerations:
- Use the context manager pattern: with tru_app as recording
- Multiple queries can be recorded within a single session
- Traces are collected via OpenTelemetry BatchSpanProcessor
Step 6: Retrieve Evaluation Results
Wait for feedback evaluation to complete and retrieve the results. The evaluation runs asynchronously in a background thread by default. Use retrieve_feedback_results() to block until results are available rather than using arbitrary sleep timers.
Key considerations:
- Set a reasonable timeout when calling retrieve_feedback_results()
- Results include scores, reasons (if using CoT variants), and metadata
- Results are automatically persisted to the configured database
Step 7: View Results in Dashboard
Launch the TruLens Streamlit dashboard to visualize traces, compare app versions on the leaderboard, and drill into individual records. The dashboard provides a leaderboard view, records view with trace inspection, and comparison view for version-to-version analysis.
Key considerations:
- The dashboard runs as a separate Streamlit process
- Use session.get_leaderboard() for programmatic access to results
- Compare different app versions to identify improvements or regressions
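The dashboard launch and its programmatic alternative might look like this (assuming `trulens.dashboard.run_dashboard`; run_dashboard spawns the Streamlit process and returns without blocking):

```python
from trulens.core import TruSession
from trulens.dashboard import run_dashboard

session = TruSession()

# Spawns the Streamlit dashboard as a separate process serving on
# localhost; leaderboard, records, and comparison views live there.
run_dashboard(session)

# Programmatic alternative to the leaderboard page: aggregate feedback
# scores per app_name/app_version, e.g. v1_baseline vs. a later v2.
leaderboard = session.get_leaderboard()
print(leaderboard)
</code-omitted>```

Comparing rows of the leaderboard across app_version values is the quickest way to spot whether a configuration change improved or regressed any of the three Triad metrics.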