Principle:Truera Trulens Feedback Display Formatting

From Leeroopedia
Knowledge Sources
Domains: Feedback Visualization, UX Design, ML Evaluation, Dashboard UI
Last Updated: 2026-02-14 08:00 GMT

Overview

Feedback Display Formatting is the principle of converting raw numerical feedback evaluation scores into human-readable visual representations through score-to-category mapping, color coding, icon selection, DataFrame cell highlighting, interactive pill rendering, and specialized expansions for structured feedback types such as groundedness.

Description

TruLens evaluates LLM applications by running feedback functions (metrics) that produce numerical scores, typically in the range [0, 1]. Raw numbers alone are insufficient for rapid human comprehension -- a score of 0.72 means little without context about whether that value represents a passing, warning, or failing result, and whether higher or lower values are desirable. The Feedback Display Formatting principle defines the complete pipeline for transforming these raw scores into visually informative representations across multiple rendering surfaces.

The principle encompasses several interconnected concerns:

Score-to-Category Mapping -- A classification system that maps continuous scores to discrete categories (PASS, WARNING, FAIL, UNKNOWN, DISTANCE) based on configurable thresholds. The mapping is direction-aware: for "higher is better" metrics, scores of 0.8 or above are PASS, scores from 0.6 up to (but not including) 0.8 are WARNING, and scores below 0.6 are FAIL; for "lower is better" metrics, the thresholds are inverted (0.2 or below is PASS, above 0.2 up to 0.4 is WARNING, above 0.4 is FAIL). Each category carries a color (green, yellow, red, or gray), an icon (checkmark, warning sign, stop sign, question mark, or ruler), and an adjective (such as high, medium, or low).

Cell Highlighting -- When feedback scores appear in tabular data (both AgGrid and native Streamlit DataFrames), cell background colors are set according to the score's category. This provides at-a-glance quality assessment across many records. The highlighting rules are expressed as JavaScript expressions for AgGrid (evaluated client-side) and as pandas Styler functions for Streamlit DataFrames (evaluated server-side).

Feedback Pill Rendering -- Individual feedback results are displayed as interactive "pills" (using Streamlit's st.pills component when available, falling back to st.selectbox on older versions). Each pill shows the metric name, its icon, and the numerical score. Clicking a pill reveals the detailed feedback call data underneath.

Feedback Call Detail Display -- When a user selects a feedback pill, the underlying evaluation calls are displayed in an expanded DataFrame. For OTEL-based spans, the system first separates EVAL_ROOT spans from EVAL spans, filters to only the most recent evaluation root (handling re-evaluation deduplication), then processes the EVAL spans into a tabular display of arguments and scores.

Groundedness Expansion -- The groundedness feedback type receives special treatment. Its "reasons" or "explanation" fields contain structured data (either as JSON lists of dictionaries or as formatted strings) that are parsed and expanded into a three-column table: Statement, Supporting Evidence from Source, and Score. The parser handles multiple data formats: new-style JSON lists of dictionaries, legacy list-of-strings with regex extraction, and plain-text STATEMENT blocks.

Jupyter Widget Rendering -- For notebook environments, the AppUI widget system provides an ipywidgets-based interface with interactive selectors, HTML rendering of values, and live-updating record displays. This is a separate rendering surface from the Streamlit dashboard but shares the same underlying feedback data model.

In the ML observability landscape, this principle ensures that evaluation results are not just stored but are actionable -- users can immediately identify problematic records, understand metric trends, and drill into the reasoning behind individual scores without manual data manipulation.

Usage

This pattern is appropriate when:

  • Feedback scores need to be displayed across multiple rendering surfaces (Streamlit dashboard, AgGrid tables, pandas DataFrames, Jupyter notebooks) with consistent visual semantics.
  • Metrics have different directionality -- some where higher is better (e.g., relevance, coherence) and some where lower is better (e.g., toxicity, hallucination rate) -- requiring direction-aware formatting.
  • The system must support progressive disclosure of feedback details: summary pills at the top level, expandable call details on selection, and specialized views for specific feedback types like groundedness.
  • Backward compatibility across multiple data formats is required -- the same display logic must handle legacy string-based feedback reasons, modern JSON-structured reasons, and OTEL span-based evaluation data.
  • Deduplication of re-evaluations is needed -- when the same metric is re-run on a record, only the most recent evaluation result should be displayed, determined by comparing EVAL_ROOT timestamps.

Theoretical Basis

The feedback display formatting pipeline operates through the following abstract stages:

Stage 1: Direction Resolution

For each feedback metric, determine whether higher scores are better or lower scores are better. This information is stored in the feedback definition and retrieved as a dictionary mapping feedback names to boolean higher_is_better flags. A system-wide default direction (HIGHER_IS_BETTER) is used when no explicit direction is specified.
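As a minimal sketch of this stage, the lookup can be modeled as a dictionary built from feedback definitions, with the default applied wherever a definition carries no explicit direction. The dictionary keys `name` and `higher_is_better` and the function name `resolve_directions` are hypothetical, chosen for illustration; they are not taken from the TruLens source.

```python
from typing import Dict, Iterable, Mapping, Optional

# System-wide default described in the text: higher is better.
DEFAULT_HIGHER_IS_BETTER = True


def resolve_directions(
    feedback_defs: Iterable[Mapping[str, Optional[object]]],
) -> Dict[str, bool]:
    """Map each feedback name to a boolean higher_is_better flag.

    Each definition is a hypothetical mapping with a "name" key and an
    optional "higher_is_better" key; a missing or None value falls back
    to the default direction.
    """
    directions: Dict[str, bool] = {}
    for definition in feedback_defs:
        flag = definition.get("higher_is_better")
        directions[str(definition["name"])] = (
            DEFAULT_HIGHER_IS_BETTER if flag is None else bool(flag)
        )
    return directions


directions = resolve_directions([
    {"name": "relevance", "higher_is_better": True},
    {"name": "toxicity", "higher_is_better": False},
    {"name": "coherence"},  # no explicit direction: default applies
])
```

Here `directions["coherence"]` resolves to the default, while `toxicity` keeps its explicit lower-is-better flag.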

Stage 2: Score Classification

Given a score and its direction, classify it into a category:

 FUNCTION classify(score, higher_is_better, is_distance):
     IF score is null or NaN:
         RETURN UNKNOWN (gray, "?")
     IF is_distance:
         RETURN DISTANCE (gray, ruler icon)
     direction = HIGHER_IS_BETTER if higher_is_better else LOWER_IS_BETTER
     thresholds = get_thresholds(direction)
     FOR EACH category IN [PASS, WARNING, FAIL]:
         IF direction is HIGHER_IS_BETTER:
             IF score >= thresholds[category]: RETURN category
         ELSE:
             IF score <= thresholds[category]: RETURN category
     RETURN UNKNOWN

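Categories are tested in the order PASS, WARNING, FAIL, and the first match wins. For HIGHER_IS_BETTER, the thresholds are: PASS >= 0.8, WARNING >= 0.6, FAIL >= 0.0, so a negative score falls through to UNKNOWN. For LOWER_IS_BETTER, they are: PASS <= 0.2, WARNING <= 0.4, FAIL <= 1.0, so a score above 1.0 falls through to UNKNOWN.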
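The classification can be sketched in Python as follows. This is an illustrative version that uses the thresholds stated above; the `Category` enum, the `is_distance` parameter, and the threshold tables are assumptions made for this example rather than the library's actual definitions.

```python
import math
from enum import Enum
from typing import Optional


class Category(Enum):
    PASS = "pass"
    WARNING = "warning"
    FAIL = "fail"
    UNKNOWN = "unknown"
    DISTANCE = "distance"


# Thresholds from the text, listed in the order they are tested.
HIGHER_IS_BETTER = [(Category.PASS, 0.8), (Category.WARNING, 0.6), (Category.FAIL, 0.0)]
LOWER_IS_BETTER = [(Category.PASS, 0.2), (Category.WARNING, 0.4), (Category.FAIL, 1.0)]


def classify(
    score: Optional[float],
    higher_is_better: bool = True,
    is_distance: bool = False,  # hypothetical flag for distance metrics
) -> Category:
    """Return the first category whose threshold the score satisfies."""
    if score is None or math.isnan(score):
        return Category.UNKNOWN
    if is_distance:
        return Category.DISTANCE
    thresholds = HIGHER_IS_BETTER if higher_is_better else LOWER_IS_BETTER
    for category, threshold in thresholds:
        matched = score >= threshold if higher_is_better else score <= threshold
        if matched:
            return category
    # Out-of-range scores (negative, or above 1.0 when lower is better).
    return Category.UNKNOWN
```

With these thresholds, a score of 0.72 is a WARNING for a higher-is-better metric but a FAIL for a lower-is-better one, which is why the direction has to be resolved before classification.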

Stage 3: Visual Encoding

Each category maps to a fixed visual encoding:

Category | Color             | Icon          | Adjective
PASS     | Green (#aaffaa70) | Checkmark     | "high" or "low" (depending on direction)
WARNING  | Yellow (#ffffaa70)| Warning sign  | "medium"
FAIL     | Red (#ffaaaa70)   | Stop sign     | "low" or "high" (depending on direction)
UNKNOWN  | Gray (#aaaaaa44)  | Question mark | "unknown"
DISTANCE | Gray (#808080)    | Ruler         | "distance"
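One way to hold this encoding in code is a lookup table keyed by category name. The sketch below takes the colors and adjectives from the table; the specific icon glyphs, the `Encoding` tuple, and the `adjective` helper are illustrative assumptions.

```python
from typing import Dict, NamedTuple


class Encoding(NamedTuple):
    color: str             # background color, hex (some with an alpha channel)
    icon: str              # glyph chosen for illustration only
    adjective_higher: str  # adjective when higher is better
    adjective_lower: str   # adjective when lower is better


ENCODINGS: Dict[str, Encoding] = {
    "PASS": Encoding("#aaffaa70", "✅", "high", "low"),
    "WARNING": Encoding("#ffffaa70", "⚠️", "medium", "medium"),
    "FAIL": Encoding("#ffaaaa70", "🛑", "low", "high"),
    "UNKNOWN": Encoding("#aaaaaa44", "?", "unknown", "unknown"),
    "DISTANCE": Encoding("#808080", "📏", "distance", "distance"),
}


def adjective(category: str, higher_is_better: bool = True) -> str:
    """Pick the direction-dependent adjective for a category."""
    encoding = ENCODINGS[category]
    return encoding.adjective_higher if higher_is_better else encoding.adjective_lower
```

Only PASS and FAIL swap adjectives with direction: a passing score is "high" for a higher-is-better metric and "low" for a lower-is-better one.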

Stage 4: Cell Rule Generation

For AgGrid rendering, JavaScript expressions are pre-generated for each direction. These expressions evaluate the cell value against thresholds and assign CSS classes (cat-pass, cat-warning, cat-fail, cat-unknown) that map to background colors. For pandas DataFrames, a Python styling function applies the same logic per-row or per-cell.
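The two rule forms can be sketched side by side. The first function builds JavaScript expression strings of the kind AG Grid accepts in `cellClassRules`, assuming its expression shorthand in which `x` stands for the cell value. The second returns a CSS declaration for a single cell, a shape that can be handed to a pandas Styler's element-wise styling methods. Both function names and the exact expressions are assumptions for illustration; the thresholds and class names come from the text above.

```python
import math
from typing import Dict, Optional

COLORS = {
    "pass": "#aaffaa70",
    "warning": "#ffffaa70",
    "fail": "#ffaaaa70",
    "unknown": "#aaaaaa44",
}


def cell_class_rules(higher_is_better: bool = True) -> Dict[str, str]:
    """Build CSS-class -> JavaScript-expression rules for one direction."""
    op, beyond = (">=", "<") if higher_is_better else ("<=", ">")
    pass_t, warn_t, fail_t = (0.8, 0.6, 0.0) if higher_is_better else (0.2, 0.4, 1.0)
    # Guard against null/undefined and NaN; in JavaScript, null coerces to 0
    # in comparisons and would otherwise match a threshold rule.
    known = "x != null && x === x"
    return {
        "cat-pass": f"{known} && x {op} {pass_t}",
        "cat-warning": f"{known} && x {op} {warn_t} && x {beyond} {pass_t}",
        "cat-fail": f"{known} && x {op} {fail_t} && x {beyond} {warn_t}",
        "cat-unknown": f"!({known}) || x {beyond} {fail_t}",
    }


def cell_style(value: Optional[float], higher_is_better: bool = True) -> str:
    """Return a CSS background declaration for a single score cell."""
    if value is None or math.isnan(value):
        category = "unknown"
    elif higher_is_better:
        category = ("pass" if value >= 0.8 else "warning" if value >= 0.6
                    else "fail" if value >= 0.0 else "unknown")
    else:
        category = ("pass" if value <= 0.2 else "warning" if value <= 0.4
                    else "fail" if value <= 1.0 else "unknown")
    return f"background-color: {COLORS[category]}"
```

Generating the expressions once per direction keeps the client-side rules and the server-side styling function tied to the same threshold values.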

Stage 5: Pill Composition

Feedback pills are composed by iterating over feedback column names, filtering to those with non-null values for the selected record, sorting alphabetically, and formatting each as:

 "{icon} {metric_name} {score:.2f}"

The pills component returns the user's selection, which triggers detail rendering.
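A minimal sketch of the label composition follows, assuming the record's scores arrive as a plain dictionary and that an icon lookup is supplied by the caller. The names `compose_pill_labels` and `icon_for` are hypothetical.

```python
import math
from typing import Callable, Dict, List, Optional


def compose_pill_labels(
    scores: Dict[str, Optional[float]],
    icon_for: Callable[[str, float], str],
) -> List[str]:
    """Format one label per metric that has a score for the selected record."""
    labels = []
    for name in sorted(scores):  # alphabetical order by metric name
        score = scores[name]
        if score is None or math.isnan(score):
            continue  # skip metrics with no value for this record
        labels.append(f"{icon_for(name, score)} {name} {score:.2f}")
    return labels


labels = compose_pill_labels(
    {"relevance": 0.9, "coherence": 0.72, "toxicity": None},
    icon_for=lambda name, score: "+" if score >= 0.8 else "!",
)
```

The resulting list would then serve as the options of the selection widget, whose return value identifies the metric to expand.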

Stage 6: Evaluation Call Processing

When a pill is selected, the raw feedback call data is processed:

 FUNCTION process_feedback_calls(calls):
     Separate calls into EVAL_ROOT spans and EVAL spans
     IF EVAL_ROOT spans exist:
         Deduplicate by (target_span_id, target_span_attribute)
         Keep only EVAL_ROOTs with the most recent timestamp
         Filter EVAL spans to only those matching retained EVAL_ROOT IDs
     Convert EVAL span arguments to formatted strings
     Build DataFrame with columns from arguments plus "score" and "meta"
     Expand meta dictionary into additional columns
     RETURN formatted DataFrame
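The deduplication and flattening steps can be sketched over plain dictionaries. Every field name here (`span_type`, `span_id`, `eval_root_id`, `target_span_id`, `target_span_attribute`, `timestamp`, `args`, `score`, `meta`) is a hypothetical stand-in for the real span attributes, and the sketch assumes that "most recent" is decided per (target_span_id, target_span_attribute) pair. It returns a list of row dictionaries from which a DataFrame could be built.

```python
from typing import Any, Dict, List


def process_feedback_calls(calls: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep EVAL spans of the newest EVAL_ROOT per target and flatten them."""
    roots = [c for c in calls if c["span_type"] == "EVAL_ROOT"]
    evals = [c for c in calls if c["span_type"] == "EVAL"]

    if roots:
        # Re-evaluation deduplication: newest root wins for each target.
        latest: Dict[tuple, Dict[str, Any]] = {}
        for root in roots:
            key = (root.get("target_span_id"), root.get("target_span_attribute"))
            if key not in latest or root["timestamp"] > latest[key]["timestamp"]:
                latest[key] = root
        kept_ids = {root["span_id"] for root in latest.values()}
        evals = [e for e in evals if e.get("eval_root_id") in kept_ids]

    rows = []
    for span in evals:
        # Arguments become string columns, followed by the score.
        row: Dict[str, Any] = {k: str(v) for k, v in span.get("args", {}).items()}
        row["score"] = span.get("score")
        # Expand the meta dictionary into additional columns.
        row.update(span.get("meta", {}))
        rows.append(row)
    return rows
```

When no EVAL_ROOT spans are present, every EVAL span is kept, matching the conditional in the pseudocode.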

Stage 7: Groundedness Expansion

For groundedness-type feedback, the DataFrame undergoes additional processing:

 FUNCTION expand_groundedness(df):
     IF "reasons" column contains list-of-dict data:
         Parse into Statement, Supporting Evidence, Score columns
     ELSE IF "explanation" column contains JSON string:
         Parse JSON into structured data
     ELSE IF "explanation" column contains list of strings:
         Extract Criteria, Supporting Evidence, Score via regex
     ELSE IF text contains "STATEMENT N:" blocks:
         Parse structured text into tabular form
     RETURN expanded DataFrame with three columns

This multi-format parser ensures backward compatibility across all historical versions of the groundedness feedback output format.
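The dispatch between formats can be illustrated with a small parser. The source does not spell out the exact field names or text layout, so the dictionary keys (`criteria`, `supporting_evidence`, `score`) and the "STATEMENT N: / Criteria: / Supporting Evidence: / Score:" layout used below are assumptions made for this sketch and may differ from the real formats.

```python
import json
import re
from typing import Any, Dict, List

COLUMNS = ("Statement", "Supporting Evidence from Source", "Score")

# Assumed plain-text layout of one STATEMENT block (illustrative only).
_BLOCK = re.compile(
    r"STATEMENT \d+:\s*Criteria:\s*(.*?)\s*"
    r"Supporting Evidence:\s*(.*?)\s*Score:\s*(\d+(?:\.\d+)?)",
    re.DOTALL,
)


def _from_dicts(items: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    # Hypothetical key names for the new-style list-of-dict format.
    return [
        dict(zip(COLUMNS, (d.get("criteria"), d.get("supporting_evidence"), d.get("score"))))
        for d in items
    ]


def _from_text(text: str) -> List[Dict[str, Any]]:
    return [
        dict(zip(COLUMNS, (statement, evidence, float(score))))
        for statement, evidence, score in _BLOCK.findall(text)
    ]


def expand_groundedness(reasons: Any) -> List[Dict[str, Any]]:
    """Normalize any supported reasons format into three-column rows."""
    if isinstance(reasons, list) and all(isinstance(r, dict) for r in reasons):
        return _from_dicts(reasons)
    if isinstance(reasons, list):  # legacy list of strings
        return [row for text in reasons for row in _from_text(str(text))]
    if isinstance(reasons, str):
        try:
            parsed = json.loads(reasons)
        except json.JSONDecodeError:
            return _from_text(reasons)  # plain-text STATEMENT blocks
        return expand_groundedness(parsed) if isinstance(parsed, list) else []
    return []
```

Trying the structured formats first and falling back to regex extraction keeps older records readable without a migration step.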
