Overview
Streamlit dashboard tab that enables side-by-side comparison of multiple app versions, including aggregate feedback metrics, per-record difference grids, and detailed trace views for overlapping records.
Description
The Compare Tab module (Compare.py) implements the full comparison workflow for the TruLens dashboard. It allows users to select between 2 and 5 app versions belonging to the same application, fetches records and feedback data for each, and renders:
- App Metric Comparison -- a grouped bar chart (and DataFrame) showing mean feedback scores per version, with a variance or diff column highlighting the largest discrepancies.
- Overlapping Records -- an interactive AgGrid (or native Streamlit dataframe fallback) that joins records across versions by input text, computes per-metric absolute differences (2 versions) or standard deviations (3+ versions), and sorts by aggregate divergence.
- Advanced Filters -- a dynamic form-based filter builder that lets users add comparison clauses (e.g. metric_A of version_1 > metric_A of version_2) to narrow the shared record set.
- Record Comparison -- once a row is selected, renders per-record feedback charts, feedback pill selectors with call details, and trace viewers (standard JSON, OTEL spans, or SIS-compatible JSON) side-by-side for each version.
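The Overlapping Records step described above can be sketched in plain Python. This is an illustrative sketch only, not the module's implementation: the data layout, the function name `overlapping_records`, the `_diff` and `agg` column names, the use of the sample standard deviation, and the choice to aggregate by averaging the per-metric values are all assumptions made for the example.

```python
from statistics import stdev
from typing import Dict, List

# Hypothetical toy data: version -> {input text -> {metric -> score}}.
records_by_version: Dict[str, Dict[str, Dict[str, float]]] = {
    "v1": {
        "q1": {"relevance": 1.0, "groundedness": 0.5},
        "q2": {"relevance": 0.5, "groundedness": 1.0},
        "only in v1": {"relevance": 0.5, "groundedness": 0.5},
    },
    "v2": {
        "q1": {"relevance": 0.25, "groundedness": 0.5},
        "q2": {"relevance": 0.75, "groundedness": 0.75},
        "only in v2": {"relevance": 1.0, "groundedness": 1.0},
    },
}


def overlapping_records(
    by_version: Dict[str, Dict[str, Dict[str, float]]],
    metrics: List[str],
) -> List[dict]:
    """Join records on input text and rank them by how much versions disagree."""
    versions = list(by_version)
    # Keep only inputs that every version has a record for.
    shared = set.intersection(*[set(recs) for recs in by_version.values()])
    rows = []
    for text in shared:
        row = {"input": text}
        for metric in metrics:
            scores = [by_version[v][text][metric] for v in versions]
            if len(versions) == 2:
                # Two versions: absolute difference.
                row[f"{metric}_diff"] = abs(scores[0] - scores[1])
            else:
                # Three or more versions: sample standard deviation (assumed).
                row[f"{metric}_diff"] = stdev(scores)
        # Assumed aggregate: mean of the per-metric divergences.
        row["agg"] = sum(row[f"{m}_diff"] for m in metrics) / len(metrics)
        rows.append(row)
    # Most divergent records first; ties broken by input text.
    rows.sort(key=lambda r: (-r["agg"], r["input"]))
    return rows


rows = overlapping_records(records_by_version, ["relevance", "groundedness"])
```

Records that exist in only one version drop out of the join, which is why the grid is limited to overlapping records.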
The module uses Streamlit session state extensively to persist selected app IDs, column data caches, and filter results across reruns. Query parameters are also synchronized so that comparison URLs can be shared.
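To illustrate how a selection might round-trip through a shareable URL, here is a minimal sketch using only the standard library. The parameter name `app_ids`, the comma-separated encoding, and both helper functions are hypothetical; the module's actual query-parameter format is not shown in this page.

```python
from typing import List, Optional
from urllib.parse import parse_qs, urlencode

APP_IDS_PARAM = "app_ids"  # hypothetical query-parameter name


def to_query_string(app_ids: List[Optional[str]]) -> str:
    """Encode the selected app IDs, skipping empty (None) selector slots."""
    selected = [app_id for app_id in app_ids if app_id is not None]
    return urlencode({APP_IDS_PARAM: ",".join(selected)})


def from_query_string(query: str, min_slots: int = 2) -> List[Optional[str]]:
    """Decode app IDs and pad with None so at least min_slots selectors exist."""
    values = parse_qs(query).get(APP_IDS_PARAM, [""])
    app_ids: List[Optional[str]] = [v for v in values[0].split(",") if v]
    # Multiplying a list by a negative number yields [], so no padding
    # is added once enough IDs are present.
    app_ids += [None] * (min_slots - len(app_ids))
    return app_ids


query = to_query_string(["app_id_v1", None, "app_id_v2"])
restored = from_query_string(query)
```

Padding to two slots mirrors the `[None, None]` default that `init_page_state` applies when no app IDs are present.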
Usage
Use the Compare Tab when you need to evaluate how different versions of the same LLM application differ in feedback metric scores and individual record-level behavior. It is especially useful for A/B testing prompt variations, model upgrades, or pipeline changes. The tab is accessible from the Leaderboard page by selecting 2 to 5 app versions and clicking the "Compare" button, or by navigating directly to the Compare page with query-parameter-specified app IDs.
Code Reference
Source Location
Signature
MAX_COMPARATORS = 5
MIN_COMPARATORS = 2
DEFAULT_COMPARATORS = MIN_COMPARATORS

def init_page_state() -> None: ...

def _preprocess_df(records_df: pd.DataFrame) -> pd.DataFrame: ...

def _feedback_cols_intersect(
    col_data: Dict[str, Dict[str, List[str]]],
) -> List[str]: ...

def _render_all_app_feedback_plot(
    col_data: Dict[str, Dict[str, pd.DataFrame]],
    feedback_cols: List[str],
) -> None: ...

def _highlight_variance(row: pd.Series) -> List[str]: ...

def _render_advanced_filters(
    query_col: pd.DataFrame,
    feedback_cols: List[str],
) -> None: ...

def _build_grid_options(
    df: pd.DataFrame,
    agg_diff_col: str,
    diff_cols: List[str],
    record_id_cols: List[str],
    num_comparators: int,
) -> dict: ...

def _render_grid(
    df: pd.DataFrame,
    agg_diff_col: str,
    diff_cols: List[str],
    record_id_cols: List[str],
    num_comparators: int,
    grid_key: Optional[str] = None,
) -> pd.DataFrame: ...

def _render_shared_records(
    col_data: Dict[str, Dict[str, pd.DataFrame]],
    feedback_cols: List[str],
) -> Optional[pd.DataFrame]: ...

def _lookup_app_version(
    versions_df: pd.DataFrame,
    app_version: Optional[str] = None,
    app_id: Optional[str] = None,
) -> Optional[pd.Series]: ...

def _render_version_selectors(
    app_name: str,
    versions_df: pd.DataFrame,
) -> None: ...

def _reset_page_state() -> None: ...

def render_app_comparison(app_name: str) -> None: ...

def compare_main() -> None: ...
Import
from trulens.dashboard.tabs.Compare import render_app_comparison
from trulens.dashboard.tabs.Compare import compare_main
from trulens.dashboard.tabs.Compare import init_page_state
I/O Contract
Inputs
init_page_state
| Name | Type | Required | Description |
| --- | --- | --- | --- |
| (none) | -- | -- | Reads page_name.app_ids from session state and query parameters. If no app IDs are present, initializes them to [None, None]. |
render_app_comparison
| Name | Type | Required | Description |
| --- | --- | --- | --- |
| app_name | str | yes | The name of the application whose versions will be compared. |
_preprocess_df
| Name | Type | Required | Description |
| --- | --- | --- | --- |
| records_df | pd.DataFrame | yes | DataFrame of records with "input" and "output" columns to be UTF-8 re-encoded. |
_feedback_cols_intersect
| Name | Type | Required | Description |
| --- | --- | --- | --- |
| col_data | Dict[str, Dict[str, List[str]]] | yes | Per-app-id dictionary, each containing a "feedback_cols" list. Returns the intersection of feedback column names across all apps. |
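The intersection described for `_feedback_cols_intersect` can be sketched as follows. This is a hedged illustration rather than the module's code: the public name `feedback_cols_intersect`, the empty-input behavior, and the decision to preserve the first app's column order are assumptions.

```python
from typing import Dict, List


def feedback_cols_intersect(
    col_data: Dict[str, Dict[str, List[str]]],
) -> List[str]:
    """Return feedback columns present for every app, in the first app's order."""
    col_lists = [data["feedback_cols"] for data in col_data.values()]
    if not col_lists:
        return []
    # set.intersection accepts any iterables, so the remaining lists
    # can be passed directly.
    common = set(col_lists[0]).intersection(*col_lists[1:])
    return [col for col in col_lists[0] if col in common]


# Hypothetical app IDs and feedback names.
col_data = {
    "app_id_v1": {"feedback_cols": ["relevance", "groundedness", "toxicity"]},
    "app_id_v2": {"feedback_cols": ["groundedness", "relevance"]},
    "app_id_v3": {"feedback_cols": ["relevance", "groundedness", "coherence"]},
}
shared_cols = feedback_cols_intersect(col_data)
```

Restricting the comparison to shared columns keeps every charted or diffed metric defined for all selected versions.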
_render_all_app_feedback_plot
| Name | Type | Required | Description |
| --- | --- | --- | --- |
| col_data | Dict[str, Dict[str, pd.DataFrame]] | yes | Per-app-id dictionary with "records" (DataFrame) and "version" (str) keys. |
| feedback_cols | List[str] | yes | List of feedback column names to plot. |
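The aggregate view behind the metric comparison chart amounts to one mean per version and metric, plus a column that measures how far the versions disagree. The sketch below shows that idea in plain Python under stated assumptions: the input layout, the name `metric_summary`, the `spread` key, and the use of the sample variance for three or more versions are illustrative, not taken from the module.

```python
from statistics import mean, variance
from typing import Dict, List

# Hypothetical scores: version -> {feedback column -> per-record scores}.
scores_by_version: Dict[str, Dict[str, List[float]]] = {
    "v1": {"relevance": [1.0, 0.5], "groundedness": [0.5, 0.5]},
    "v2": {"relevance": [0.25, 0.75], "groundedness": [0.5, 0.75]},
}


def metric_summary(
    by_version: Dict[str, Dict[str, List[float]]],
    feedback_cols: List[str],
) -> List[dict]:
    """Mean score per version for each metric, plus a disagreement column."""
    versions = list(by_version)
    rows = []
    for col in feedback_cols:
        means = {v: mean(by_version[v][col]) for v in versions}
        values = list(means.values())
        if len(values) == 2:
            # Two versions: absolute difference of the means.
            spread = abs(values[0] - values[1])
        else:
            # Three or more versions: sample variance of the means (assumed).
            spread = variance(values)
        rows.append({"metric": col, "means": means, "spread": spread})
    # Largest discrepancies first, so they are the easiest to spot.
    return sorted(rows, key=lambda r: r["spread"], reverse=True)


summary = metric_summary(scores_by_version, ["groundedness", "relevance"])
```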
_render_advanced_filters
| Name | Type | Required | Description |
| --- | --- | --- | --- |
| query_col | pd.DataFrame | yes | Merged DataFrame with per-version feedback columns, indexed by input. |
| feedback_cols | List[str] | yes | List of base feedback column names available for filtering. |
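A comparison clause such as "metric_A of version_1 > metric_A of version_2" can be modeled as a small record plus an operator lookup. The following is a sketch under assumptions: the `Clause` class, the `(metric, version)` tuple keys, and the set of supported operators are hypothetical and may differ from what the filter form actually offers.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumed operator set; the real filter form may offer a different one.
OPERATORS: Dict[str, Callable[[float, float], bool]] = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


@dataclass(frozen=True)
class Clause:
    """One filter clause: <left metric of version> <op> <right metric of version>."""

    left_metric: str
    left_version: str
    op: str
    right_metric: str
    right_version: str


def apply_clauses(rows: List[dict], clauses: List[Clause]) -> List[dict]:
    """Keep only rows for which every clause holds (clauses are AND-ed)."""

    def keep(row: dict) -> bool:
        return all(
            OPERATORS[c.op](
                row[(c.left_metric, c.left_version)],
                row[(c.right_metric, c.right_version)],
            )
            for c in clauses
        )

    return [row for row in rows if keep(row)]


# Hypothetical merged rows: one per shared input, keyed by (metric, version).
merged_rows = [
    {"input": "q1", ("relevance", "v1"): 1.0, ("relevance", "v2"): 0.25},
    {"input": "q2", ("relevance", "v1"): 0.5, ("relevance", "v2"): 0.75},
]
regressions = apply_clauses(
    merged_rows, [Clause("relevance", "v1", ">", "relevance", "v2")]
)
```

With no clauses, `all()` over an empty sequence is true, so the full shared record set passes through unchanged.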
_build_grid_options
| Name | Type | Required | Description |
| --- | --- | --- | --- |
| df | pd.DataFrame | yes | DataFrame to be displayed in the AgGrid. |
| agg_diff_col | str | yes | Column name for the aggregate difference/variance metric (e.g. "Mean Diff"). |
| diff_cols | List[str] | yes | Per-feedback difference column names. |
| record_id_cols | List[str] | yes | Per-version record ID columns to hide. |
| num_comparators | int | yes | Number of app versions being compared (affects tooltip text). |
_render_grid
| Name | Type | Required | Description |
| --- | --- | --- | --- |
| df | pd.DataFrame | yes | DataFrame to render in the grid. |
| agg_diff_col | str | yes | Aggregate difference column name. |
| diff_cols | List[str] | yes | Individual difference column names. |
| record_id_cols | List[str] | yes | Hidden record ID column names. |
| num_comparators | int | yes | Number of app versions being compared. |
| grid_key | Optional[str] | no | Streamlit key for the AgGrid widget. Defaults to None. |
_lookup_app_version
| Name | Type | Required | Description |
| --- | --- | --- | --- |
| versions_df | pd.DataFrame | yes | DataFrame containing app_version and app_id columns. |
| app_version | Optional[str] | no | App version string to look up. Mutually exclusive with app_id. |
| app_id | Optional[str] | no | App ID string to look up. Mutually exclusive with app_version. |
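The lookup contract above (exactly one of `app_version` or `app_id`, returning the matching row or None) can be sketched without pandas. This is an assumption-laden illustration: the real function operates on a DataFrame and returns a `pd.Series`, and raising `ValueError` when zero or both arguments are given is a choice made for this sketch, not documented behavior.

```python
from typing import Dict, List, Optional

Row = Dict[str, str]


def lookup_app_version(
    versions: List[Row],
    app_version: Optional[str] = None,
    app_id: Optional[str] = None,
) -> Optional[Row]:
    """Return the first row matching app_version or app_id, else None."""
    # The two arguments are mutually exclusive: exactly one must be provided.
    if (app_version is None) == (app_id is None):
        raise ValueError("Pass exactly one of app_version or app_id.")
    if app_version is not None:
        key, value = "app_version", app_version
    else:
        key, value = "app_id", app_id
    return next((row for row in versions if row[key] == value), None)


# Hypothetical version table, as a list of dicts instead of a DataFrame.
versions = [
    {"app_version": "v1", "app_id": "app_id_v1"},
    {"app_version": "v2", "app_id": "app_id_v2"},
]
by_version = lookup_app_version(versions, app_version="v2")
by_id = lookup_app_version(versions, app_id="app_id_v1")
missing = lookup_app_version(versions, app_id="does_not_exist")
```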
compare_main
| Name | Type | Required | Description |
| --- | --- | --- | --- |
| (none) | -- | -- | Entry point function. Calls set_page_config, init_page_state, render_sidebar, and render_app_comparison. |
Outputs
render_app_comparison
| Name | Type | Description |
| --- | --- | --- |
| (none -- renders to Streamlit) | None | Renders the complete comparison UI to the Streamlit page. |
_feedback_cols_intersect
| Name | Type | Description |
| --- | --- | --- |
| feedback_cols | List[str] | The intersection of feedback column names across all compared app versions. |
_render_shared_records
| Name | Type | Description |
| --- | --- | --- |
| selected_record_ids | Optional[pd.DataFrame] | DataFrame with per-version record_id columns for the user-selected row, or None if no row is selected. |
_render_grid
| Name | Type | Description |
| --- | --- | --- |
| selected_rows | pd.DataFrame | DataFrame of user-selected rows from the grid (may be empty). |
_lookup_app_version
| Name | Type | Description |
| --- | --- | --- |
| row | Optional[pd.Series] | The matching row from versions_df, or None if not found. |
Usage Examples
# Example 1: Running the compare tab as a standalone Streamlit page
# (typically invoked as: streamlit run tabs/Compare.py)
from trulens.dashboard.tabs.Compare import compare_main

if __name__ == "__main__":
    compare_main()

# Example 2: Programmatically navigating to the Compare tab from the Leaderboard
import streamlit as st
from trulens.dashboard.constants import COMPARE_PAGE_NAME

# After selecting app versions on the Leaderboard tab:
selected_app_ids = ["app_id_v1", "app_id_v2", "app_id_v3"]
st.session_state[f"{COMPARE_PAGE_NAME}.app_ids"] = selected_app_ids
st.switch_page("tabs/Compare.py")

# Example 3: Rendering the comparison view within a custom Streamlit layout
from trulens.dashboard.tabs.Compare import init_page_state, render_app_comparison

init_page_state()
render_app_comparison(app_name="my_chatbot_app")
Related Pages