Workflow:Evidentlyai Evidently LLM Evaluation Monitoring

Knowledge Sources	Evidently, Evidently Docs, Evidently Blog
Domains	LLM_Ops, LLM_Evaluation, Monitoring
Last Updated	2026-02-14 10:00 GMT

Overview

End-to-end process for evaluating LLM-powered system outputs using Evidently descriptors (sentiment analysis, LLM judges, text quality checks) and monitoring evaluation results over time via database storage and Grafana dashboards.

Description

This workflow covers the standard procedure for evaluating and monitoring Large Language Model outputs in production. It uses Evidently's descriptor framework to apply row-level evaluations to LLM responses, including sentiment analysis, LLM-as-a-judge evaluations (negativity, decline detection), and text statistics (sentence count). Evaluation results are computed per response, stored in PostgreSQL, and visualized in Grafana for ongoing quality monitoring.

Goal: A continuous monitoring pipeline that evaluates every LLM response against quality criteria and stores scored results for trend analysis and alerting.

Scope: From raw LLM question-answer pairs through descriptor-based evaluation to stored metrics and dashboard visualization.

Strategy: Uses row-level descriptors to evaluate each LLM response individually, combining model-based evaluators (NLTK sentiment, LLM judges) with rule-based checks, then persists results for time-series monitoring.

Usage

Use this workflow when you operate a chatbot, question-answering system, or any other LLM-powered application and need to continuously monitor the quality of generated responses. It is designed for detecting quality degradation, tracking tone shifts, identifying increases in refusal rates, and maintaining service level objectives for LLM outputs.

Execution Steps

Step 1: Collect LLM Interactions

Capture question-answer pairs from the LLM system along with metadata such as token usage (input/output tokens), timestamps, and any additional context. Each interaction becomes a row in the evaluation dataset.

Key considerations:

Store both the input prompt and the generated response
Capture token usage metrics for cost monitoring
Include timestamps for temporal analysis
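
A minimal sketch of how one interaction could be captured as a row, using only the standard library. The `Interaction` class, its field names, and `to_row` are illustrative assumptions, not part of Evidently or of the original workflow.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class Interaction:
    """One question-answer pair plus the metadata needed for monitoring."""

    question: str
    response: str
    input_tokens: int
    output_tokens: int
    # Timezone-aware UTC timestamp, filled in at capture time.
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def to_row(interaction: Interaction) -> dict:
    """Flatten an interaction into a plain dict, i.e. one evaluation dataset row."""
    return asdict(interaction)
```

A list of such rows can be passed to `pandas.DataFrame(rows)` to build the dataframe that Step 3 wraps.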

Step 2: Define Evaluation Descriptors

Select and configure the descriptors that define your evaluation criteria. Evidently provides both model-based evaluators (Sentiment, NegativityLLMEval, DeclineLLMEval) and statistical descriptors (SentenceCount, TextLength).

Key considerations:

LLM-based evaluators (NegativityLLMEval, DeclineLLMEval) require an LLM API key for judge calls
Sentiment runs locally using NLTK (the VADER sentiment analyzer) and needs no API key
Each descriptor targets a specific column (typically the response column)
Descriptors can be aliased for readable output column names
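
The considerations above could be sketched as follows. This assumes the descriptor constructors of recent Evidently releases (0.7 and later), where the first argument is the target column and `alias` names the output column. The alias values, the `build_descriptors` helper, and the check for `OPENAI_API_KEY` (which presumes the judges use an OpenAI provider) are assumptions for illustration.

```python
import os

# Hypothetical alias names; each becomes an output column in the scored dataframe.
ALIASES = {
    "sentiment": "sentiment",
    "negativity": "negativity",
    "decline": "decline",
    "sentence_count": "sentence_count",
}


def build_descriptors(column: str = "response", include_llm_judges: bool = True) -> list:
    """Return descriptors that all target `column`."""
    # Imported here so this module can be loaded where Evidently is not installed.
    from evidently.descriptors import (
        DeclineLLMEval,
        NegativityLLMEval,
        SentenceCount,
        Sentiment,
    )

    descriptors = [
        Sentiment(column, alias=ALIASES["sentiment"]),
        SentenceCount(column, alias=ALIASES["sentence_count"]),
    ]
    if include_llm_judges:
        # Assumption: the judges call OpenAI, which reads OPENAI_API_KEY.
        if not os.environ.get("OPENAI_API_KEY"):
            raise RuntimeError("OPENAI_API_KEY must be set for LLM judge descriptors")
        descriptors += [
            NegativityLLMEval(column, alias=ALIASES["negativity"]),
            DeclineLLMEval(column, alias=ALIASES["decline"]),
        ]
    return descriptors
```

Calling `build_descriptors(include_llm_judges=False)` yields only the locally computed descriptors, which is useful when no judge API key is available.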

Step 3: Create Evaluated Dataset

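Wrap each LLM interaction (or batch) as an Evidently Dataset with the configured descriptors. Dataset.from_pandas applies all descriptors and adds one new column of evaluation scores per descriptor.

Sketch (assumes the descriptors configured in Step 2 and a pandas DataFrame named response_dataframe):

# Import path as of Evidently 0.7; earlier releases may differ.
from evidently import Dataset, DataDefinition

# response_dataframe has one row per interaction, including the response column.
dataset = Dataset.from_pandas(
    response_dataframe,
    data_definition=DataDefinition(),
    descriptors=[sentiment, negativity, decline, sentence_count],
)
# Descriptor outputs appear as extra columns named after each alias.
scored_dataframe = dataset.as_dataframe()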

Step 4: Extract Evaluation Scores

Access the evaluated dataframe to retrieve individual descriptor scores per response. Each descriptor produces a column (using its alias) containing the evaluation result (numeric score, category label, or boolean).

What happens:

Sentiment returns a float polarity score (-1 to 1)
LLM judges return categorical labels (e.g., "NEGATIVE"/"POSITIVE" for negativity, "DECLINE"/"OK" for decline detection)
SentenceCount returns an integer
Results are accessible via the alias column names in the dataframe
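
As an illustration of reading these columns, the sketch below normalizes one scored row into plain Python types. It assumes the row is a dict (for example one element of `scored_dataframe.to_dict("records")`); `extract_scores` and the role names in `aliases` are hypothetical.

```python
def extract_scores(row: dict, aliases: dict) -> dict:
    """Pull descriptor outputs out of one scored row and normalize their types.

    `aliases` maps a descriptor role (e.g. "sentiment") to its alias column name.
    """
    return {
        "sentiment": float(row[aliases["sentiment"]]),  # polarity score
        "negativity": str(row[aliases["negativity"]]),  # LLM judge label
        "decline": str(row[aliases["decline"]]),  # LLM judge label
        "sentence_count": int(row[aliases["sentence_count"]]),
    }
```

Casting to built-in types here keeps NumPy scalars out of later steps such as database inserts.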

Step 5: Store Results in Database

Insert the evaluation scores along with the original question, response, token counts, and timestamp into a PostgreSQL table for persistent storage and time-series analysis.

Key considerations:

Design the table schema to accommodate all descriptor output types
Include the raw question and response for debugging
Use timestamps for ordering and time-based Grafana queries
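
One possible shape for the table and the insert is sketched below. The table name `llm_evaluations`, the column names, and the `insert_evaluation` helper are assumptions; the function expects a DB-API connection such as one returned by `psycopg2.connect(...)`, whose cursors accept `%s` placeholders and work as context managers.

```python
# Hypothetical schema: one row per evaluated response.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS llm_evaluations (
    id             BIGSERIAL PRIMARY KEY,
    created_at     TIMESTAMPTZ NOT NULL,
    question       TEXT NOT NULL,
    response       TEXT NOT NULL,
    input_tokens   INTEGER,
    output_tokens  INTEGER,
    sentiment      DOUBLE PRECISION,
    negativity     TEXT,
    decline        TEXT,
    sentence_count INTEGER
)
"""

COLUMNS = (
    "created_at", "question", "response", "input_tokens", "output_tokens",
    "sentiment", "negativity", "decline", "sentence_count",
)

# Parameterized insert; values are never interpolated into the SQL string.
INSERT_SQL = "INSERT INTO llm_evaluations ({}) VALUES ({})".format(
    ", ".join(COLUMNS), ", ".join(["%s"] * len(COLUMNS))
)


def insert_evaluation(conn, record: dict) -> None:
    """Insert one evaluated interaction and commit the transaction."""
    params = tuple(record[name] for name in COLUMNS)
    with conn.cursor() as cur:
        cur.execute(INSERT_SQL, params)
    conn.commit()
```

Storing judge labels as TEXT and scores as numeric columns keeps every descriptor output type queryable, and `created_at` gives Grafana a time column to filter and order on.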

Step 6: Visualize and Monitor

Configure Grafana dashboards with panels that plot evaluation metrics over time, enabling detection of quality trends, anomalies, and threshold violations.

Key considerations:

Track sentiment trends to detect tone degradation
Monitor decline/refusal rates for service quality
Set alerts on evaluation metrics crossing thresholds
Correlate token usage with quality scores for cost-quality analysis
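
To make the monitored quantities concrete, the sketch below computes in Python the kind of hourly aggregates a dashboard panel might plot and an alert rule might check. In practice Grafana would run an equivalent query against the database; the record keys (`created_at`, `sentiment`, `decline`), the function names, and the threshold are illustrative assumptions.

```python
from collections import defaultdict


def hourly_summary(records: list, decline_label: str) -> dict:
    """Group records by hour and compute mean sentiment and decline rate."""
    buckets = defaultdict(list)
    for rec in records:
        # Truncate the timestamp to the start of its hour.
        hour = rec["created_at"].replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(rec)

    summary = {}
    for hour, recs in sorted(buckets.items()):
        n = len(recs)
        summary[hour] = {
            "mean_sentiment": sum(r["sentiment"] for r in recs) / n,
            "decline_rate": sum(r["decline"] == decline_label for r in recs) / n,
        }
    return summary


def breached(summary: dict, max_decline_rate: float) -> list:
    """Return the hours whose decline rate exceeds the threshold."""
    return [hour for hour, s in summary.items() if s["decline_rate"] > max_decline_rate]
```

Passing the judge's refusal label as `decline_label` keeps the sketch independent of the exact label strings the evaluator emits.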
