Workflow:Evidentlyai Evidently LLM Evaluation Monitoring
| Knowledge Sources | |
|---|---|
| Domains | LLM_Ops, LLM_Evaluation, Monitoring |
| Last Updated | 2026-02-14 10:00 GMT |
Overview
End-to-end process for evaluating LLM-powered system outputs using Evidently descriptors (sentiment analysis, LLM judges, text quality checks) and monitoring evaluation results over time via database storage and Grafana dashboards.
Description
This workflow covers the standard procedure for evaluating and monitoring Large Language Model outputs in production. It uses Evidently's descriptor framework to apply row-level evaluations to LLM responses, including sentiment analysis, LLM-as-a-judge evaluations (negativity, decline detection), and text statistics (sentence count). Evaluation results are computed per response, stored in PostgreSQL, and visualized in Grafana for ongoing quality monitoring.
Goal: A continuous monitoring pipeline that evaluates every LLM response against quality criteria and stores scored results for trend analysis and alerting.
Scope: From raw LLM question-answer pairs through descriptor-based evaluation to stored metrics and dashboard visualization.
Strategy: Uses row-level descriptors to evaluate each LLM response individually, combining model-based evaluators (NLTK sentiment, LLM judges) with rule-based checks, then persists results for time-series monitoring.
Usage
Execute this workflow when you operate a chatbot, question-answering system, or any LLM-powered application and need to continuously monitor the quality of generated responses. This is essential for detecting quality degradation, tracking tone shifts, identifying increases in refusal rates, and maintaining service level objectives for LLM outputs.
Execution Steps
Step 1: Collect LLM Interactions
Capture question-answer pairs from the LLM system along with metadata such as token usage (input/output tokens), timestamps, and any additional context. Each interaction becomes a row in the evaluation dataset.
Key considerations:
- Store both the input prompt and the generated response
- Capture token usage metrics for cost monitoring
- Include timestamps for temporal analysis
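The considerations above can be sketched as a small record type. This is a minimal illustration, not part of Evidently: the Interaction class, its field names, and the to_rows helper are hypothetical names chosen for this example.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class Interaction:
    """One question-answer pair plus the metadata needed downstream."""
    question: str
    response: str
    input_tokens: int
    output_tokens: int
    # Timezone-aware UTC timestamp, set when the record is created.
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def to_rows(interactions):
    """Convert records to plain dicts, one per evaluation-dataset row."""
    return [asdict(item) for item in interactions]


# Example data, invented for illustration.
rows = to_rows([
    Interaction("What is your refund policy?", "Refunds are available within 30 days.", 12, 9),
])
```

Passing rows to pandas.DataFrame would give one dataframe row per interaction, which is the shape Step 3 expects.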
Step 2: Define Evaluation Descriptors
Select and configure the descriptors that define your evaluation criteria. Evidently provides both model-based evaluators (Sentiment, NegativityLLMEval, DeclineLLMEval) and statistical descriptors (SentenceCount, TextLength).
Key considerations:
- LLM-based evaluators (NegativityLLMEval, DeclineLLMEval) require an LLM API key for judge calls
- Model-based descriptors like Sentiment run locally on NLTK models and need no API key
- Each descriptor targets a specific column (typically the response column)
- Descriptors can be aliased for readable output column names
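One way to keep this configuration in a single place is a small spec table plus a preflight check for the judge API key. The sketch below rests on several assumptions: the column name "response" and the aliases are placeholders, OPENAI_API_KEY assumes the judges use an OpenAI-backed provider, and the constructor form Descriptor(column, alias=...) reflects recent Evidently releases and should be checked against the installed version. The helper names are hypothetical.

```python
import os

# (descriptor class name, target column, alias, needs an LLM API key)
DESCRIPTOR_SPECS = [
    ("Sentiment", "response", "sentiment", False),
    ("NegativityLLMEval", "response", "negativity", True),
    ("DeclineLLMEval", "response", "decline", True),
    ("SentenceCount", "response", "sentence_count", False),
]


def llm_key_missing(specs, env=os.environ, key_name="OPENAI_API_KEY"):
    """Return True when an LLM judge is configured but no API key is set."""
    needs_key = any(needs for _, _, _, needs in specs)
    return needs_key and not env.get(key_name)


def build_descriptors(specs=DESCRIPTOR_SPECS):
    """Instantiate the Evidently descriptors named in the spec table.

    The import is deferred so this module still loads where Evidently
    is not installed; the import path is an assumption.
    """
    from evidently import descriptors
    return [
        getattr(descriptors, name)(column, alias=alias)
        for name, column, alias, _ in specs
    ]
```

Calling llm_key_missing before build_descriptors lets the pipeline fail fast at startup instead of on the first judge call.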
Step 3: Create Evaluated Dataset
Wrap each LLM interaction (or batch) in an Evidently Dataset with the configured descriptors. Dataset.from_pandas applies all descriptors when the dataset is created, adding one new column of evaluation scores per descriptor.
Pseudocode:
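# Import path used by recent Evidently releases; older releases may differ.
from evidently import Dataset, DataDefinition

# response_dataframe: pandas DataFrame of interactions from Step 1.
# sentiment, negativity, decline, sentence_count: descriptor objects from Step 2.
dataset = Dataset.from_pandas(
    response_dataframe,
    data_definition=DataDefinition(),
    descriptors=[sentiment, negativity, decline, sentence_count],
)
# Each descriptor adds one column, named by its alias, to the returned dataframe.
scored_dataframe = dataset.as_dataframe()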
Step 4: Extract Evaluation Scores
Access the evaluated dataframe to retrieve individual descriptor scores per response. Each descriptor produces a column (using its alias) containing the evaluation result (numeric score, category label, or boolean).
What happens:
- Sentiment returns a float polarity score (-1 to 1)
- LLM judges return categorical labels defined by the judge's prompt template (e.g., NEGATIVE/POSITIVE for negativity, DECLINE/OK for decline detection)
- SentenceCount returns an integer
- Results are accessible via the alias column names in the dataframe
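The mixed output types listed above can be normalized before storage. The following is a minimal sketch: the alias names match none in particular and are placeholders, the label strings are assumptions that should be checked against what the judges actually emit, and extract_scores is a hypothetical helper.

```python
def extract_scores(row, negative_label="NEGATIVE", decline_label="DECLINE"):
    """Map one scored row (alias -> value) to typed values ready for storage."""
    return {
        "sentiment": float(row["sentiment"]),
        # Compare case-insensitively so label casing differences do not matter.
        "is_negative": str(row["negativity"]).upper() == negative_label,
        "is_decline": str(row["decline"]).upper() == decline_label,
        "sentence_count": int(row["sentence_count"]),
    }


# Example row, invented for illustration.
scored_row = {"sentiment": 0.42, "negativity": "POSITIVE", "decline": "OK", "sentence_count": 2}
scores = extract_scores(scored_row)
```

With a pandas dataframe, the same function could be applied to each dict from scored_dataframe.to_dict("records").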
Step 5: Store Results in Database
Insert the evaluation scores along with the original question, response, token counts, and timestamp into a PostgreSQL table for persistent storage and time-series analysis.
Key considerations:
- Design the table schema to accommodate all descriptor output types
- Include the raw question and response for debugging
- Use timestamps for ordering and time-based Grafana queries
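A possible table layout and insert routine are sketched below. To keep the example runnable without a database server, it uses Python's built-in sqlite3 as a stand-in for PostgreSQL; the table name, column names, and store_result helper are hypothetical. On PostgreSQL, created_at would more naturally be TIMESTAMPTZ, and drivers such as psycopg use %s placeholders instead of ?.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS llm_evaluations (
    id INTEGER PRIMARY KEY,
    created_at TEXT NOT NULL,
    question TEXT NOT NULL,
    response TEXT NOT NULL,
    input_tokens INTEGER,
    output_tokens INTEGER,
    sentiment REAL,
    negativity TEXT,
    decline TEXT,
    sentence_count INTEGER
)
"""

COLUMNS = (
    "created_at", "question", "response", "input_tokens", "output_tokens",
    "sentiment", "negativity", "decline", "sentence_count",
)


def store_result(conn, record):
    """Insert one evaluated interaction using a parameterized query."""
    placeholders = ", ".join("?" for _ in COLUMNS)
    conn.execute(
        f"INSERT INTO llm_evaluations ({', '.join(COLUMNS)}) VALUES ({placeholders})",
        [record[name] for name in COLUMNS],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
# Example record, invented for illustration.
store_result(conn, {
    "created_at": "2026-02-14T10:00:00+00:00",
    "question": "What is your refund policy?",
    "response": "Refunds are available within 30 days. Contact support to start one.",
    "input_tokens": 12, "output_tokens": 14,
    "sentiment": 0.42, "negativity": "POSITIVE", "decline": "OK", "sentence_count": 2,
})
stored = conn.execute("SELECT decline, sentence_count FROM llm_evaluations").fetchall()
```

Keeping numeric scores, categorical labels, and counts in separately typed columns lets dashboard queries aggregate each one without casting.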
Step 6: Visualize and Monitor
Configure Grafana dashboards with panels that plot evaluation metrics over time, enabling detection of quality trends, anomalies, and threshold violations.
Key considerations:
- Track sentiment trends to detect tone degradation
- Monitor decline/refusal rates for service quality
- Set alerts on evaluation metrics crossing thresholds
- Correlate token usage with quality scores for cost-quality analysis
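The logic behind a refusal-rate panel and its alert can be sketched in plain Python. In practice this aggregation would usually live in the dashboard's SQL query; the function names, the DECLINE label, and the 0.2 threshold below are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime


def hourly_decline_rate(samples, decline_label="DECLINE"):
    """Return {hour_start: fraction of responses labeled as declines}."""
    totals = defaultdict(int)
    declines = defaultdict(int)
    for created_at, label in samples:
        # Truncate the timestamp to the start of its hour.
        hour = created_at.replace(minute=0, second=0, microsecond=0)
        totals[hour] += 1
        declines[hour] += label == decline_label
    return {hour: declines[hour] / totals[hour] for hour in sorted(totals)}


def breached(rates, threshold=0.2):
    """List the hours whose decline rate exceeds the alert threshold."""
    return [hour for hour, rate in rates.items() if rate > threshold]


# Example data, invented for illustration.
samples = [
    (datetime(2026, 2, 14, 10, 5), "OK"),
    (datetime(2026, 2, 14, 10, 40), "DECLINE"),
    (datetime(2026, 2, 14, 11, 15), "OK"),
    (datetime(2026, 2, 14, 11, 50), "OK"),
]
rates = hourly_decline_rate(samples)
alerts = breached(rates)
```

The same bucketing pattern applies to average sentiment or tokens per response by swapping the aggregated value.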