
Workflow:Confident AI DeepEval End-to-End LLM Evaluation

From Leeroopedia
Knowledge Sources
Domains: LLM_Evaluation, Testing, Quality_Assurance
Last Updated: 2026-02-14 09:00 GMT

Overview

End-to-end process for evaluating LLM application outputs against defined quality metrics using DeepEval's test case and evaluation framework.

Description

This workflow covers the standard procedure for black-box evaluation of LLM systems. The user treats their LLM application as a single unit, providing inputs and capturing outputs, then measuring quality using research-backed metrics such as G-Eval, Answer Relevancy, Faithfulness, and Hallucination detection. The process spans from defining evaluation criteria, through constructing test cases with input/output pairs, to running evaluations either via Pytest integration or the standalone evaluate() function. Results include per-metric scores (0-1 scale) with natural language explanations.

Usage

Execute this workflow when you have an LLM application (chatbot, RAG pipeline, or any text-generating system) and need to systematically verify the quality of its outputs. This applies when you want to measure correctness, relevancy, faithfulness, or detect hallucinations in generated text, and you do not need to evaluate individual internal components separately.

Execution Steps

Step 1: Install and Configure DeepEval

Install the DeepEval package and configure the evaluation model. By default, DeepEval uses OpenAI as the judge LLM, requiring an API key. Alternatively, configure a custom model provider (Anthropic, Azure, Bedrock, Gemini, local models, etc.) through environment variables or the CLI.

Key considerations:

  • Python 3.9+ is required
  • Set the OPENAI_API_KEY environment variable, or configure a custom judge model
  • Optionally log in to Confident AI for cloud-based reporting via deepeval login
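The setup above can be sketched as the following shell commands; the API key value is a placeholder, and `deepeval login` is only needed for Confident AI cloud reporting:

```shell
# Install DeepEval (requires Python 3.9+)
pip install -U deepeval

# Configure the default OpenAI judge model
export OPENAI_API_KEY="sk-..."   # replace with your own key

# Optional: enable cloud-based reporting on Confident AI
deepeval login
```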

Step 2: Define Evaluation Metrics

Select and configure the metrics appropriate for your evaluation goals. Each metric accepts a threshold (0-1) that determines pass/fail, and optional configuration such as custom criteria for G-Eval or specific evaluation parameters.

Available metric categories:

  • RAG metrics: AnswerRelevancyMetric, FaithfulnessMetric, ContextualPrecisionMetric, ContextualRecallMetric, ContextualRelevancyMetric
  • Safety metrics: BiasMetric, ToxicityMetric, PIILeakageMetric
  • General metrics: GEval (custom criteria), HallucinationMetric, SummarizationMetric
  • Structural metrics: ExactMatchMetric, PatternMatchMetric, JsonCorrectnessMetric

Step 3: Construct Test Cases

Create LLMTestCase objects that capture the input, actual output from your LLM application, and any additional context needed by your chosen metrics. For RAG evaluations, include retrieval_context. For correctness checks, include expected_output.

What goes into a test case:

  • input: The user query or prompt sent to the LLM (required)
  • actual_output: The LLM's response (required)
  • expected_output: The ideal response (needed for correctness metrics)
  • retrieval_context: Retrieved documents (needed for RAG metrics)
  • context: Ground truth context (needed for faithfulness/hallucination)

Step 4: Run Evaluation

Execute the evaluation using one of two approaches: the Pytest-based assert_test() function for CI/CD integration, or the standalone evaluate() function for notebook and script environments. Both approaches score each test case against all specified metrics and produce detailed results.

Execution modes:

  • Pytest mode: Use assert_test(test_case, metrics) inside test functions, run with deepeval test run
  • Standalone mode: Use evaluate(test_cases, metrics) directly in scripts or notebooks
  • Dataset mode: Use EvaluationDataset to batch-evaluate multiple test cases

Step 5: Analyze Results

Review evaluation results, which include per-metric scores, pass/fail status based on thresholds, and natural language explanations (reasons) for each score. Results can be viewed locally in the console or on the Confident AI platform for cloud-based reporting and comparison across iterations.

Result components:

  • Metric score (0 to 1)
  • Pass/fail status relative to threshold
  • Natural language reason explaining the score
  • Optional cloud dashboard link when logged in to Confident AI

Execution Diagram

GitHub URL

Workflow Repository