Workflow: Confident AI DeepEval End-to-End LLM Evaluation
| Knowledge Sources | Details |
|---|---|
| Domains | LLM_Evaluation, Testing, Quality_Assurance |
| Last Updated | 2026-02-14 09:00 GMT |
Overview
End-to-end process for evaluating LLM application outputs against defined quality metrics using DeepEval's test case and evaluation framework.
Description
This workflow covers the standard procedure for black-box evaluation of LLM systems. The user treats their LLM application as a single unit, providing inputs and capturing outputs, then measuring quality using research-backed metrics such as G-Eval, Answer Relevancy, Faithfulness, and Hallucination detection. The process spans from defining evaluation criteria, through constructing test cases with input/output pairs, to running evaluations either via Pytest integration or the standalone evaluate() function. Results include per-metric scores (0-1 scale) with natural language explanations.
Usage
Execute this workflow when you have an LLM application (chatbot, RAG pipeline, or any text-generating system) and need to systematically verify the quality of its outputs. This applies when you want to measure correctness, relevancy, faithfulness, or detect hallucinations in generated text, and you do not need to evaluate individual internal components separately.
Execution Steps
Step 1: Install and Configure DeepEval
Install the DeepEval package and configure the evaluation model. By default, DeepEval uses OpenAI as the judge LLM, requiring an API key. Alternatively, configure a custom model provider (Anthropic, Azure, Bedrock, Gemini, local models, etc.) through environment variables or the CLI.
Key considerations:
- Python 3.9+ is required
- Set the OPENAI_API_KEY environment variable, or configure a custom judge model
- Optionally log in to Confident AI for cloud-based reporting via deepeval login
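The setup step above can be sketched in a few lines. This is a minimal illustration, assuming the package has already been installed with pip and that the placeholder key below is replaced with a real one; the cloud login happens in the terminal, not in Python.

```python
# Minimal setup sketch. Assumes `pip install deepeval` has already been run
# (requires Python 3.9+). The key value below is a placeholder, not a real key.
import os

# DeepEval's default judge is an OpenAI model, configured via this variable.
# setdefault avoids clobbering a key that is already set in the environment.
os.environ.setdefault("OPENAI_API_KEY", "sk-...")

# Optional cloud-based reporting is enabled from the CLI, not from Python:
#   deepeval login
```

If you use a custom judge model instead of OpenAI, this variable is not needed; configure the provider through DeepEval's CLI or environment variables as described above.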
Step 2: Define Evaluation Metrics
Select and configure the metrics appropriate for your evaluation goals. Each metric accepts a threshold (0-1) that determines pass/fail, and optional configuration such as custom criteria for G-Eval or specific evaluation parameters.
Available metric categories:
- RAG metrics: AnswerRelevancyMetric, FaithfulnessMetric, ContextualPrecisionMetric, ContextualRecallMetric, ContextualRelevancyMetric
- Safety metrics: BiasMetric, ToxicityMetric, PIILeakageMetric
- General metrics: GEval (custom criteria), HallucinationMetric, SummarizationMetric
- Structural metrics: ExactMatchMetric, PatternMatchMetric, JsonCorrectnessMetric
Step 3: Construct Test Cases
Create LLMTestCase objects that capture the input, actual output from your LLM application, and any additional context needed by your chosen metrics. For RAG evaluations, include retrieval_context. For correctness checks, include expected_output.
What goes into a test case:
- input: The user query or prompt sent to the LLM
- actual_output: The LLM's response (required)
- expected_output: The ideal response (needed for correctness metrics)
- retrieval_context: Retrieved documents (needed for RAG metrics)
- context: Ground truth context (needed for HallucinationMetric; FaithfulnessMetric uses retrieval_context instead)
Step 4: Run Evaluation
Execute the evaluation using one of two approaches: the Pytest-based assert_test() function for CI/CD integration, or the standalone evaluate() function for notebook and script environments. Both approaches score each test case against all specified metrics and produce detailed results.
Execution modes:
- Pytest mode: Use assert_test(test_case, metrics) inside test functions, run with deepeval test run
- Standalone mode: Use evaluate(test_cases, metrics) directly in scripts or notebooks
- Dataset mode: Use EvaluationDataset to batch-evaluate multiple test cases
Step 5: Analyze Results
Review evaluation results, which include per-metric scores, pass/fail status based on thresholds, and natural language explanations (reasons) for each score. Results can be viewed locally in the console or on the Confident AI platform for cloud-based reporting and comparison across iterations.
Result components:
- Metric score (0 to 1)
- Pass/fail status relative to threshold
- Natural language reason explaining the score
- Optional cloud dashboard link when logged in to Confident AI