Workflow: Promptfoo LLM Evaluation
| Knowledge Sources | |
|---|---|
| Domains | LLM_Ops, Testing, Quality_Assurance |
| Last Updated | 2026-02-14 08:00 GMT |
Overview
End-to-end process for evaluating LLM application quality by running prompts against multiple providers, applying assertions, and producing comparative results.
Description
This workflow is the core "Golden Path" of Promptfoo. It orchestrates the complete cycle of loading a YAML configuration file, resolving prompt templates and LLM providers, executing test cases concurrently against each provider, grading responses with deterministic and model-assisted assertions, and producing output in multiple formats (JSON, HTML, CSV). The evaluation engine supports comparing multiple models side-by-side, caching API responses, rate limiting, and progressive result streaming via the web UI.
Usage
Execute this workflow when you need to:
- Compare output quality across multiple LLM providers (e.g., GPT-5 vs Claude vs Gemini)
- Validate that prompt changes maintain or improve response quality
- Run regression tests against a set of known-good expected outputs
- Measure factuality, relevance, toxicity, or custom metrics across test cases
- Generate comparative reports for stakeholder review
Input state: A YAML configuration file (promptfooconfig.yaml) containing prompts, providers, and test cases.
Output state: Evaluation results with per-test scores, assertion pass/fail status, and aggregate metrics available in the web UI or exported files.
Execution Steps
Step 1: Configuration Loading
Parse the YAML configuration file and resolve all references. The loader performs a two-pass rendering using Nunjucks templates: first rendering environment variables, then rendering the full configuration. JSON Schema references ($ref) are dereferenced to support modular configs split across multiple files. File-based prompts, test cases, and provider configs are resolved from their paths.
Key considerations:
- Configuration supports YAML, JSON, and TypeScript formats
- Environment variables can be injected via env block or .env file
- Test cases can be loaded from files, directories, CSV, JSONL, or HuggingFace datasets
- Prompts support Nunjucks templating with variable placeholders (e.g., {{query}})
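The pieces above come together in a single configuration file. A minimal sketch of a promptfooconfig.yaml (prompt text, file paths, and the test query are illustrative placeholders):

```yaml
# promptfooconfig.yaml -- minimal sketch; names and paths are placeholders
description: Customer support prompt comparison

prompts:
  - "Answer the customer's question: {{query}}"
  - file://prompts/detailed_prompt.txt   # file-based prompt resolved at load time

providers:
  - openai:gpt-5
  - anthropic:messages:claude-sonnet-4-20250514

tests:
  - file://tests/regression_cases.csv    # test cases loaded from a CSV file
  - vars:
      query: "How do I reset my password?"
```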
Step 2: Provider Resolution
Instantiate the LLM providers specified in the configuration. Each provider string (e.g., openai:gpt-5, anthropic:messages:claude-sonnet-4-20250514) is resolved through a registry that maps prefixes to provider classes. Cloud-hosted provider configurations can be fetched from the Promptfoo database. Environment variables for API keys are merged from multiple sources: base config, provider-specific overrides, and system environment.
Key considerations:
- Over 70 built-in providers covering OpenAI, Anthropic, Google, AWS Bedrock, Azure, Ollama, and more
- Custom providers can be implemented in JavaScript, TypeScript, Python, Ruby, or Go
- Providers implement a common ApiProvider interface for uniform execution
- Provider-level configuration (temperature, max tokens, etc.) can be set per-provider
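In expanded form, each provider entry can carry its own configuration block. A sketch of per-provider settings (exact option keys vary by provider; the custom provider path is a hypothetical example):

```yaml
providers:
  - id: openai:gpt-5
    config:
      temperature: 0.2
      max_tokens: 512
  - id: anthropic:messages:claude-sonnet-4-20250514
    config:
      temperature: 0.2
      max_tokens: 512
  # Custom provider implementing the ApiProvider interface (hypothetical path)
  - id: file://providers/custom_provider.py
```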
Step 3: Test Suite Construction
Build the complete test matrix by combining prompts, providers, and test cases. Each prompt is processed through the prompt loader, which handles inline strings, file references, and function-based prompts. The evaluator then takes the Cartesian product: every prompt is run against every provider with every test case, yielding the full evaluation matrix.
Key considerations:
- Default test assertions can be applied globally across all test cases
- Test cases support variable substitution with typed values (strings, objects, arrays)
- Scenarios can group related tests with shared variables
- Filters can limit execution to specific prompts or providers
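These constructs can be combined in one config. A hedged sketch of global defaults, typed variables, and scenario grouping (assertion thresholds and variable values are illustrative):

```yaml
# Applied to every test case in the suite
defaultTest:
  assert:
    - type: latency
      threshold: 5000   # milliseconds

tests:
  - vars:
      query: "Summarize our refund policy"
      audience:                 # typed values: objects and arrays are allowed
        segment: enterprise

# Group related tests that share variables
scenarios:
  - config:
      - vars:
          language: French
    tests:
      - vars:
          query: "Hello"
```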
Step 4: Evaluation Execution
Run each test case through the evaluation engine concurrently. For each prompt-provider-test combination, the engine renders the prompt template with test variables, sends the request to the provider, and collects the response. Concurrency is controlled by maxConcurrency settings. Results are cached by default to avoid redundant API calls. A progress bar tracks completion in the terminal.
Key considerations:
- Rate limiting is handled automatically with configurable retry logic
- Caching can be disabled with --no-cache for fresh results
- Transform functions can modify outputs before assertion evaluation
- Delay settings can throttle requests to respect API rate limits
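Concurrency and throttling can be set alongside the rest of the config. A sketch assuming an evaluateOptions block (values are illustrative; caching is toggled at the CLI with --no-cache as noted above):

```yaml
evaluateOptions:
  maxConcurrency: 4   # parallel requests in flight
  delay: 500          # milliseconds between requests, to respect rate limits
```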
Step 5: Assertion Grading
Evaluate each response against the defined assertions. Assertions come in three categories: deterministic (contains, equals, regex, JSON schema), model-graded (LLM-rubric, factuality, relevance, faithfulness), and code-based (JavaScript, Python, Ruby functions). Each assertion produces a pass/fail result with an optional score. Multiple assertions on a single test case are combined using configurable logic (AND by default).
Key considerations:
- Model-graded assertions use a separate LLM (configurable) to judge response quality
- Named metrics aggregate scores across test cases for trend analysis
- Threshold-based assertions support numeric scoring (0-1 scale)
- Custom assertion functions receive the full context (output, prompt, variables, provider)
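The three assertion categories can be mixed on a single test case. A sketch showing one of each (the query, expected value, and rubric text are placeholders):

```yaml
tests:
  - vars:
      query: "What is your refund window?"
    assert:
      # Deterministic: substring match on the raw output
      - type: contains
        value: "30 days"
      # Model-graded: a separate judge LLM scores the output against this rubric
      - type: llm-rubric
        value: "Response is polite and accurately cites the refund policy"
      # Code-based: inline JavaScript expression evaluated as pass/fail
      - type: javascript
        value: "output.length < 1000"
```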
Step 6: Result Output
Aggregate all evaluation results and produce output in the requested format. Results include per-test scores, assertion details, token usage, latency metrics, and cost estimates. Output formats include JSON, YAML, CSV, HTML reports, and the interactive web UI. Results are stored in the local SQLite database for historical comparison and can be shared via generated URLs.
Key considerations:
- The web UI provides an interactive matrix view comparing all prompt-provider combinations
- Results can be exported to multiple formats simultaneously
- Historical results are preserved in the database for regression tracking
- Share URLs allow team collaboration on evaluation results
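File export can also be requested from the config itself. A sketch assuming the outputPath field, where the file extension selects the format (the path is a placeholder):

```yaml
# Write results to a file in addition to the local SQLite database
outputPath: results/eval-report.html
```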