Workflow: Promptfoo LLM Evaluation
| Knowledge Sources | |
|---|---|
| Domains | LLM_Ops, Testing, Quality_Assurance |
| Last Updated | 2026-02-14 08:00 GMT |
Overview
End-to-end process for evaluating LLM application quality by running prompts against multiple providers, applying assertions, and producing comparative results.
Description
This workflow is the core "Golden Path" of Promptfoo. It orchestrates the complete cycle of loading a YAML configuration file, resolving prompt templates and LLM providers, executing test cases concurrently against each provider, grading responses with deterministic and model-assisted assertions, and producing output in multiple formats (JSON, HTML, CSV). The evaluation engine supports comparing multiple models side-by-side, caching API responses, rate limiting, and progressive result streaming via the web UI.
Usage
Execute this workflow when you need to:
- Compare output quality across multiple LLM providers (e.g., GPT-5 vs Claude vs Gemini)
- Validate that prompt changes maintain or improve response quality
- Run regression tests against a set of known-good expected outputs
- Measure factuality, relevance, toxicity, or custom metrics across test cases
- Generate comparative reports for stakeholder review
Input state: A YAML configuration file (promptfooconfig.yaml) containing prompts, providers, and test cases.
Output state: Evaluation results with per-test scores, assertion pass/fail status, and aggregate metrics available in the web UI or exported files.
Execution Steps
Step 1: Configuration Loading
Parse the YAML configuration file and resolve all references. The loader performs a two-pass rendering using Nunjucks templates: first rendering environment variables, then rendering the full configuration. JSON Schema references ($ref) are dereferenced to support modular configs split across multiple files. File-based prompts, test cases, and provider configs are resolved from their paths.
Key considerations:
- Configuration supports YAML, JSON, and TypeScript formats
- Environment variables can be injected via env block or .env file
- Test cases can be loaded from files, directories, CSV, JSONL, or HuggingFace datasets
- Prompts support Nunjucks templating with variable placeholders (e.g., {{query}})
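The pieces above come together in a single configuration file. A minimal sketch of a promptfooconfig.yaml (prompt text, file paths, and the test query are illustrative placeholders):

```yaml
# promptfooconfig.yaml -- minimal sketch; names and paths are placeholders
description: Customer support prompt comparison

prompts:
  - "Answer the customer's question: {{query}}"
  - file://prompts/detailed_prompt.txt   # file-based prompt resolved at load time

providers:
  - openai:gpt-5
  - anthropic:messages:claude-sonnet-4-20250514

tests:
  - file://tests/regression_cases.csv    # test cases loaded from a CSV file
  - vars:
      query: "How do I reset my password?"
```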
Step 2: Provider Resolution
Instantiate the LLM providers specified in the configuration. Each provider string (e.g., openai:gpt-5, anthropic:messages:claude-sonnet-4-20250514) is resolved through a registry that maps prefixes to provider classes. Cloud-hosted provider configurations can be fetched from the Promptfoo database. Environment variables for API keys are merged from multiple sources: base config, provider-specific overrides, and system environment.
Key considerations:
- Over 70 built-in providers covering OpenAI, Anthropic, Google, AWS Bedrock, Azure, Ollama, and more
- Custom providers can be implemented in JavaScript, TypeScript, Python, Ruby, or Go
- Providers implement a common ApiProvider interface for uniform execution
- Provider-level configuration (temperature, max tokens, etc.) can be set per-provider
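In expanded form, each provider entry can carry its own configuration block. A sketch of per-provider settings (exact option keys vary by provider; the custom provider path is a hypothetical example):

```yaml
providers:
  - id: openai:gpt-5
    config:
      temperature: 0.2
      max_tokens: 512
  - id: anthropic:messages:claude-sonnet-4-20250514
    config:
      temperature: 0.2
      max_tokens: 512
  # Custom provider implementing the ApiProvider interface (hypothetical path)
  - id: file://providers/custom_provider.py
```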
Step 3: Test Suite Construction
Build the complete test matrix by combining prompts, providers, and test cases. Each prompt is processed through the prompt loader, which handles inline strings, file references, and function-based prompts. The evaluator then takes the Cartesian product: every prompt is run against every provider with every test case, yielding the full evaluation matrix.
Key considerations:
- Default test assertions can be applied globally across all test cases
- Test cases support variable substitution with typed values (strings, objects, arrays)
- Scenarios can group related tests with shared variables
- Filters can limit execution to specific prompts or providers
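These constructs can be combined in one config. A hedged sketch of global defaults, typed variables, and scenario grouping (assertion thresholds and variable values are illustrative):

```yaml
# Applied to every test case in the suite
defaultTest:
  assert:
    - type: latency
      threshold: 5000   # milliseconds

tests:
  - vars:
      query: "Summarize our refund policy"
      audience:                 # typed values: objects and arrays are allowed
        segment: enterprise

# Group related tests that share variables
scenarios:
  - config:
      - vars:
          language: French
    tests:
      - vars:
          query: "Hello"
```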
Step 4: Evaluation Execution
Run each test case through the evaluation engine concurrently. For each prompt-provider-test combination, the engine renders the prompt template with test variables, sends the request to the provider, and collects the response. Concurrency is controlled by maxConcurrency settings. Results are cached by default to avoid redundant API calls. A progress bar tracks completion in the terminal.
Key considerations:
- Rate limiting is handled automatically with configurable retry logic
- Caching can be disabled with --no-cache for fresh results
- Transform functions can modify outputs before assertion evaluation
- Delay settings can throttle requests to respect API rate limits
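Concurrency and throttling can be set alongside the rest of the config. A sketch assuming an evaluateOptions block (values are illustrative; caching is toggled at the CLI with --no-cache as noted above):

```yaml
evaluateOptions:
  maxConcurrency: 4   # parallel requests in flight
  delay: 500          # milliseconds between requests, to respect rate limits
```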
Step 5: Assertion Grading
Evaluate each response against the defined assertions. Assertions come in three categories: deterministic (contains, equals, regex, JSON schema), model-graded (LLM-rubric, factuality, relevance, faithfulness), and code-based (JavaScript, Python, Ruby functions). Each assertion produces a pass/fail result with an optional score. Multiple assertions on a single test case are combined using configurable logic (AND by default).
Key considerations:
- Model-graded assertions use a separate LLM (configurable) to judge response quality
- Named metrics aggregate scores across test cases for trend analysis
- Threshold-based assertions support numeric scoring (0-1 scale)
- Custom assertion functions receive the full context (output, prompt, variables, provider)
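The three assertion categories can be mixed on a single test case. A sketch showing one of each (the query, expected value, and rubric text are placeholders):

```yaml
tests:
  - vars:
      query: "What is your refund window?"
    assert:
      # Deterministic: substring match on the raw output
      - type: contains
        value: "30 days"
      # Model-graded: a separate judge LLM scores the output against this rubric
      - type: llm-rubric
        value: "Response is polite and accurately cites the refund policy"
      # Code-based: inline JavaScript expression evaluated as pass/fail
      - type: javascript
        value: "output.length < 1000"
```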
Step 6: Result Output
Aggregate all evaluation results and produce output in the requested format. Results include per-test scores, assertion details, token usage, latency metrics, and cost estimates. Output formats include JSON, YAML, CSV, HTML reports, and the interactive web UI. Results are stored in the local SQLite database for historical comparison and can be shared via generated URLs.
Key considerations:
- The web UI provides an interactive matrix view comparing all prompt-provider combinations
- Results can be exported to multiple formats simultaneously
- Historical results are preserved in the database for regression tracking
- Share URLs allow team collaboration on evaluation results
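File export can also be requested from the config itself. A sketch assuming the outputPath field, where the file extension selects the format (the path is a placeholder):

```yaml
# Write results to a file in addition to the local SQLite database
outputPath: results/eval-report.html
```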