Workflow: OpenAI Evals - Implementing a custom completion function
| Knowledge Sources | |
|---|---|
| Domains | LLM_Evaluation, Model_Integration, Software_Engineering |
| Last Updated | 2026-02-14 10:00 GMT |
Overview
End-to-end process for implementing, registering, and using a custom completion function to evaluate non-standard model configurations or augmented LLM pipelines.
Description
This workflow covers the creation of custom CompletionFn implementations that extend the evals framework beyond direct OpenAI model calls. Completion functions generalize the concept of model completions: they accept a prompt and return text output, but can perform arbitrary processing under the hood. This enables evaluation of RAG (retrieval-augmented generation) pipelines, chain-of-thought wrappers, LangChain integrations, multi-model ensembles, or any custom inference setup. The framework also supports a newer Solver abstraction for stateful, multi-turn evaluation scenarios with tool use.
Usage
Execute this workflow when you need to evaluate a model setup that goes beyond simple API calls, such as a retrieval-augmented system, a LangChain chain, a custom prompt engineering pipeline, or a model hosted on a non-OpenAI platform. This is also the path for integrating third-party model providers (Anthropic, Google Gemini, Together AI) or custom inference servers.
Execution Steps
Step 1: Understand the CompletionFn Protocol
Review the CompletionFn protocol defined in evals/api.py. A completion function is a callable that accepts a prompt (either a string or a list of chat message dicts) plus optional keyword arguments, and returns a CompletionResult object whose get_completions method returns a list of response strings. This interface is the contract your implementation must fulfill to remain compatible with all evals.
Key considerations:
- The CompletionFn protocol uses Python's Protocol (structural subtyping)
- Input is either a raw string prompt or chat-format message list
- Output must be a CompletionResult with a get_completions() -> list[str] method
- The framework handles prompt format conversion between chat and string formats
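The contract described above can be sketched as a pair of Python types. This is a simplified rendering for orientation, not a copy of evals/api.py; consult the actual source for the authoritative definitions.

```python
from typing import Any, Protocol, Union, runtime_checkable

# Chat-format prompts are lists of {"role": ..., "content": ...} dicts.
ChatPrompt = list[dict[str, str]]


class CompletionResult:
    """Wraps a model response; the framework only calls get_completions()."""

    def get_completions(self) -> list[str]:
        raise NotImplementedError


@runtime_checkable
class CompletionFn(Protocol):
    """Structural type: any callable matching this signature qualifies."""

    def __call__(
        self, prompt: Union[str, ChatPrompt], **kwargs: Any
    ) -> CompletionResult:
        ...
```

Because CompletionFn is a Protocol, your class never needs to inherit from it; matching the `__call__` signature is sufficient.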
Step 2: Implement the CompletionFn Class
Create a Python class that implements the CompletionFn protocol. The __call__ method should accept a prompt and return a CompletionResult. For simple wrappers, extend an existing class; for complex pipelines, implement from scratch, along with a corresponding CompletionResult subclass. For stateful or multi-turn scenarios, consider implementing the newer Solver interface instead, which accepts a TaskState and returns a SolverResult.
Key considerations:
- Place implementation files in evals/completion_fns/ or your own project directory
- Existing examples include OpenAI chat/completion, LangChain LLM, RAG retrieval, and CoT wrappers
- The Solver interface supports multi-turn conversations, tool use, and persistent memory
- Bridge utilities in evals/solvers/utils.py convert between CompletionFn and Solver
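A minimal implementation might look like the following sketch. The class and parameter names here are illustrative, not part of the evals framework; the point is the shape: a CompletionResult subclass holding the response, and a __call__ that normalizes both prompt formats before invoking your pipeline.

```python
from typing import Any, Callable, Union


class PipelineResult:
    """Minimal CompletionResult: holds a single response string."""

    def __init__(self, response: str) -> None:
        self.response = response

    def get_completions(self) -> list[str]:
        return [self.response]


class PipelineCompletionFn:
    """Wraps an arbitrary text-in/text-out pipeline as a CompletionFn.

    `pipeline` is any callable taking the rendered prompt string: a
    RAG chain, a LangChain runnable, or a call to a non-OpenAI server.
    """

    def __init__(self, pipeline: Callable[[str], str]) -> None:
        self.pipeline = pipeline

    def __call__(
        self, prompt: Union[str, list[dict[str, str]]], **kwargs: Any
    ) -> PipelineResult:
        # Normalize chat-format prompts to a single string for the pipeline.
        if isinstance(prompt, list):
            prompt = "\n".join(f"{m['role']}: {m['content']}" for m in prompt)
        return PipelineResult(self.pipeline(prompt))
```

A real implementation would typically accept its registry-configured arguments (model name, retrieval parameters, and so on) as constructor keyword arguments.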
Step 3: Register in the YAML Registry
Create a YAML file in evals/registry/completion_fns/ (or evals/registry/solvers/ for Solver implementations) that maps a name to the class path and constructor arguments. The name becomes the identifier used with the oaieval CLI. For external projects, use the --registry_path flag to point to your project's registry directory.
Key considerations:
- The YAML key is the name used with oaieval (e.g., langchain/llm/flan-t5-xl)
- The "class" field uses dotted module path with colon separator (module.path:ClassName)
- The "args" field passes keyword arguments to the constructor
- External registration works via --registry_path without modifying the evals repo
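A registry entry follows this shape; the file path, name, class, and args below are hypothetical placeholders for your own project's values.

```yaml
# Hypothetical file: evals/registry/completion_fns/my_pipeline.yaml
my-pipeline/flan-t5-xl:
  class: my_project.completion_fns.pipeline:MyPipelineCompletionFn
  args:
    model_name: flan-t5-xl
    num_retrieval_docs: 4
```

The top-level key (`my-pipeline/flan-t5-xl`) is what you pass to oaieval, and everything under `args` is forwarded to the constructor as keyword arguments.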
Step 4: Test Against Existing Evals
Run the custom completion function against existing evals to validate it works correctly. Start with simple test evals (test-match, test-basic) to verify the interface contract, then run domain-specific evals. Check that the completion function handles both chat and string prompt formats correctly.
Key considerations:
- test-match and test-basic are lightweight evals suitable for integration testing
- Verify that the completion function returns valid CompletionResult objects
- Check edge cases: empty prompts, very long prompts, special characters
- For Solver implementations, test with SolverEval-based evals
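Before spending compute on full eval runs, a quick local contract check can catch format-handling bugs. The helper below is an illustrative sketch (not part of the evals framework) that exercises both supported prompt formats against any CompletionFn-shaped callable:

```python
def smoke_test(completion_fn) -> None:
    """Sanity-check a CompletionFn against both supported prompt formats."""
    string_prompt = "Say hello."
    chat_prompt = [
        {"role": "system", "content": "You are concise."},
        {"role": "user", "content": "Say hello."},
    ]
    for prompt in (string_prompt, chat_prompt):
        result = completion_fn(prompt)
        completions = result.get_completions()
        # The contract: a non-empty list of strings.
        assert isinstance(completions, list) and completions
        assert all(isinstance(c, str) for c in completions)


# Trivial echo function standing in for a real implementation.
class _EchoResult:
    def __init__(self, text: str) -> None:
        self.text = text

    def get_completions(self) -> list[str]:
        return [self.text]


def echo_fn(prompt, **kwargs):
    text = prompt if isinstance(prompt, str) else prompt[-1]["content"]
    return _EchoResult(text)


smoke_test(echo_fn)
```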
Step 5: Run Production Evals
Once validated, use the custom completion function in production eval runs. Compare results against baseline models to assess the impact of the custom pipeline (e.g., does RAG improve accuracy on knowledge-intensive evals? Does CoT improve reasoning evals?).
Key considerations:
- Multiple completion functions can be comma-separated in a single oaieval run
- Use --completion_args to pass additional runtime parameters
- Results include the same metrics and logging as standard model evals
- Compare with direct model baselines to quantify pipeline improvements
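Putting the pieces together, a comparison run might look like the command below. The registry name, eval name, and paths are hypothetical; the flags are those discussed in this workflow.

```shell
# Illustrative: evaluate a custom pipeline and a baseline model in one run.
oaieval my-pipeline/flan-t5-xl,gpt-4o-mini my-knowledge-eval \
  --registry_path ./my_project/registry \
  --completion_args temperature=0.0
```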