Workflow: OpenAI Evals - Implementing a custom completion function
| Knowledge Sources | |
|---|---|
| Domains | LLM_Evaluation, Model_Integration, Software_Engineering |
| Last Updated | 2026-02-14 10:00 GMT |
Overview
End-to-end process for implementing, registering, and using a custom completion function to evaluate non-standard model configurations or augmented LLM pipelines.
Description
This workflow covers the creation of custom CompletionFn implementations that extend the evals framework beyond direct OpenAI model calls. Completion functions generalize the concept of model completions: they accept a prompt and return text output, but can perform arbitrary processing under the hood. This enables evaluation of RAG (retrieval-augmented generation) pipelines, chain-of-thought wrappers, LangChain integrations, multi-model ensembles, or any custom inference setup. The framework also supports a newer Solver abstraction for stateful, multi-turn evaluation scenarios with tool use.
Usage
Execute this workflow when you need to evaluate a model setup that goes beyond simple API calls, such as a retrieval-augmented system, a LangChain chain, a custom prompt engineering pipeline, or a model hosted on a non-OpenAI platform. This is also the path for integrating third-party model providers (Anthropic, Google Gemini, Together AI) or custom inference servers.
Execution Steps
Step 1: Understand the CompletionFn Protocol
Review the CompletionFn protocol defined in evals/api.py. A completion function is a callable that accepts a prompt (either a string or a list of chat message dicts) plus optional keyword arguments, and returns a CompletionResult object whose get_completions method returns a list of response strings. This interface is the contract your implementation must fulfill to remain compatible with all evals.
Key considerations:
- The CompletionFn protocol uses Python's Protocol (structural subtyping)
- Input is either a raw string prompt or chat-format message list
- Output must be a CompletionResult with a get_completions() -> list[str] method
- The framework handles prompt format conversion between chat and string formats
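The contract described above can be sketched as a pair of Python types. This is a simplified rendering for orientation, not a copy of evals/api.py; consult the actual source for the authoritative definitions.

```python
from typing import Any, Protocol, Union, runtime_checkable

# Chat-format prompts are lists of {"role": ..., "content": ...} dicts.
ChatPrompt = list[dict[str, str]]


class CompletionResult:
    """Wraps a model response; the framework only calls get_completions()."""

    def get_completions(self) -> list[str]:
        raise NotImplementedError


@runtime_checkable
class CompletionFn(Protocol):
    """Structural type: any callable matching this signature qualifies."""

    def __call__(
        self, prompt: Union[str, ChatPrompt], **kwargs: Any
    ) -> CompletionResult:
        ...
```

Because CompletionFn is a Protocol, your class never needs to inherit from it; matching the `__call__` signature is sufficient.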
Step 2: Implement the CompletionFn Class
Create a Python class that implements the CompletionFn protocol. The __call__ method should accept a prompt and return a CompletionResult. For simple wrappers, extend an existing class; for complex pipelines, implement from scratch, along with a corresponding CompletionResult subclass. For stateful or multi-turn scenarios, consider implementing the newer Solver interface instead, which accepts a TaskState and returns a SolverResult.
Key considerations:
- Place implementation files in evals/completion_fns/ or your own project directory
- Existing examples include OpenAI chat/completion, LangChain LLM, RAG retrieval, and CoT wrappers
- The Solver interface supports multi-turn conversations, tool use, and persistent memory
- Bridge utilities in evals/solvers/utils.py convert between CompletionFn and Solver
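A minimal implementation might look like the following sketch. The class and parameter names here are illustrative, not part of the evals framework; the point is the shape: a CompletionResult subclass holding the response, and a __call__ that normalizes both prompt formats before invoking your pipeline.

```python
from typing import Any, Callable, Union


class PipelineResult:
    """Minimal CompletionResult: holds a single response string."""

    def __init__(self, response: str) -> None:
        self.response = response

    def get_completions(self) -> list[str]:
        return [self.response]


class PipelineCompletionFn:
    """Wraps an arbitrary text-in/text-out pipeline as a CompletionFn.

    `pipeline` is any callable taking the rendered prompt string: a
    RAG chain, a LangChain runnable, or a call to a non-OpenAI server.
    """

    def __init__(self, pipeline: Callable[[str], str]) -> None:
        self.pipeline = pipeline

    def __call__(
        self, prompt: Union[str, list[dict[str, str]]], **kwargs: Any
    ) -> PipelineResult:
        # Normalize chat-format prompts to a single string for the pipeline.
        if isinstance(prompt, list):
            prompt = "\n".join(f"{m['role']}: {m['content']}" for m in prompt)
        return PipelineResult(self.pipeline(prompt))
```

A real implementation would typically accept its registry-configured arguments (model name, retrieval parameters, and so on) as constructor keyword arguments.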
Step 3: Register in the YAML Registry
Create a YAML file in evals/registry/completion_fns/ (or evals/registry/solvers/ for Solver implementations) that maps a name to the class path and constructor arguments. The name becomes the identifier used with the oaieval CLI. For external projects, use the --registry_path flag to point to your project's registry directory.
Key considerations:
- The YAML key is the name used with oaieval (e.g., langchain/llm/flan-t5-xl)
- The "class" field uses dotted module path with colon separator (module.path:ClassName)
- The "args" field passes keyword arguments to the constructor
- External registration works via --registry_path without modifying the evals repo
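A registry entry follows this shape; the file path, name, class, and args below are hypothetical placeholders for your own project's values.

```yaml
# Hypothetical file: evals/registry/completion_fns/my_pipeline.yaml
my-pipeline/flan-t5-xl:
  class: my_project.completion_fns.pipeline:MyPipelineCompletionFn
  args:
    model_name: flan-t5-xl
    num_retrieval_docs: 4
```

The top-level key (`my-pipeline/flan-t5-xl`) is what you pass to oaieval, and everything under `args` is forwarded to the constructor as keyword arguments.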
Step 4: Test Against Existing Evals
Run the custom completion function against existing evals to validate it works correctly. Start with simple test evals (test-match, test-basic) to verify the interface contract, then run domain-specific evals. Check that the completion function handles both chat and string prompt formats correctly.
Key considerations:
- test-match and test-basic are lightweight evals suitable for integration testing
- Verify that the completion function returns valid CompletionResult objects
- Check edge cases: empty prompts, very long prompts, special characters
- For Solver implementations, test with SolverEval-based evals
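Before spending compute on full eval runs, a quick local contract check can catch format-handling bugs. The helper below is an illustrative sketch (not part of the evals framework) that exercises both supported prompt formats against any CompletionFn-shaped callable:

```python
def smoke_test(completion_fn) -> None:
    """Sanity-check a CompletionFn against both supported prompt formats."""
    string_prompt = "Say hello."
    chat_prompt = [
        {"role": "system", "content": "You are concise."},
        {"role": "user", "content": "Say hello."},
    ]
    for prompt in (string_prompt, chat_prompt):
        result = completion_fn(prompt)
        completions = result.get_completions()
        # The contract: a non-empty list of strings.
        assert isinstance(completions, list) and completions
        assert all(isinstance(c, str) for c in completions)


# Trivial echo function standing in for a real implementation.
class _EchoResult:
    def __init__(self, text: str) -> None:
        self.text = text

    def get_completions(self) -> list[str]:
        return [self.text]


def echo_fn(prompt, **kwargs):
    text = prompt if isinstance(prompt, str) else prompt[-1]["content"]
    return _EchoResult(text)


smoke_test(echo_fn)
```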
Step 5: Run Production Evals
Once validated, use the custom completion function in production eval runs. Compare results against baseline models to assess the impact of the custom pipeline (e.g., does RAG improve accuracy on knowledge-intensive evals? Does CoT improve reasoning evals?).
Key considerations:
- Multiple completion functions can be comma-separated in a single oaieval run
- Use --completion_args to pass additional runtime parameters
- Results include the same metrics and logging as standard model evals
- Compare with direct model baselines to quantify pipeline improvements
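Putting the pieces together, a comparison run might look like the command below. The registry name, eval name, and paths are hypothetical; the flags are those discussed in this workflow.

```shell
# Illustrative: evaluate a custom pipeline and a baseline model in one run.
oaieval my-pipeline/flan-t5-xl,gpt-4o-mini my-knowledge-eval \
  --registry_path ./my_project/registry \
  --completion_args temperature=0.0
```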