Heuristic: OpenAI Evals Chat Format Recommendation

Knowledge Sources	OpenAI Evals Build Eval Guide
Domains	LLM_Evaluation, Best_Practices
Last Updated	2026-02-14 10:00 GMT

Overview

Use chat message format for every eval prompt, even when the model under evaluation is not a chat model.

Description

The OpenAI Evals framework supports both plain string prompts and structured chat message format (list of `{"role": ..., "content": ...}` dicts) for the `"input"` field in JSONL eval data. The official documentation explicitly recommends using chat format for all evals, regardless of whether the target model is a chat model or a completion model. The framework's `CompletionFn` implementations handle the conversion between formats internally via `evals/prompt/base.py`.
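To make the two shapes concrete, the sketch below builds one sample in each form and classifies them. It is only an illustration: the sample text, the `"ideal"` key, and the `input_kind` helper are assumptions made for this example and are not taken from the framework.

```python
import json

# Two hypothetical JSONL lines for the same sample.
# The "ideal" key and the sample text are assumptions for illustration.
string_line = '{"input": "Translate to French: cheese", "ideal": "fromage"}'
chat_line = json.dumps({
    "input": [
        {"role": "system", "content": "Translate the user's text to French."},
        {"role": "user", "content": "cheese"},
    ],
    "ideal": "fromage",
})


def input_kind(jsonl_line: str) -> str:
    """Classify a sample's "input" as 'string', 'chat', or 'invalid'.

    Hypothetical helper, not part of the evals package.
    """
    value = json.loads(jsonl_line).get("input")
    if isinstance(value, str):
        return "string"
    # A chat prompt is a non-empty list of dicts with "role" and "content".
    if isinstance(value, list) and value and all(
        isinstance(m, dict) and "role" in m and "content" in m for m in value
    ):
        return "chat"
    return "invalid"


print(input_kind(string_line))  # string
print(input_kind(chat_line))    # chat
```

A check like this could run as a lint step over a samples file before an eval is registered, flagging any line that still uses the plain string form.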

Usage

Use this heuristic when creating new eval datasets or designing JSONL sample files. Always prefer chat format over plain string format for the `"input"` field.

The Insight (Rule of Thumb)

Action: Use chat message format (`[{"role": "system", "content": "..."}, {"role": "user", "content": "..."}]`) for the `"input"` field in eval JSONL files.
Value: N/A (format choice, not a numeric value).
Trade-off: Slightly more verbose JSONL files, but better compatibility across all model types and clearer prompt structure.
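One way to apply the rule to an existing dataset is to rewrite string inputs as chat messages before saving the JSONL file. The following is a minimal sketch under stated assumptions: `to_chat_sample` and `dump_jsonl` are hypothetical helpers, the `"ideal"` key is assumed, and placing the original string in a `user` message is a design choice of this example rather than something the framework prescribes.

```python
import json
from typing import Any, Dict, List


def to_chat_sample(sample: Dict[str, Any], system_prompt: str = "") -> Dict[str, Any]:
    """Return a copy of `sample` whose "input" is a list of chat messages.

    Hypothetical helper: a string input becomes a single user message,
    optionally preceded by a system message.
    """
    prompt = sample["input"]
    if not isinstance(prompt, str):
        return dict(sample)  # not a string (assumed chat format); shallow copy, unchanged
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": prompt})
    return {**sample, "input": messages}


def dump_jsonl(samples: List[Dict[str, Any]]) -> str:
    """Serialize samples as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(s, ensure_ascii=False) for s in samples) + "\n"


# Assumed legacy sample; the "ideal" key is an assumption for illustration.
legacy = [{"input": "What is 2 + 2?", "ideal": "4"}]
converted = [to_chat_sample(s, system_prompt="Answer with a single number.") for s in legacy]
print(dump_jsonl(converted), end="")
```

The verbosity trade-off is visible in the output: the converted line is longer than the original, but the instruction and the question now live in separate, labelled messages.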

Reasoning

Chat format provides several advantages:

Universal compatibility: The framework automatically converts chat format to plain text for non-chat models via `CompletionPrompt.to_formatted_prompt()` and similar utilities.
Structured prompts: System messages, user messages, and assistant messages are clearly separated, making prompts easier to read and maintain.
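Future-proofing: Most modern LLMs are chat models. String prompts still work, but a bare string carries no role information, so it has to be wrapped in a message before a chat model can consume it, and the resulting framing may not match what a chat-tuned model expects.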

The official build-eval documentation states: "We recommend chat format even if you are evaluating non-chat models."

Code Evidence

Documentation excerpt from `docs/build-eval.md:32`:

All templates expect an "input" key, which is the prompt, ideally
specified in chat format (though strings are also supported).
We recommend chat format even if you are evaluating non-chat models.

Prompt conversion utilities in `evals/prompt/base.py:1-122` handle the transformation between chat and completion formats, ensuring compatibility regardless of model type.
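The general idea behind such a conversion can be sketched as follows. This is an illustrative approximation, not the code in `evals/prompt/base.py`: the function names `flatten_chat_prompt` and `wrap_text_prompt`, the role prefixes, and the trailing `Assistant: ` cue are all assumptions chosen for the example.

```python
from typing import Dict, List

ChatPrompt = List[Dict[str, str]]


def flatten_chat_prompt(messages: ChatPrompt) -> str:
    """Render chat messages as one text prompt for a completion model.

    Illustrative only; NOT the framework's actual implementation.
    """
    # A single message needs no role labels; pass its text through.
    if len(messages) == 1:
        return messages[0]["content"]
    lines = []
    for msg in messages:
        # Assumed convention: system text is unprefixed, other roles are labelled.
        prefix = "" if msg["role"] == "system" else msg["role"].capitalize() + ": "
        lines.append(prefix + msg["content"])
    lines.append("Assistant: ")  # cue the completion model to answer
    return "\n".join(lines)


def wrap_text_prompt(text: str, role: str = "user") -> ChatPrompt:
    """Wrap a plain string as a one-message chat prompt (role is an assumption)."""
    return [{"role": role, "content": text}]


chat = [
    {"role": "system", "content": "Be terse."},
    {"role": "user", "content": "Capital of France?"},
]
print(flatten_chat_prompt(chat))
```

The asymmetry is the point: flattening a chat prompt to text is mechanical, whereas wrapping a string requires choosing a role, which is information the string format never recorded.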
