Heuristic:Openai Evals Chat Format Recommendation
| Knowledge Sources | |
|---|---|
| Domains | LLM_Evaluation, Best_Practices |
| Last Updated | 2026-02-14 10:00 GMT |
Overview
Best practice: write every eval prompt in chat message format, even when the model under evaluation is not a chat model.
Description
The OpenAI Evals framework supports both plain string prompts and structured chat message format (list of `{"role": ..., "content": ...}` dicts) for the `"input"` field in JSONL eval data. The official documentation explicitly recommends using chat format for all evals, regardless of whether the target model is a chat model or a completion model. The framework's `CompletionFn` implementations handle the conversion between formats internally via `evals/prompt/base.py`.
Usage
Use this heuristic when creating new eval datasets or designing JSONL sample files. Always prefer chat format over plain string format for the `"input"` field.
The Insight (Rule of Thumb)
- Action: Use chat message format (`[{"role": "system", "content": "..."}, {"role": "user", "content": "..."}]`) for the `"input"` field in eval JSONL files.
- Value: N/A (format choice, not a numeric value).
- Trade-off: Slightly more verbose JSONL files, but better compatibility across all model types and clearer prompt structure.
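As an illustration of the rule above, the following minimal sketch builds two samples with a chat-format `"input"` and serializes them as JSONL. The helper names `make_sample` and `to_jsonl` are hypothetical, and the `"ideal"` key is assumed here as the field holding the expected answer; check the template you are targeting for the keys it actually requires.

```python
import json


def make_sample(system: str, user: str, ideal: str) -> dict:
    """Build one eval sample whose "input" is a chat-format message list."""
    return {
        "input": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        # Assumption: "ideal" is the key the chosen eval template reads
        # for the expected answer.
        "ideal": ideal,
    }


def to_jsonl(samples: list) -> str:
    """Serialize samples as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(s) for s in samples) + "\n"


samples = [
    make_sample("Answer with a single word.", "What is the capital of France?", "Paris"),
    make_sample("Answer with a single word.", "What is the capital of Japan?", "Tokyo"),
]
jsonl_text = to_jsonl(samples)
print(jsonl_text, end="")
```

Each output line is a standalone JSON object, so the text can be written directly to a `.jsonl` file.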
Reasoning
Chat format provides several advantages:
- Universal compatibility: The framework automatically converts chat format to plain text for non-chat models via `CompletionPrompt.to_formatted_prompt()` and similar utilities.
- Structured prompts: System messages, user messages, and assistant messages are clearly separated, making prompts easier to read and maintain.
- Future-proofing: Most modern LLMs are chat models. String-format prompts are still supported, but they carry no role information, so a chat-tuned model may not see the system/user separation the prompt author intended.
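For datasets that already use plain strings, a small migration helper can bring samples in line with the recommendation. This is a sketch only: `ensure_chat_format` is a hypothetical name, and wrapping a bare string as a single `user` message is an assumption made for illustration, not necessarily how the framework itself interprets string prompts.

```python
def ensure_chat_format(sample: dict) -> dict:
    """Return a copy of the sample whose "input" is a chat message list."""
    prompt = sample["input"]
    if isinstance(prompt, str):
        # Assumption: treat a legacy string prompt as one user turn.
        prompt = [{"role": "user", "content": prompt}]
    # Build a new dict so the caller's sample is not mutated.
    return {**sample, "input": prompt}


legacy = {"input": "What is 2 + 2?", "ideal": "4"}
upgraded = ensure_chat_format(legacy)
already_chat = ensure_chat_format(upgraded)  # chat-format input passes through unchanged
```

If the original string contained instructions as well as a question, splitting it by hand into a `system` and a `user` message will usually give a clearer prompt than the automatic wrapping shown here.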
The official build-eval documentation states: "We recommend chat format even if you are evaluating non-chat models."
Code Evidence
Documentation excerpt from `docs/build-eval.md:32`:
> All templates expect an "input" key, which is the prompt, ideally
> specified in chat format (though strings are also supported).
> We recommend chat format even if you are evaluating non-chat models.
Prompt conversion utilities in `evals/prompt/base.py:1-122` handle the transformation between chat and completion formats, ensuring compatibility regardless of model type.
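To make the idea of chat-to-text conversion concrete, here is a simplified sketch of flattening a message list into a single completion-style prompt. It is not the code in `evals/prompt/base.py`: the function name `flatten_chat` and the `Role: content` layout are assumptions chosen for illustration, and the framework's actual rendering may differ.

```python
def flatten_chat(messages: list, final_role: str = "assistant") -> str:
    """Render chat messages as one text prompt for a completion-style model."""
    parts = []
    for message in messages:
        # Prefix each message with its capitalized role, e.g. "User: ...".
        parts.append(f"{message['role'].capitalize()}: {message['content']}")
    # End with an open role tag so the model continues as that speaker.
    parts.append(f"{final_role.capitalize()}:")
    return "\n".join(parts)


chat_prompt = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hi"},
]
text_prompt = flatten_chat(chat_prompt)
print(text_prompt)
# System: Be brief.
# User: Hi
# Assistant:
```

Because roles can always be flattened into text but cannot be reliably recovered from it, storing prompts in chat format keeps both options open.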