Heuristic:Openai Evals Chat Format Recommendation

From Leeroopedia
Knowledge Sources
Domains LLM_Evaluation, Best_Practices
Last Updated 2026-02-14 10:00 GMT

Overview

Best practice: use chat message format for all eval prompts, even when evaluating non-chat models.

Description

The OpenAI Evals framework supports both plain string prompts and structured chat message format (list of `{"role": ..., "content": ...}` dicts) for the `"input"` field in JSONL eval data. The official documentation explicitly recommends using chat format for all evals, regardless of whether the target model is a chat model or a completion model. The framework's `CompletionFn` implementations handle the conversion between formats internally via `evals/prompt/base.py`.
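
The two shapes of the `"input"` field described above can be told apart with a plain type check. The following is a minimal sketch for illustration; `is_chat_format` is a hypothetical helper and not part of the evals package, and the sample prompts are made up.

```python
import json

# The same prompt expressed in the two shapes the "input" field accepts.
string_sample = {"input": "Translate to French: Hello"}
chat_sample = {
    "input": [
        {"role": "system", "content": "Translate to French."},
        {"role": "user", "content": "Hello"},
    ]
}


def is_chat_format(value):
    """Return True if value is a non-empty list of dicts with 'role' and 'content'.

    Hypothetical helper for illustration; not an evals API.
    """
    return (
        isinstance(value, list)
        and len(value) > 0
        and all(
            isinstance(m, dict) and "role" in m and "content" in m
            for m in value
        )
    )


# Each sample serializes to a single JSONL line.
for sample in (string_sample, chat_sample):
    print(is_chat_format(sample["input"]), json.dumps(sample))
```

A check like this could be run over an existing dataset to find samples that still use plain strings.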

Usage

Use this heuristic when creating new eval datasets or designing JSONL sample files. Always prefer chat format over plain string format for the `"input"` field.

The Insight (Rule of Thumb)

  • Action: Use chat message format (`[{"role": "system", "content": "..."}, {"role": "user", "content": "..."}]`) for the `"input"` field in eval JSONL files.
  • Value: N/A (format choice, not a numeric value).
  • Trade-off: Slightly more verbose JSONL files, but better compatibility across all model types and clearer prompt structure.
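
As a rough sketch of the action above, the snippet below builds samples with a chat-format `"input"` and writes them as JSONL. The helper names `make_sample` and `write_jsonl` are hypothetical, and the `"ideal"` key is an assumption borrowed from match-style eval templates rather than something this page specifies.

```python
import json
import tempfile
from pathlib import Path


def make_sample(system, user, ideal):
    """Build one eval sample with a chat-format "input" (hypothetical helper)."""
    return {
        "input": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        # Assumption: "ideal" is the expected-answer key used by match-style templates.
        "ideal": ideal,
    }


def write_jsonl(samples, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")


samples = [
    make_sample("Answer with a single number.", "What is 2 + 2?", "4"),
    make_sample("Answer with a single number.", "What is 3 * 3?", "9"),
]

# A temporary directory keeps the example self-contained.
out_path = Path(tempfile.mkdtemp()) / "samples.jsonl"
write_jsonl(samples, out_path)
```

Keeping the system instruction and the user question in separate messages is what makes the file slightly more verbose, which is the trade-off noted above.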

Reasoning

Chat format provides several advantages:

  1. Universal compatibility: The framework automatically converts chat format to plain text for non-chat models via `CompletionPrompt.to_formatted_prompt()` and related helpers in `evals/prompt/base.py`.
  2. Structured prompts: System messages, user messages, and assistant messages are clearly separated, making prompts easier to read and maintain.
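  3. Future-proofing: Most modern LLMs are chat models. String-format prompts still work, but a plain string carries no role information, so a chat-tuned model receives the whole prompt as a single undifferentiated message and may behave differently than intended.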
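
The asymmetry behind points 1 and 3 can be sketched in a few lines. This is a simplified stand-in, not the code in `evals/prompt/base.py`: the function names, the role-prefixed layout, the trailing `assistant:` cue, and the default `user` role are all assumptions chosen for illustration.

```python
def chat_to_text(messages):
    """Flatten chat messages into one string for a completion-style model.

    Illustrative only: role-prefixed lines plus a trailing "assistant:" cue
    are an assumed layout, not the framework's exact output.
    """
    lines = [f"{m['role']}: {m['content']}" for m in messages]
    lines.append("assistant:")
    return "\n".join(lines)


def text_to_chat(prompt, role="user"):
    """Wrap a plain string as a single chat message (the role is an assumption)."""
    return [{"role": role, "content": prompt}]


chat_prompt = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hi"},
]

# Chat -> text keeps every message, with roles folded into the string.
flattened = chat_to_text(chat_prompt)

# Text -> chat can only produce one message; the system/user split is not recoverable.
wrapped = text_to_chat("Be brief. Hi")
```

Starting from chat format keeps both directions available, whereas starting from a string leaves nothing to reconstruct the roles from.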

The official build-eval documentation states: "We recommend chat format even if you are evaluating non-chat models."

Code Evidence

Documentation excerpt from `docs/build-eval.md:32`:

All templates expect an "input" key, which is the prompt, ideally
specified in chat format (though strings are also supported).
We recommend chat format even if you are evaluating non-chat models.

Prompt conversion utilities in `evals/prompt/base.py:1-122` handle the transformation between chat and completion formats, ensuring compatibility regardless of model type.
