Principle:Openai Evals Chain of Thought Prompting
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Prompting Strategy, Reasoning |
| Last Updated | 2026-02-14 10:00 GMT |
Overview
A principle that improves model reasoning accuracy by eliciting explicit step-by-step thinking, which decomposes a complex problem into manageable intermediate steps, before a final answer is extracted.
Description
Chain-of-Thought (CoT) prompting is a prompting strategy that instructs the model to produce intermediate reasoning steps before arriving at a final answer. Rather than directly generating a conclusion, the model first thinks through the problem, making its reasoning process explicit and traceable.
The OpenAI Evals framework implements CoT through a two-phase architecture:
- Phase 1 -- Reasoning (cot_solver): The model receives the original task along with an instruction to reason step-by-step. It produces a detailed chain of thought that breaks the problem into sub-steps, performs intermediate calculations or logical deductions, and works toward a conclusion. This phase prioritizes completeness and correctness of reasoning over conciseness.
- Phase 2 -- Extraction (extract_solver): A second solver (which may be the same or a different model) receives the chain of thought from Phase 1 and extracts a clean, formatted final answer. This separation ensures that the reasoning process is unconstrained by output format requirements, while the final answer adheres to the expected evaluation format (e.g., a single letter for MCQ, a number for math problems).
This two-phase design offers several advantages:
- Separation of concerns -- reasoning quality is decoupled from answer formatting.
- Debuggability -- the intermediate chain of thought can be inspected to understand why the model reached a particular answer.
- Flexibility -- the extraction solver can be tailored to different answer formats without modifying the reasoning phase.
- Composability -- CoT can be combined with other principles such as few-shot prompting or self-consistency for further improvements.
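The two-phase architecture described above can be sketched in a few lines of Python. This is a minimal illustration, not the OpenAI Evals solver interface: the `Solver` type, `solve_with_cot`, and the two stub functions are hypothetical names introduced here, and the stubs stand in for real model calls so the example runs on its own.

```python
from typing import Callable

# Hypothetical stand-in: a "solver" is any function from prompt text to completion text.
# This is NOT the OpenAI Evals solver API.
Solver = Callable[[str], str]

COT_INSTRUCTION = "Let's think step by step."
EXTRACT_INSTRUCTION = "Extract the final answer."


def solve_with_cot(question: str, cot_solver: Solver, extract_solver: Solver) -> dict:
    """Run Phase 1 (reasoning) and Phase 2 (extraction), keeping both outputs."""
    # Phase 1: ask for free-form reasoning with no constraint on output format.
    reasoning_prompt = f"{question}\n{COT_INSTRUCTION}"
    chain_of_thought = cot_solver(reasoning_prompt)

    # Phase 2: hand the question and the reasoning to the extraction solver.
    extract_prompt = f"{question}\n{chain_of_thought}\n{EXTRACT_INSTRUCTION}"
    final_answer = extract_solver(extract_prompt).strip()

    # Returning the chain of thought keeps it available for inspection.
    return {"chain_of_thought": chain_of_thought, "final_answer": final_answer}


# Stub solvers (assumptions for illustration) so the sketch runs without a model.
def fake_reasoner(prompt: str) -> str:
    return "Step 1: 3 boxes of 4 apples is 12 apples. Step 2: 12 - 2 = 10. Therefore, 10."


def fake_extractor(prompt: str) -> str:
    # The second line of the extraction prompt is the chain of thought;
    # take its last word and drop the trailing period.
    reasoning = prompt.split("\n")[1]
    return reasoning.split()[-1].rstrip(".")


result = solve_with_cot(
    "There are 3 boxes of 4 apples; 2 apples are eaten. How many remain?",
    fake_reasoner,
    fake_extractor,
)
print(result["final_answer"])
```

Because the two solvers are passed in separately, the extractor can be swapped (for example, one that returns a single letter for MCQ) without touching the reasoning step.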
Usage
Apply chain-of-thought prompting when:
- The task involves multi-step reasoning such as mathematical word problems, logical deductions, or multi-hop question answering.
- Direct prompting produces incorrect answers that stem from skipped reasoning steps rather than lack of knowledge.
- You need interpretable model outputs where the reasoning process can be audited.
- The model is sufficiently large -- Wei et al. (2022) observed CoT gains mainly in models of roughly 100B parameters or more, and smaller models may produce incoherent chains of thought.
- You want to compose with other strategies such as self-consistency (sampling multiple CoT paths) or few-shot learning (providing example reasoning chains).
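Composing CoT with self-consistency, as mentioned in the last item above, amounts to sampling several reasoning paths and taking a majority vote over their extracted answers. The sketch below assumes a hypothetical `sample_answer` callable that runs one full CoT pass (reasoning plus extraction) and returns the final answer; a canned iterator replaces the model here so the example is deterministic.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_answer(answers: Sequence[str]) -> str:
    """Return the most frequent answer among the sampled ones."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


def self_consistent_answer(
    question: str,
    sample_answer: Callable[[str], str],  # hypothetical: one CoT pass -> final answer
    n_samples: int = 5,
) -> str:
    """Sample n_samples independent CoT answers and return the majority vote."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    return majority_answer(answers)


# Canned answers stand in for five sampled reasoning paths (illustrative data only).
canned = iter(["B", "C", "B", "B", "A"])
answer = self_consistent_answer(
    "Which option is correct?", lambda q: next(canned), n_samples=5
)
print(answer)
```

In practice the sampled paths would come from the same model at a non-zero temperature, so that the reasoning chains actually differ.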
Theoretical Basis
The theoretical motivation for CoT comes from the observation that complex reasoning emerges when models are given space to think. Wei et al. (2022) demonstrated that prompting models to show their work significantly improves performance on tasks requiring arithmetic, commonsense, and symbolic reasoning.
Formal two-phase structure:
Phase 1 (Reasoning):
    input:  (question, "Let's think step by step.")
    output: chain_of_thought = "Step 1: ... Step 2: ... Therefore, ..."
Phase 2 (Extraction):
    input:  (question, chain_of_thought, "Extract the final answer.")
    output: final_answer = "B"
Why CoT works -- computational perspective:
Standard prompting maps input directly to output through a fixed number of transformer layers:
    Standard: answer = f(question)             -- fixed computation depth
    CoT:      answer = f(question, reasoning)  -- variable computation via token generation
By generating intermediate tokens, the model effectively gains additional computation steps proportional to the length of the reasoning chain. Each generated token allows the model to perform new attention computations over all previous tokens, enabling it to build up complex intermediate representations that would not fit within a single forward pass.
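The scaling argument above can be made concrete with a deliberately simplified counting model. It assumes one forward pass through every layer per generated token, a single-token answer, and it ignores prompt processing and caching details; the function name and the numbers (32 layers, 50 reasoning tokens) are arbitrary illustrations, not measurements of any model.

```python
def sequential_layer_steps(num_layers: int, generated_tokens: int) -> int:
    """Layer applications performed in sequence by the time the last token is emitted.

    Simplified model: each generated token costs one forward pass through all
    layers, and each pass can attend to the tokens produced by earlier passes.
    """
    if num_layers < 1 or generated_tokens < 1:
        raise ValueError("num_layers and generated_tokens must be positive")
    return num_layers * generated_tokens


# Direct answer: only the answer token is generated.
direct = sequential_layer_steps(num_layers=32, generated_tokens=1)

# CoT: 50 reasoning tokens are generated before the single answer token.
with_cot = sequential_layer_steps(num_layers=32, generated_tokens=51)

print(direct, with_cot)
```

Under these assumptions the sequential computation grows linearly with the number of generated tokens, which is the sense in which a longer reasoning chain buys the model more computation.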
Empirical pattern:
Task difficulty    | Standard accuracy | CoT accuracy | Delta
-------------------+-------------------+--------------+------
Simple (1-step)    | 95%               | 95%          | +0%
Medium (2-3 steps) | 70%               | 85%          | +15%
Hard (4+ steps)    | 40%               | 65%          | +25%
The improvement is most pronounced on tasks requiring the most reasoning steps.