Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Openai Evals Persistent Memory Management

From Leeroopedia
Knowledge Sources
Domains Evaluation, State Management, Multi-Turn Reasoning
Last Updated 2026-02-14 10:00 GMT

Overview

A state management mechanism that maintains internal reasoning context across multi-turn solver interactions while selectively hiding private messages from the evaluation framework's view.

Description

Persistent Memory Management solves the tension between two competing requirements in multi-turn evaluation: solvers need continuous access to their own reasoning history (such as chain-of-thought steps, intermediate computations, or self-consistency deliberations), but the evaluation framework should only see the final public-facing responses that represent the solver's actual answers.

The PersistentMemoryCache class implements this by maintaining a dual-visibility message store. Every message added to the cache is tagged as either private or public:

  • Private messages are internal reasoning artefacts -- chain-of-thought prompts and responses, self-reflection steps, and any intermediate computation that the solver uses to arrive at its answer. These messages are visible to the solver on subsequent turns but invisible to the evaluation framework.
  • Public messages are the solver's final responses intended for the eval. These are visible to both the solver and the evaluation framework.

On each turn of a multi-turn interaction, the memory cache performs the following operations:

  • Reinsertion: Private messages from previous turns are spliced back into the conversation history before the solver processes the new turn, giving the solver full access to its reasoning context.
  • Stripping: Before the evaluation framework receives the conversation history, all private messages are removed, presenting a clean sequence of only the public-facing exchanges.

This architecture is critical for compositional solvers that layer reasoning strategies. For example, the CoTSolver (chain-of-thought) generates an internal reasoning trace before producing a final answer. Without persistent memory, the reasoning trace from turn N would be lost by turn N+1. With persistent memory, the CoT reasoning from all previous turns remains available to inform future reasoning, while the eval only ever sees the extracted final answers.

Usage

Apply persistent memory management in the following scenarios:

  • Chain-of-thought (CoT) evaluation where intermediate reasoning steps must persist across turns but not be scored.
  • Self-consistency evaluation where multiple reasoning paths are generated and aggregated privately before emitting a single public answer.
  • Any multi-turn solver that maintains internal state beyond what the evaluation framework should observe.

The memory cache is typically managed internally by solver implementations and does not require direct configuration in YAML files. Solvers that use it inherit the behaviour through composition:

solver:
  class: evals.solvers.cot_solver:CoTSolver
  args:
    cot_solver:
      class: evals.solvers.openai_solver:OpenAISolver
      args:
        model: gpt-4
    extract_solver:
      class: evals.solvers.openai_solver:OpenAISolver
      args:
        model: gpt-4

In this configuration, the CoTSolver internally uses the PersistentMemoryCache to track CoT reasoning (private) and extracted answers (public) across turns.

Theoretical Basis

The theoretical foundation draws from the information hiding principle in software engineering and the concept of working memory in cognitive science. Just as human working memory retains intermediate reasoning steps that are not part of the final communicated answer, the persistent memory cache retains solver-internal reasoning that is not part of the evaluated output.

The algorithm proceeds as follows:

1. Receive new turn from the evaluation framework:
   - eval_messages: the public conversation history as seen by the eval

2. Reconstruct the solver's full context:
   - Retrieve all private messages from the memory cache
   - Interleave private messages at their original positions within eval_messages
   - Result: full_context = eval_messages + reinserted private messages

3. Pass full_context to the solver for processing:
   - Solver generates intermediate reasoning (tagged as private)
   - Solver generates final answer (tagged as public)

4. Update the memory cache:
   - Store new private messages with their position indices
   - Store new public messages

5. Return only public messages to the evaluation framework

The invariant maintained by this system is: the evaluation framework's view of the conversation is always a strict subsequence of the solver's view. No information flows from private messages into the eval's scoring context.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment