Principle:Microsoft Autogen State Persistence

Knowledge Sources	Microsoft AutoGen, AutoGen State Docs
Domains	Multi-Agent Systems, State Management, Conversation Persistence, Human-in-the-Loop
Last Updated	2026-02-11 00:00 GMT

Overview

State persistence is the capability to capture, serialize, and restore the complete conversational and operational state of a multi-agent swarm, enabling pause-resume workflows and long-running interactions.

Description

Multi-agent swarm workflows frequently require interruption and resumption. The most common driver is the human-in-the-loop pattern, where a swarm pauses (via HandoffTermination) to collect user input and then resumes processing. However, state persistence addresses a broader set of needs:

Session continuity: Users may close a browser or disconnect, and the swarm should resume exactly where it left off.
Checkpointing: Long-running workflows benefit from periodic state snapshots that allow recovery from failures.
Portability: Saved state can be transferred between different runtime instances or environments.
Debugging: Captured state enables replaying conversations from specific points.

State persistence operates at two levels:

Task-level state (TaskResult): After each execution, the swarm produces a TaskResult containing the complete message sequence and the stop reason. This is a lightweight, read-only snapshot of the conversation outcome.
Team-level state (save_state/load_state): save_state() captures the full internal state of the swarm, including each agent's model context and the group chat manager's message thread, turn counter, and current speaker; load_state() restores it. This enables true pause-resume functionality.

The team-level state is structured as a nested dictionary with an agent_states key, containing the serialized state of each participant and the group chat manager, keyed by agent name. This structure is designed for portability: agent names are used as keys (not internal agent IDs) so that state can be transferred between different runtime instances.
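Because the team-level state is a nested dictionary keyed by agent name, it can be written to disk and read back with ordinary tooling. The following is a minimal sketch, assuming the dictionary is JSON-serializable. The helper names save_checkpoint and load_checkpoint and the contents of example_state are hypothetical and only mirror the shape described above; real agent entries hold more fields.

```python
import json
from pathlib import Path
from typing import Any, Mapping


def save_checkpoint(state: Mapping[str, Any], path: Path) -> None:
    """Write a team-state dictionary to disk as JSON (hypothetical helper)."""
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")


def load_checkpoint(path: Path) -> dict[str, Any]:
    """Read a team-state dictionary back from disk (hypothetical helper)."""
    return json.loads(path.read_text(encoding="utf-8"))


# Illustrative state only: the per-agent entries are left empty here.
example_state = {
    "agent_states": {
        "agent_1": {},
        "agent_2": {},
        "SwarmGroupChatManager": {
            "message_thread": [],
            "current_turn": 5,
            "current_speaker": "agent_2",
        },
    }
}
```

In a real workflow, the dictionary returned by team.save_state() would take the place of example_state, and the result of load_checkpoint would be passed to team.load_state().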

Usage

Use state persistence when:

Building human-in-the-loop workflows where the swarm pauses for user input.
Implementing long-running multi-step swarm workflows that may span multiple sessions.
Creating checkpointed workflows that can recover from failures.
Transferring swarm state between different runtime environments.
Debugging by capturing and replaying specific conversation states.

Theoretical Basis

State persistence in swarm systems follows the memento pattern from software design, where the internal state of an object is externalized without exposing its implementation details.
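As a rough illustration of the memento pattern itself, the sketch below uses a hypothetical Conversation class, unrelated to AutoGen's internals. The originator produces and consumes an opaque snapshot, and the calling code (the caretaker) only stores it.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Conversation:
    """Originator: owns private state and knows how to snapshot it."""

    _messages: list[str] = field(default_factory=list)
    _turn: int = 0

    def say(self, text: str) -> None:
        self._messages.append(text)
        self._turn += 1

    def save_state(self) -> dict[str, Any]:
        # Copy the list so later mutations do not leak into the memento.
        return {"messages": list(self._messages), "turn": self._turn}

    def load_state(self, state: dict[str, Any]) -> None:
        self._messages = list(state["messages"])
        self._turn = state["turn"]


# Caretaker: holds the memento without interpreting its contents.
chat = Conversation()
chat.say("hello")
checkpoint = chat.save_state()
chat.say("this turn will be rolled back")
chat.load_state(checkpoint)
```

After load_state, the conversation is back at the checkpointed turn, and the caretaker never had to inspect the object's private fields.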

State Persistence Model:

TaskResult (lightweight snapshot):
  {
    messages: [m_1, m_2, ..., m_n],  // Complete message sequence
    stop_reason: "Handoff to user from Alice detected."  // Why execution stopped
  }

Team State (full checkpoint):
  {
    "agent_states": {
      "agent_1": {
        // Agent 1's model context, memory, internal state
      },
      "agent_2": {
        // Agent 2's model context, memory, internal state
      },
      "SwarmGroupChatManager": {
        "message_thread": [...],     // Full conversation thread
        "current_turn": 5,           // Turn counter
        "current_speaker": "agent_2" // Who speaks next on resume
      }
    }
  }

Pause-Resume Workflow:

  1. EXECUTE swarm with HandoffTermination(target="user")
  2. Swarm pauses, returns TaskResult with stop_reason
  3. SAVE state via team.save_state() -> state_dict
  4. (Optional) Serialize state_dict to persistent storage
  5. (Optional) In a new session or team instance, deserialize state_dict and restore it via team.load_state(state_dict)
  6. RESUME swarm with HandoffMessage(source="user", target=agent, content=user_input)
  7. Swarm continues from exact point of interruption

Critical constraints:
  - save_state() should NOT be called while the team is running
  - load_state() CANNOT be called while the team is running (raises RuntimeError)
  - State format changed in v0.4.9: agent names used as keys instead of agent IDs
  - All participant names in saved state must match current team configuration
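The workflow and constraints above can be exercised with a toy stand-in. ToyTeam below is a hypothetical class written for illustration only: it is not AutoGen's implementation, it calls no model, and its "manager" key, its stop-reason string, and the ValueError on a participant mismatch are assumptions. It mimics only the async save_state/load_state shape, the RuntimeError when loading into a running team, and the participant-name check.

```python
import asyncio
from typing import Any


class ToyTeam:
    """Hypothetical stand-in for a team; not AutoGen's implementation."""

    def __init__(self, participants: list[str]) -> None:
        self._participants = participants
        self._thread: list[str] = []
        self._running = False

    async def run(self, task: str) -> str:
        self._running = True
        try:
            self._thread.append(task)
            await asyncio.sleep(0)  # yield control, as a real run would
            return "Handoff to user detected."  # placeholder stop reason
        finally:
            self._running = False

    async def save_state(self) -> dict[str, Any]:
        agents: dict[str, Any] = {name: {} for name in self._participants}
        agents["manager"] = {"message_thread": list(self._thread)}
        return {"agent_states": agents}

    async def load_state(self, state: dict[str, Any]) -> None:
        if self._running:
            raise RuntimeError("cannot load state while the team is running")
        saved = state["agent_states"]
        missing = [n for n in self._participants if n not in saved]
        if missing:  # assumption: a mismatch is reported as ValueError
            raise ValueError(f"saved state lacks participants: {missing}")
        self._thread = list(saved["manager"]["message_thread"])


async def pause_and_resume() -> list[str]:
    first = ToyTeam(["agent_1", "agent_2"])
    await first.run("book a flight")          # steps 1-2: run, then pause
    state = await first.save_state()          # step 3: save while idle
    second = ToyTeam(["agent_1", "agent_2"])  # fresh instance, same names
    await second.load_state(state)            # step 5: restore
    await second.run("user: aisle seat")      # steps 6-7: resume
    return second._thread


resumed_thread = asyncio.run(pause_and_resume())
```

The second instance continues the thread started by the first, which is the behaviour the pause-resume steps describe. With the real library, team.run(), team.save_state(), and team.load_state() would replace the toy methods, and the resume task would be a HandoffMessage.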

The two-tier approach (TaskResult for lightweight reads, save_state/load_state for full checkpoints) provides flexibility: simple workflows only need TaskResult to track outcomes, while complex multi-session workflows use the full state persistence API.

Related Pages

Implemented By

Implementation:Microsoft_Autogen_TaskResult_State
