Heuristic: Microsoft AutoGen Model Context Limiting
| Knowledge Sources | |
|---|---|
| Domains | Multi_Agent_Systems, Optimization |
| Last Updated | 2026-02-11 18:00 GMT |
Overview
Use `BufferedChatCompletionContext` or `TokenLimitedChatCompletionContext` to limit the context window sent to the model and prevent token limit errors in long conversations.
Description
In multi-agent conversations, the message history grows with every turn. Without context limiting, the entire history is sent to the model on each inference, which can exceed the model's token limit and cause errors or excessive costs. AutoGen provides two built-in context managers: `BufferedChatCompletionContext` (limits by message count) and `TokenLimitedChatCompletionContext` (limits by token count). Custom contexts can be created by subclassing `ChatCompletionContext`.
Usage
Use this heuristic when:
- Running long conversations (10+ turns) where history accumulates
- Using models with smaller context windows (e.g., 4K or 8K token models)
- Optimizing cost by reducing the number of tokens sent per inference
- Working with tool-heavy workflows where tool call/result messages inflate the context rapidly
The Insight (Rule of Thumb)
- Action: Set the `model_context` parameter when creating an AssistantAgent.
- Value: `BufferedChatCompletionContext(buffer_size=10)` to keep only the most recent messages, or `TokenLimitedChatCompletionContext(model_client=client)` for token-aware limiting.
- Trade-off: Older messages are dropped from the context, so the agent loses awareness of earlier conversation. This is usually acceptable for task-focused agents but may cause issues for agents that need full history awareness.
```python
from autogen_agentchat.agents import AssistantAgent
from autogen_core.model_context import (
    BufferedChatCompletionContext,
    TokenLimitedChatCompletionContext,
)

# `client` is assumed to be an already-constructed model client,
# e.g. an OpenAIChatCompletionClient instance.

# Option 1: Keep only the last N messages
agent = AssistantAgent(
    name="assistant",
    model_client=client,
    model_context=BufferedChatCompletionContext(buffer_size=10),
)

# Option 2: Limit by token count (uses the model client's token counting)
agent = AssistantAgent(
    name="assistant",
    model_client=client,
    model_context=TokenLimitedChatCompletionContext(model_client=client),
)
```
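The trade-off noted above (older messages are dropped) can be seen in a dependency-free sketch of count-based buffering. The class below is an illustrative stand-in, not AutoGen's implementation:

```python
# Minimal stand-in for count-based context buffering (illustration only,
# not AutoGen's BufferedChatCompletionContext): the full history is stored,
# but only the most recent `buffer_size` messages are returned on each read.
class BufferedContext:
    def __init__(self, buffer_size: int) -> None:
        self.buffer_size = buffer_size
        self._messages: list[str] = []

    def add_message(self, message: str) -> None:
        self._messages.append(message)

    def get_messages(self) -> list[str]:
        # Only the tail of the history would be sent to the model.
        return self._messages[-self.buffer_size:]

ctx = BufferedContext(buffer_size=3)
for i in range(1, 6):
    ctx.add_message(f"turn-{i}")

print(ctx.get_messages())  # → ['turn-3', 'turn-4', 'turn-5']
```

With `buffer_size=3`, turns 1 and 2 never reach the model again, which is exactly the loss of earlier-conversation awareness the trade-off describes.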
Reasoning
From the AssistantAgent docstring at `_assistant_agent.py:175-184`:
```python
# You can limit the number of messages sent to the model by setting
# the model_context parameter to a BufferedChatCompletionContext.
# This will limit the number of recent messages sent to the model and
# can be useful when the model has a limit on the number of tokens it
# can process.
# Another option is to use a TokenLimitedChatCompletionContext which
# will limit the number of tokens sent to the model.
# You can also create your own model context by subclassing
# ChatCompletionContext.
```
In multi-agent group chats, every participant's message is published to all other participants, causing context growth proportional to participants × turns. For a 5-agent group chat running 20 turns, that is 100+ messages in the context. With tool calls, each turn may generate multiple messages (tool call + tool result), further inflating the context.
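The growth arithmetic is easy to check; the figures below are the ones used above, with an assumed per-turn message count for the tool-use case:

```python
# Rough context-growth estimate for a group chat (figures are illustrative).
participants = 5
turns = 20

# Shared history grows roughly with participants * turns,
# since every message is published to all participants.
base_messages = participants * turns
print(base_messages)  # → 100

# With tool use, assume each turn produces an assistant message,
# a tool call, and a tool result.
messages_per_tool_turn = 3
with_tools = participants * turns * messages_per_tool_turn
print(with_tools)  # → 300
```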
The `BufferedChatCompletionContext` is simpler and more predictable: it always keeps the N most recent messages. The `TokenLimitedChatCompletionContext` is smarter: it counts tokens and trims dynamically to fit the model's context window, which is the better choice when message lengths vary significantly.
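A sketch of the token-aware strategy, walking from newest to oldest and keeping messages while a token budget allows. A crude whitespace word count stands in for the model client's tokenizer, which `TokenLimitedChatCompletionContext` actually uses:

```python
# Token-aware trimming sketch (illustration only, not AutoGen's code):
# keep the newest messages that fit within a token budget.
def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer.
    return len(text.split())

def trim_to_budget(messages: list[str], token_budget: int) -> list[str]:
    kept: list[str] = []
    used = 0
    # Walk from newest to oldest, keeping messages while the budget allows.
    for msg in reversed(messages):
        cost = estimate_tokens(msg)
        if used + cost > token_budget:
            break
        kept.append(msg)
        used += cost
    kept.reverse()  # restore chronological order
    return kept

history = [
    "a long early message with many words here",
    "short reply",
    "final user question please answer",
]
print(trim_to_budget(history, token_budget=8))
# → ['short reply', 'final user question please answer']
```

The long early message is dropped first even though only three messages exist, which is why token-based trimming handles mixed-length histories better than a fixed message count.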