Heuristic: Microsoft AutoGen Model Context Limiting
| Knowledge Sources | |
|---|---|
| Domains | Multi_Agent_Systems, Optimization |
| Last Updated | 2026-02-11 18:00 GMT |
Overview
Use `BufferedChatCompletionContext` or `TokenLimitedChatCompletionContext` to limit the context window sent to the model and prevent token limit errors in long conversations.
Description
In multi-agent conversations, the message history grows with every turn. Without context limiting, the entire history is sent to the model on each inference, which can exceed the model's token limit and cause errors or excessive costs. AutoGen provides two built-in context managers: `BufferedChatCompletionContext` (limits by message count) and `TokenLimitedChatCompletionContext` (limits by token count). Custom contexts can be created by subclassing `ChatCompletionContext`.
Usage
Use this heuristic when:
- Running long conversations (10+ turns) where history accumulates
- Using models with smaller context windows (e.g., 4K or 8K token models)
- Optimizing cost by reducing the number of tokens sent per inference
- Working with tool-heavy workflows where tool call/result messages inflate the context rapidly
The Insight (Rule of Thumb)
- Action: Set the `model_context` parameter when creating an AssistantAgent.
- Value: `BufferedChatCompletionContext(buffer_size=10)` to keep only the most recent messages, or `TokenLimitedChatCompletionContext(model_client=client)` for token-aware limiting.
- Trade-off: Older messages are dropped from the context, so the agent loses awareness of earlier conversation. This is usually acceptable for task-focused agents but may cause issues for agents that need full history awareness.
```python
from autogen_agentchat.agents import AssistantAgent
from autogen_core.model_context import (
    BufferedChatCompletionContext,
    TokenLimitedChatCompletionContext,
)

# `client` is assumed to be an already-constructed model client,
# e.g. an OpenAIChatCompletionClient instance.

# Option 1: Keep only the last N messages
agent = AssistantAgent(
    name="assistant",
    model_client=client,
    model_context=BufferedChatCompletionContext(buffer_size=10),
)

# Option 2: Limit by token count (uses the model client's token counting)
agent = AssistantAgent(
    name="assistant",
    model_client=client,
    model_context=TokenLimitedChatCompletionContext(model_client=client),
)
```
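The trade-off noted above (older messages are dropped) can be seen in a dependency-free sketch of count-based buffering. The class below is an illustrative stand-in, not AutoGen's implementation:

```python
# Minimal stand-in for count-based context buffering (illustration only,
# not AutoGen's BufferedChatCompletionContext): the full history is stored,
# but only the most recent `buffer_size` messages are returned on each read.
class BufferedContext:
    def __init__(self, buffer_size: int) -> None:
        self.buffer_size = buffer_size
        self._messages: list[str] = []

    def add_message(self, message: str) -> None:
        self._messages.append(message)

    def get_messages(self) -> list[str]:
        # Only the tail of the history would be sent to the model.
        return self._messages[-self.buffer_size:]

ctx = BufferedContext(buffer_size=3)
for i in range(1, 6):
    ctx.add_message(f"turn-{i}")

print(ctx.get_messages())  # → ['turn-3', 'turn-4', 'turn-5']
```

With `buffer_size=3`, turns 1 and 2 never reach the model again, which is exactly the loss of earlier-conversation awareness the trade-off describes.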
Reasoning
From the AssistantAgent docstring at `_assistant_agent.py:175-184`:
```python
# You can limit the number of messages sent to the model by setting
# the model_context parameter to a BufferedChatCompletionContext.
# This will limit the number of recent messages sent to the model and
# can be useful when the model has a limit on the number of tokens it
# can process.
# Another option is to use a TokenLimitedChatCompletionContext which
# will limit the number of tokens sent to the model.
# You can also create your own model context by subclassing
# ChatCompletionContext.
```
In multi-agent group chats, every participant's message is published to all other participants, causing context growth proportional to participants × turns. For a 5-agent group chat running 20 turns, that is 100+ messages in the context. With tool calls, each turn may generate multiple messages (tool call + tool result), further inflating the context.
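The growth arithmetic is easy to check; the figures below are the ones used above, with an assumed per-turn message count for the tool-use case:

```python
# Rough context-growth estimate for a group chat (figures are illustrative).
participants = 5
turns = 20

# Shared history grows roughly with participants * turns,
# since every message is published to all participants.
base_messages = participants * turns
print(base_messages)  # → 100

# With tool use, assume each turn produces an assistant message,
# a tool call, and a tool result.
messages_per_tool_turn = 3
with_tools = participants * turns * messages_per_tool_turn
print(with_tools)  # → 300
```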
The `BufferedChatCompletionContext` is simpler and more predictable: it always keeps the N most recent messages. The `TokenLimitedChatCompletionContext` is smarter: it counts tokens and trims dynamically to fit the model's context window, which is the better choice when message lengths vary significantly.
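A sketch of the token-aware strategy, walking from newest to oldest and keeping messages while a token budget allows. A crude whitespace word count stands in for the model client's tokenizer, which `TokenLimitedChatCompletionContext` actually uses:

```python
# Token-aware trimming sketch (illustration only, not AutoGen's code):
# keep the newest messages that fit within a token budget.
def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer.
    return len(text.split())

def trim_to_budget(messages: list[str], token_budget: int) -> list[str]:
    kept: list[str] = []
    used = 0
    # Walk from newest to oldest, keeping messages while the budget allows.
    for msg in reversed(messages):
        cost = estimate_tokens(msg)
        if used + cost > token_budget:
            break
        kept.append(msg)
        used += cost
    kept.reverse()  # restore chronological order
    return kept

history = [
    "a long early message with many words here",
    "short reply",
    "final user question please answer",
]
print(trim_to_budget(history, token_budget=8))
# → ['short reply', 'final user question please answer']
```

The long early message is dropped first even though only three messages exist, which is why token-based trimming handles mixed-length histories better than a fixed message count.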