Principle:Neuml Txtai Agent Execution
| Knowledge Sources | |
|---|---|
| Domains | NLP, Agent, Tool_Use |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Agent Execution is the principle of running the agent's iterative reasoning-action loop. The agent renders a prompt with memory context, invokes the smolagents process runner to interleave LLM inference with tool calls, then returns a final answer and updates its conversational memory.
Description
Once an agent has been configured (see Agent Configuration), it must be executed against a user query. The Agent Execution principle describes the runtime lifecycle that transforms a text query into a final answer, potentially involving multiple rounds of LLM reasoning and tool invocation.
The execution flow, triggered by calling the Agent instance as a function (Agent.__call__), proceeds through several stages:
1. Parameter Setup
The maxlength parameter (defaulting to 8192 tokens) is forwarded to PipelineModel.parameters() to configure the maximum sequence length for the underlying LLM.
2. Memory Reset (optional)
If reset=True is passed, the agent's memory deque is cleared, discarding all prior conversation context. This is useful when starting a new conversation thread.
3. Prompt Rendering
The prompt method uses a Jinja2 template to combine the user's text with accumulated memory. The default template places the user's query first, followed by a memory section (if any prior exchanges exist) that instructs the agent to use conversation history when relevant, or ignore it if irrelevant. A custom template can be injected at construction time for domain-specific formatting.
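The rendering step can be sketched in plain Python. This is a stand-in for the Jinja2 template described above, not the library's actual template: the function name `render_prompt`, the wording of the memory instructions, and the `User:`/`Assistant:` labels are all assumptions made for illustration.

```python
def render_prompt(text, memory):
    """Combine the user's query with prior (query, answer) exchanges.

    Hypothetical stand-in for the Jinja2 template: the query comes first,
    followed by an optional memory section.
    """
    if not memory:
        # No prior exchanges: the prompt is just the query
        return text

    lines = [
        text,
        "",
        "Use the following conversation history if it is relevant to the request.",
        "Ignore it otherwise.",
        "",
    ]
    for query, answer in memory:
        lines.append(f"User: {query}")
        lines.append(f"Assistant: {answer}")
    return "\n".join(lines)


# First call: no memory, so the prompt equals the query
first = render_prompt("first question", [])

# Follow-up call: the earlier exchange is appended after the query
followup = render_prompt("second question", [("first question", "placeholder answer")])
```

Keeping the query first and the history second mirrors the ordering the default template is described as using.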
4. Agent Loop Execution
The rendered prompt is passed to self.process.run(), which triggers the smolagents iterative loop:
- The LLM generates a response that may contain a tool-call action.
- If a tool call is detected, the framework executes the corresponding tool's forward method and appends the result as an observation.
- The LLM is called again with the updated context (including the observation).
- This repeats until the LLM produces a final answer or max_steps is reached.
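The loop described above can be illustrated with a minimal sketch. It does not use smolagents. `run_loop`, `scripted_llm`, and the dictionary-based step format are hypothetical names and conventions chosen for this example, and returning `None` when the step budget runs out is a simplification rather than a description of the framework's behaviour.

```python
def run_loop(llm, tools, prompt, max_steps=5):
    """Alternate LLM calls and tool calls until a final answer or max_steps."""
    context = [prompt]
    for _ in range(max_steps):
        step = llm(context)
        if "final_answer" in step:
            return step["final_answer"]
        # Tool call detected: run the tool and record its result as an observation
        tool = tools[step["tool"]]
        observation = tool(**step["args"])
        context.append(f"Observation: {observation}")
    return None  # step budget exhausted without a final answer


def scripted_llm(context):
    """Hypothetical stand-in for an LLM: calls a tool once, then answers."""
    if len(context) == 1:
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    # Use the latest observation as the final answer
    return {"final_answer": context[-1].split(": ", 1)[1]}


tools = {"add": lambda a, b: a + b}
answer = run_loop(scripted_llm, tools, "What is 2 + 3?")
```

The `max_steps` bound is what guarantees termination when the model keeps requesting tools instead of answering.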
5. Memory Update
After the loop completes, the (text, output) pair is appended to the memory deque (if memory is enabled). This makes the exchange available as context for subsequent calls.
6. Return
The final output is returned to the caller. If stream=True is passed, the output is delivered incrementally rather than as a single completed string.
Usage
Use Agent Execution when:
- You have a configured Agent instance and need to process a user query.
- You want multi-turn conversations where the agent remembers prior exchanges.
- You need to control sequence length limits on a per-call basis.
- You want to stream responses for real-time user interfaces.
- You need to reset conversation context at the start of a new session.
Theoretical Basis
1. The ReAct Loop
Agent Execution implements the ReAct (Reasoning + Acting) paradigm. The LLM alternates between generating reasoning traces ("I need to search for X") and action calls (invoking the search tool). Each observation feeds back into the next reasoning step, forming a closed loop that continues until the model emits a final answer or the step limit (max_steps) is reached.
Agent.__call__(text)
  |
  +-> parameters(maxlength)            # configure LLM token limit
  +-> reset memory (if requested)
  +-> prompt = render(text, memory)    # Jinja2 template
  +-> output = process.run(prompt)     # smolagents loop:
  |     |
  |     +-> LLM(prompt) -> thought + action
  |     +-> Tool.forward(args) -> observation
  |     +-> LLM(prompt + observation) -> thought + action | final_answer
  |     +-> ... (repeat up to max_steps)
  |
  +-> memory.append((text, output))    # update sliding window
  +-> return output
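The call flow in the diagram can be mirrored by a small stand-in class. `ToyAgent` is hypothetical and is not the library's Agent class. Its `run` argument stands in for process.run, and its inline prompt format is a placeholder for the real template.

```python
from collections import deque


class ToyAgent:
    """Hypothetical stand-in that mirrors the call flow, step by step."""

    def __init__(self, run, memory=2):
        self.run = run  # stands in for process.run
        self.memory = deque(maxlen=memory) if memory else None
        self.maxlength = None

    def __call__(self, text, maxlength=8192, reset=False):
        # 1. Parameter setup
        self.maxlength = maxlength

        # 2. Optional memory reset
        if reset and self.memory is not None:
            self.memory.clear()

        # 3. Prompt rendering (placeholder format)
        history = list(self.memory) if self.memory is not None else []
        prompt = f"{text}\nHistory: {history}" if history else text

        # 4. Agent loop execution
        output = self.run(prompt)

        # 5. Memory update
        if self.memory is not None:
            self.memory.append((text, output))

        # 6. Return
        return output


agent = ToyAgent(run=lambda prompt: f"echo[{len(prompt)}]")
agent("first question")
agent("second question")
size_before = len(agent.memory)   # two exchanges remembered

agent("new topic", reset=True)
size_after = len(agent.memory)    # memory cleared, then one new exchange stored
```

Note that the reset happens before rendering, so a call with reset=True sees no history but still records its own exchange afterwards.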
2. Prompt Templating with Memory
The Jinja2 template mechanism cleanly separates the query content, the memory context, and the formatting logic. The default template phrases memory as an instruction: it tells the agent to use the conversation history when relevant and to ignore it otherwise. This is intended to reduce the risk that stale context overrides fresh tool results.
3. Bounded Memory
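By using a collections.deque with a fixed maxlen, the agent automatically evicts the oldest exchanges when memory is full. This provides a simple form of recency bias: the agent always has access to the most recent interactions, and the bounded history limits how much context memory adds to each prompt. Note that maxlen caps the number of stored exchanges, not their token count, so unusually long exchanges can still approach the LLM's context window.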
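The eviction behaviour comes directly from the standard library and can be checked in a few lines. The window size of 2 and the placeholder strings are arbitrary choices for this example.

```python
from collections import deque

# A memory window that holds at most two exchanges
memory = deque(maxlen=2)

for turn in range(1, 4):
    # Appending to a full deque silently drops the oldest entry
    memory.append((f"question {turn}", f"answer {turn}"))

remaining = [query for query, _ in memory]
# remaining == ["question 2", "question 3"]
```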
4. Streaming Support
The stream parameter is forwarded to process.run(), enabling incremental output delivery. This is useful for interactive applications, where showing partial output as it is produced reduces perceived latency.
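The difference between the two delivery modes can be sketched with a generator. The `run` function below is a hypothetical stand-in, not process.run. It assumes that streamed output arrives as an iterable of text chunks, which the description above does not specify.

```python
def run(prompt, stream=False):
    """Hypothetical runner: return the whole answer, or yield it in chunks."""
    words = f"Answer to: {prompt}".split(" ")
    chunks = (word + " " for word in words)
    if stream:
        return chunks  # caller consumes pieces as they become available
    return "".join(chunks).strip()


# Non-streaming: one completed string
complete = run("hello")

# Streaming: pieces arrive one at a time; a UI would display each immediately
received = []
for piece in run("hello", stream=True):
    received.append(piece)
streamed = "".join(received).strip()
```

Both modes produce the same text. Only the timing of delivery differs.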