Principle:Neuml Txtai Agent Task Execution

Overview

Once an agent has been constructed with tools and an LLM, the final step is to run tasks through it. The Agent Task Execution principle covers how the agent's iterative tool-calling loop operates, how multi-turn reasoning unfolds, and how conversational memory enables context-aware interactions across calls.

The Agent Execution Loop

When a user submits a request to an agent, the following loop executes:

1. Prompt construction -- The user's text is combined with any available conversational memory using a Jinja2 template.
2. LLM inference -- The composed prompt is sent to the LLM, which either produces a final answer or emits a tool-call action.
3. Tool execution -- If the LLM requested a tool call, the orchestrator invokes the tool and collects the result.
4. Observation injection -- The tool's output is appended to the conversation history.
5. Iteration -- Steps 2-4 repeat until the LLM produces a final answer or the maximum step count is reached.

This loop is the heartbeat of the agent. Each iteration appends to the conversation history, so the model has progressively more information to reason with, and the prompt takes up progressively more of the context window.
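The loop described above can be sketched in a few lines of plain Python. This is an illustrative toy and not txtai's implementation: run_loop, scripted_llm, the add tool and the "name:argument" action format are all hypothetical names invented for the sketch, and the model is replaced by a scripted function so the example runs without any dependencies.

```python
from typing import Callable, Dict, List, Tuple

# One LLM turn is either ("final", answer) or ("tool", "name:argument").
# This action format is an assumption made for the sketch.
Action = Tuple[str, str]


def run_loop(llm: Callable[[List[str]], Action],
             tools: Dict[str, Callable[[str], str]],
             prompt: str,
             max_steps: int = 5) -> str:
    """Toy loop: infer, run a tool, record the observation, repeat."""
    history = [prompt]                            # step 1: the composed prompt
    for _ in range(max_steps):
        kind, payload = llm(history)              # step 2: LLM inference
        if kind == "final":
            return payload
        name, _, argument = payload.partition(":")
        result = tools[name](argument)            # step 3: tool execution
        history.append(f"Observation: {result}")  # step 4: observation injection
    return "Step limit reached without a final answer."


def scripted_llm(history: List[str]) -> Action:
    """Stand-in for a model: request one tool call, then answer from its output."""
    if len(history) == 1:
        return ("tool", "add:2,3")
    return ("final", history[-1].replace("Observation: ", "The sum is "))


def add(argument: str) -> str:
    """Hypothetical tool that adds two comma-separated integers."""
    left, right = argument.split(",")
    return str(int(left) + int(right))


answer = run_loop(scripted_llm, {"add": add}, "What is 2 + 3?")
print(answer)
```

The history list grows by one entry per tool call, which is the "growing context" the loop relies on.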

Multi-Turn Reasoning

Complex questions often require multiple tool calls, each building on the results of the previous one. Consider a request like: "What is the population of the country that won the most recent FIFA World Cup?"

The agent might reason through this as follows:

Step 1: Call the search tool with "most recent FIFA World Cup winner" to identify the country.
Step 2: Call the search tool again with "[country] population" to find the population figure.
Step 3: Synthesise the two results into a final answer.

Each step represents one iteration of the execution loop. The LLM must interpret the results from each step, decide whether more information is needed, and formulate the next query appropriately.

Step Limits

To prevent infinite loops (or excessively costly runs), agents enforce a maximum step count (max_steps). When this limit is reached, the agent must produce an answer with whatever information it has gathered so far. Setting the right step limit depends on:

Task complexity -- Simple lookups need 1-3 steps; research tasks may need 5-10.
Model capability -- More capable models tend to converge faster and so need fewer steps.
Cost budget -- Each step involves an LLM call, so more steps mean higher cost.
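To make the step limit concrete, the following sketch shows a loop that is cut off after max_steps iterations and then forced to answer from whatever it has gathered. The helper names (run_with_limit, never_satisfied, summarize) and the "final:" prefix are hypothetical; how a real agent builds its forced answer is not specified here and may differ.

```python
from typing import Callable, List


def run_with_limit(next_action: Callable[[List[str]], str],
                   finalize: Callable[[List[str]], str],
                   max_steps: int) -> str:
    """Stop after max_steps tool calls and answer from what was gathered."""
    observations: List[str] = []
    for _ in range(max_steps):
        action = next_action(observations)
        if action.startswith("final:"):
            return action[len("final:"):]
        observations.append(f"result of {action}")
    # Budget exhausted: force an answer from the partial evidence.
    return finalize(observations)


def never_satisfied(observations: List[str]) -> str:
    """A stub model that always asks for another search."""
    return f"search #{len(observations) + 1}"


def summarize(observations: List[str]) -> str:
    """Build a best-effort answer from the observations collected so far."""
    return f"Best effort answer from {len(observations)} observations"


forced = run_with_limit(never_satisfied, summarize, max_steps=3)
print(forced)
```

Because every pass through the loop stands for one LLM call, max_steps is also an upper bound on the number of model calls made before the forced answer.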

Memory in Agents

Memory allows an agent to maintain context across separate invocations (not just within a single run). This is distinct from the within-run conversation history, which is managed by the execution loop itself.

How Memory Works

When memory is enabled (via the memory parameter):

Each completed agent run appends the (request, response) pair to a bounded buffer (a deque with a maximum length).
On the next invocation, the stored pairs are injected into the prompt via the template's {{ memory }} placeholder.
The LLM can use this history to answer follow-up questions, maintain conversational state, or avoid repeating work.

Memory Design Considerations

Window size -- The memory parameter controls how many prior interactions are retained. Too few and the agent loses context; too many and the prompt becomes bloated.
Relevance -- The default template includes an instruction: "If the history is irrelevant, forget it and use other tools to answer the question." This gives the LLM permission to ignore stale context.
Reset -- The reset=True flag on Agent.__call__ clears memory, useful when starting a new conversation topic.
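The memory behaviour described in this section (a bounded buffer, injection into the next prompt, and an optional reset) can be modelled with the standard library alone. ToyMemoryAgent is a hypothetical stand-in, not the txtai class. The "User:/Assistant:" layout used for the history is an assumption, and the method returns the composed prompt rather than an answer so that the effect of memory is visible.

```python
from collections import deque
from typing import Deque, Tuple


class ToyMemoryAgent:
    """Hypothetical stand-in that only models the memory bookkeeping."""

    def __init__(self, memory: int):
        # Bounded buffer: once full, the oldest pair is dropped automatically.
        self.memory: Deque[Tuple[str, str]] = deque(maxlen=memory)

    def prompt(self, text: str) -> str:
        """Compose the prompt, adding the history only when there is some."""
        if not self.memory:
            return text
        history = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in self.memory)
        return (f"{text}\nUse the following conversation history "
                f"to help answer the question above.\n\n{history}")

    def __call__(self, text: str, reset: bool = False) -> str:
        if reset:
            self.memory.clear()               # start a fresh conversation
        prompt = self.prompt(text)
        response = f"answer to: {text}"       # placeholder for the real agent run
        self.memory.append((text, response))  # store the completed pair
        return prompt                         # returned for inspection only


agent = ToyMemoryAgent(memory=2)
agent("first")
agent("second")
third_prompt = agent("third")                  # sees "first" and "second"
fourth_prompt = agent("fourth")                # "first" has been evicted
fresh_prompt = agent("new topic", reset=True)  # history cleared before the run
```

With a window of 2, the third call still sees the first exchange while the fourth does not, which is the trade-off the window size controls.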

Streaming

The execution loop supports streaming mode (stream=True), where partial results are yielded as the LLM generates them. Streaming is useful for:

User experience -- Showing incremental progress rather than a long wait.
Long-running tasks -- Providing intermediate feedback during multi-step reasoning.

In streaming mode, the caller receives an iterator rather than a final string.
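The difference between the two return types can be illustrated with a generator. toy_agent below is a hypothetical function, not the txtai API. The chunk contents are invented, and the element type a real agent yields while streaming is an assumption here.

```python
from typing import Iterator, List, Union


def toy_agent(text: str, stream: bool = False) -> Union[str, Iterator[str]]:
    """Return the whole answer, or an iterator of chunks when stream=True."""
    chunks: List[str] = ["Searching... ", "found 2 results. ", f"Answer for: {text}"]

    def generate() -> Iterator[str]:
        for chunk in chunks:
            yield chunk  # hand each piece to the caller as soon as it is ready

    return generate() if stream else "".join(chunks)


# Non-streaming call: a single string once everything is done.
full = toy_agent("status?")

# Streaming call: iterate and display pieces as they arrive.
pieces = []
for piece in toy_agent("status?", stream=True):
    pieces.append(piece)  # a UI would render each piece here
```

Callers that need the complete text in streaming mode have to join the pieces themselves, as the sketch implies.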

Prompt Templating

The agent uses Jinja2 templates to compose the final prompt from the user's text and memory context. The default template is:

{{ text }}
{% if memory %}
Use the following conversation history to help answer the question above.

{{ memory }}

If the history is irrelevant, forget it and use other tools to answer the question.
{% endif %}

Custom templates can be provided to:

Add domain-specific instructions.
Change how memory is presented to the model.
Include additional context or constraints.

Templates must include the {{ text }} placeholder at minimum. The {{ memory }} placeholder is required for memory-enabled agents.
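A custom template can be checked against the placeholder rule above before it is handed to an agent. The helper below is a hypothetical convenience, not part of txtai, and the custom template text is invented for illustration. The regular expression only recognises the plain {{ name }} form, so a template that applies filters or uses other Jinja2 syntax would need a more thorough check.

```python
import re
from typing import List

# Matches a plain Jinja2 variable expression such as {{ text }} or {{text}}.
PLACEHOLDER = r"\{\{\s*%s\s*\}\}"


def missing_placeholders(template: str, memory_enabled: bool) -> List[str]:
    """Return the required placeholder names that the template does not contain."""
    required = ["text"] + (["memory"] if memory_enabled else [])
    return [name for name in required
            if not re.search(PLACEHOLDER % name, template)]


# Example custom template with a domain-specific instruction (illustrative only).
custom = (
    "You are a support assistant. Answer briefly.\n"
    "{{ text }}\n"
    "{% if memory %}\n"
    "Earlier turns:\n"
    "{{ memory }}\n"
    "{% endif %}\n"
)

print(missing_placeholders(custom, memory_enabled=True))
print(missing_placeholders("Answer: {{ text }}", memory_enabled=True))
```

An empty list means the template satisfies the rule; any names returned are the placeholders that still need to be added.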

Parameter Control

Before each agent run, the maxlength parameter is forwarded to the PipelineModel via self.process.model.parameters(maxlength). This allows callers to control the maximum generation length on a per-call basis, which is useful when:

Different tasks require different response lengths.
Context window budget needs to be managed dynamically.
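The forwarding pattern can be shown with stubs. StubModel and ToyAgent are hypothetical classes that mirror only the parameters(maxlength) call named above. The default of 8192 is an arbitrary value chosen for the sketch, and the returned string exists purely to make the configured limit visible.

```python
class StubModel:
    """Stands in for the wrapped model; it only records the configured length."""

    def __init__(self) -> None:
        self.maxlength = None

    def parameters(self, maxlength: int) -> None:
        self.maxlength = maxlength


class ToyAgent:
    """Hypothetical agent that forwards maxlength on every call."""

    def __init__(self, model: StubModel) -> None:
        self.model = model

    def __call__(self, text: str, maxlength: int = 8192) -> str:
        # Forward the per-call limit before any generation happens.
        self.model.parameters(maxlength)
        return f"(up to {self.model.maxlength} tokens) {text}"


model = StubModel()
agent = ToyAgent(model)

long_reply = agent("Write a detailed report")        # uses the sketch's default
short_reply = agent("Answer yes or no", maxlength=64)  # tighter limit for this call
```

Since the limit is applied on each call, a later call does not inherit the value used by an earlier one.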

Error Handling and Robustness

Several mechanisms help keep the agent loop robust:

Step limits prevent runaway loops.
Stop sequences ensure the LLM's output is properly terminated.
Tool error handling -- If a tool raises an exception, the error message is fed back to the LLM as an observation, allowing it to retry or choose a different approach.
Memory bounds -- The deque with a fixed maxlen prevents unbounded memory growth.
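The tool error handling item above can be sketched as a wrapper that converts exceptions into observation strings. call_tool, flaky_lookup and cached_lookup are hypothetical names, and the error message format is an assumption. The fallback decision is hard-coded here, whereas in a real run the LLM would make it after reading the observation.

```python
from typing import Callable, Dict, List


def call_tool(tools: Dict[str, Callable[[str], str]], name: str, argument: str) -> str:
    """Run a tool and turn any exception into an observation string."""
    try:
        return tools[name](argument)
    except Exception as error:  # broad on purpose: the model should see every failure
        return f"Error calling {name}: {type(error).__name__}: {error}"


def flaky_lookup(key: str) -> str:
    """Hypothetical tool that always fails."""
    raise TimeoutError("upstream service timed out")


def cached_lookup(key: str) -> str:
    """Hypothetical fallback tool that always succeeds."""
    return f"cached value for {key}"


tools = {"lookup": flaky_lookup, "cache": cached_lookup}
observations: List[str] = []

# First attempt fails; the failure is recorded instead of aborting the run.
observations.append(call_tool(tools, "lookup", "price"))

# A model reading that observation could switch tools on the next step.
if observations[-1].startswith("Error"):
    observations.append(call_tool(tools, "cache", "price"))
```

A request for an unknown tool name raises KeyError inside the same try block, so it is reported to the model in the same way.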

Relationship to the Agent Execution Workflow

Task execution is the final step in the Agent_Execution workflow:

Define tools (embeddings, functions, skills).
Configure the LLM.
Create the agent.
Run agent tasks -- Submit requests and receive answers through the iterative execution loop.

This is the step where the tools, LLM, and agent settings configured earlier are exercised against an actual request.
