Principle:Openclaw Openclaw LLM Inference

From Leeroopedia


Knowledge Sources
Domains: Agent_Runtime, LLM_Integration
Last Updated: 2026-02-06 12:00 GMT

Overview

LLM inference is the process of sending an assembled agent context to an AI provider, handling streaming responses and tool-use turns in a multi-turn loop, and producing a final result with reply payloads and execution metadata.

Description

Once the context assembly stage has prepared the system prompt, session history, tool set, and model selection, the runtime hands off to the inference layer. This layer is responsible for the actual interaction with the LLM provider, which may involve multiple round-trips when the model requests tool calls.

The inference loop follows a prompt-respond-tool-repeat cycle:

  1. Send the assembled prompt (system instructions, conversation history, user message, available tools) to the selected LLM provider.
  2. Stream the response, emitting partial text and reasoning tokens to registered callbacks.
  3. If the model requests tool calls (stop reason: toolUse), execute the requested tools within the sandbox and policy constraints, then feed the tool results back as a new turn.
  4. Repeat until the model produces a final text response (stop reason: endTurn) or an error/timeout occurs.
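The cycle above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the runtime's actual code: the type names (`ModelTurn`, `Message`, `Provider`, `ToolRunner`), the function `runInferenceLoop`, and the `maxTurns` safety cap are all hypothetical, and streaming (step 2) is left out for brevity. Only the stop reasons `toolUse` and `endTurn` come from the description.

```typescript
// Hypothetical types for illustration; not taken from the OpenClaw codebase.
type StopReason = "toolUse" | "endTurn";

interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

interface ModelTurn {
  stopReason: StopReason;
  text: string;
  toolCalls: ToolCall[];
}

type Message =
  | { role: "user" | "assistant"; text: string }
  | { role: "tool"; name: string; result: string };

type Provider = (history: Message[]) => Promise<ModelTurn>;
type ToolRunner = (call: ToolCall) => Promise<string>;

async function runInferenceLoop(
  provider: Provider,
  runTool: ToolRunner,
  history: Message[],
  maxTurns = 8, // assumed safety cap, not from the source
): Promise<{ reply: string; turns: number }> {
  for (let turn = 1; turn <= maxTurns; turn++) {
    // Step 1: send the assembled context to the provider.
    const result = await provider(history);
    history.push({ role: "assistant", text: result.text });

    // Step 4: a final text response ends the loop.
    if (result.stopReason === "endTurn") {
      return { reply: result.text, turns: turn };
    }

    // Step 3: run each requested tool and feed the result back as a new turn.
    for (const call of result.toolCalls) {
      let output: string;
      try {
        output = await runTool(call);
      } catch (err) {
        // Tool failures are reported to the model instead of aborting the run.
        output = `error: ${err instanceof Error ? err.message : String(err)}`;
      }
      history.push({ role: "tool", name: call.name, result: output });
    }
  }
  throw new Error(`no final response after ${maxTurns} turns`);
}
```

In this sketch a tool exception becomes a tool result, so the model sees the failure on its next turn and can retry with different arguments.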

The inference layer also handles several cross-cutting concerns:

Auth profile rotation: If the current API key fails (rate limit, billing, authentication error), the system advances to the next configured auth profile and retries. Each profile tracks cooldown state to avoid hammering a temporarily unavailable key.
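A minimal sketch of rotation with cooldown tracking follows. The names `AuthProfile`, `CallResult`, `callWithRotation`, and the 60-second `COOLDOWN_MS` are assumptions made for illustration; the source does not state how cooldown durations are chosen or stored.

```typescript
// Hypothetical shapes; the real profile store and error types may differ.
type AuthFailure = "rate_limit" | "billing" | "auth";

type CallResult<T> =
  | { ok: true; value: T }
  | { ok: false; failure: AuthFailure };

interface AuthProfile {
  id: string;
  cooldownUntil: number; // epoch milliseconds; 0 means the profile is available
}

const COOLDOWN_MS = 60000; // assumed cooldown length

async function callWithRotation<T>(
  profiles: AuthProfile[],
  call: (profile: AuthProfile) => Promise<CallResult<T>>,
  now: () => number = Date.now,
): Promise<{ value: T; profileId: string } | undefined> {
  for (const profile of profiles) {
    // Skip keys that failed recently so they are not hammered.
    if (profile.cooldownUntil > now()) continue;

    const result = await call(profile);
    if (result.ok) return { value: result.value, profileId: profile.id };

    // Mark the failing profile, then advance to the next one.
    profile.cooldownUntil = now() + COOLDOWN_MS;
  }
  return undefined; // every profile failed or is cooling down
}
```

Returning `undefined` here stands in for the exhausted case; per the description, the runtime signals that case with a FailoverError.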

Thinking level fallback: If the selected model rejects the requested thinking level (e.g., extended thinking requested on a model without that capability), the system falls back to a lower thinking level and retries.
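The fallback can be pictured as walking down an ordered list of levels. The level names below (`extended`, `standard`, `off`), their ordering, and the `Attempt` result shape are assumptions for this sketch; the source only says that the system drops to a lower level and retries.

```typescript
// Assumed ordering from most to least demanding; the real level set may differ.
const THINKING_LEVELS = ["extended", "standard", "off"] as const;
type ThinkingLevel = (typeof THINKING_LEVELS)[number];

type Attempt<T> =
  | { ok: true; value: T }
  | { ok: false; unsupportedThinking: boolean };

async function runWithThinkingFallback<T>(
  requested: ThinkingLevel,
  attempt: (level: ThinkingLevel) => Promise<Attempt<T>>,
): Promise<{ value: T; level: ThinkingLevel }> {
  const start = THINKING_LEVELS.indexOf(requested);
  // Try the requested level first, then each lower level in turn.
  for (const level of THINKING_LEVELS.slice(start)) {
    const result = await attempt(level);
    if (result.ok) return { value: result.value, level };
    if (!result.unsupportedThinking) {
      // Only thinking-level rejections trigger a fallback.
      throw new Error("request failed for a reason unrelated to thinking level");
    }
  }
  throw new Error("no supported thinking level");
}
```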

Context overflow recovery: If the prompt exceeds the model's context window, the system attempts automatic session compaction (summarizing older history) and retries up to a configured number of attempts.
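A rough sketch of the compact-and-retry path is shown below. History is modeled as plain strings, `compactOlderHalf` is a stand-in that replaces older messages with a marker line rather than a real summary, and `maxCompactions` is an assumed name for the configured attempt limit.

```typescript
type SendResult =
  | { ok: true; reply: string }
  | { ok: false; overflow: boolean };

// Stand-in compactor: replaces the older half of the history with one marker line.
async function compactOlderHalf(history: string[]): Promise<string[]> {
  const keep = Math.ceil(history.length / 2);
  const dropped = history.length - keep;
  if (dropped === 0) return history; // nothing left to compact
  return [`[summary of ${dropped} earlier messages]`, ...history.slice(-keep)];
}

async function sendWithCompaction(
  history: string[],
  send: (history: string[]) => Promise<SendResult>,
  compact: (history: string[]) => Promise<string[]>,
  maxCompactions = 2, // assumed default for the configured attempt limit
): Promise<{ reply: string; compactions: number }> {
  let current = history;
  for (let compactions = 0; ; compactions++) {
    const result = await send(current);
    if (result.ok) return { reply: result.reply, compactions };
    if (!result.overflow) {
      throw new Error("send failed for a reason other than context overflow");
    }
    if (compactions >= maxCompactions) {
      throw new Error(`context still overflows after ${compactions} compactions`);
    }
    // Shrink the history and try again.
    current = await compact(current);
  }
}
```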

Model failover: When all auth profiles for the primary model are exhausted, a FailoverError signals the caller to try the next model in the configured fallback chain.
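From the caller's side, failover might look like the sketch below. The name FailoverError comes from the description, but its constructor signature and `model` field here are hypothetical, as is the `runWithModelFallback` helper.

```typescript
// Hypothetical shape; only the class name appears in the source.
class FailoverError extends Error {
  readonly model: string;
  constructor(model: string) {
    super(`all auth profiles exhausted for ${model}`);
    this.name = "FailoverError";
    this.model = model;
    // Keeps instanceof working when compiled to older JavaScript targets.
    Object.setPrototypeOf(this, FailoverError.prototype);
  }
}

async function runWithModelFallback<T>(
  models: string[], // primary model first, then the fallback chain
  run: (model: string) => Promise<T>,
): Promise<{ value: T; model: string }> {
  const exhausted: string[] = [];
  for (const model of models) {
    try {
      return { value: await run(model), model };
    } catch (err) {
      // Only FailoverError moves on to the next model; other errors propagate.
      if (!(err instanceof FailoverError)) throw err;
      exhausted.push(model);
    }
  }
  throw new Error(`no model available; exhausted: ${exhausted.join(", ")}`);
}
```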

Concurrency control: Each inference run is enqueued in both a per-session lane and a global lane, so that only one inference runs per session at a time while separate sessions can run concurrently.

Usage

Apply this principle whenever:

  • Adding support for a new LLM provider or model family.
  • Modifying tool execution within the inference loop.
  • Changing retry, failover, or compaction behavior.
  • Integrating new streaming event types (e.g., new block reply modes).
  • Debugging inference failures: trace the auth profile rotation, thinking level fallback, and compaction retry paths.

Theoretical Basis

The inference engine implements a ReAct (Reasoning and Acting) loop, a well-established pattern in agentic AI in which the model alternates between reasoning (generating text) and acting (calling tools). The loop terminates when the model produces a final response without tool calls.

The resilience strategy follows a layered retry hierarchy:

  1. Tool-level retry: Individual tool failures are reported back to the model, which may retry with corrected arguments.
  2. Thinking-level fallback: If extended thinking fails, drop to standard thinking and retry.
  3. Auth-profile rotation: If the API key hits a rate limit, billing, or authentication error, rotate to the next profile.
  4. Compaction retry: If the context overflows, compact session history and retry.
  5. Model failover: If all profiles are exhausted, throw FailoverError for the caller to select a fallback model.

Streaming is implemented via provider-specific subscriptions that emit events for partial text, reasoning tokens, tool calls, and block boundaries. The runtime normalizes these into a uniform callback interface (onPartialReply, onBlockReply, onToolResult, onReasoningStream) that channel handlers consume for real-time delivery.
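The normalization step can be sketched as an adapter from provider events to the four callbacks named above. The `ProviderEvent` shapes, the callback argument lists, and the choice to pass deltas (rather than accumulated text) to `onPartialReply` are assumptions for illustration; real provider event formats vary.

```typescript
// Callback names come from the text; their signatures here are assumed.
interface StreamCallbacks {
  onPartialReply?: (text: string) => void;
  onBlockReply?: (block: string) => void;
  onToolResult?: (tool: string, result: string) => void;
  onReasoningStream?: (text: string) => void;
}

// Hypothetical provider-side events.
type ProviderEvent =
  | { type: "text_delta"; text: string }
  | { type: "reasoning_delta"; text: string }
  | { type: "tool_result"; tool: string; result: string }
  | { type: "block_end" };

function createStreamNormalizer(
  callbacks: StreamCallbacks,
): (event: ProviderEvent) => void {
  let block = ""; // text accumulated since the last block boundary
  return (event) => {
    switch (event.type) {
      case "text_delta":
        block += event.text;
        callbacks.onPartialReply?.(event.text);
        break;
      case "reasoning_delta":
        callbacks.onReasoningStream?.(event.text);
        break;
      case "tool_result":
        callbacks.onToolResult?.(event.tool, event.result);
        break;
      case "block_end":
        // Emit the completed block once, then start a new one.
        if (block.length > 0) callbacks.onBlockReply?.(block);
        block = "";
        break;
    }
  };
}
```

A channel handler would register only the callbacks it needs; unregistered callbacks are skipped by the optional calls.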

Concurrency is managed via named command queues (lanes). Each session has a dedicated lane for sequential turn processing, and a global lane provides provider-level throttling. Every run is enqueued in both lanes (the double-enqueue pattern), so turns within a session stay strictly ordered while the total number of concurrent runs stays within the global limit.
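A minimal sketch of the two-lane arrangement follows. The `Lane` class, the `enqueueInference` helper, the global limit of 2, and the choice to enter the session lane before the global lane are all assumptions; the source does not specify the enqueue order or the limits.

```typescript
// A lane runs at most `maxConcurrent` tasks at once; extra tasks wait in FIFO order.
class Lane {
  private active = 0;
  private readonly waiting: Array<() => void> = [];
  private readonly maxConcurrent: number;

  constructor(maxConcurrent: number) {
    this.maxConcurrent = maxConcurrent;
  }

  async enqueue<T>(task: () => Promise<T>): Promise<T> {
    if (this.active >= this.maxConcurrent) {
      // Wait until a finishing task hands over its slot.
      await new Promise<void>((resolve) => {
        this.waiting.push(() => resolve());
      });
    } else {
      this.active++;
    }
    try {
      return await task();
    } finally {
      const next = this.waiting.shift();
      if (next) next(); // transfer the slot to the next waiter
      else this.active--;
    }
  }
}

const sessionLanes = new Map<string, Lane>();
const globalLane = new Lane(2); // assumed global concurrency limit

function enqueueInference<T>(sessionId: string, task: () => Promise<T>): Promise<T> {
  let lane = sessionLanes.get(sessionId);
  if (!lane) {
    lane = new Lane(1); // one inference per session at a time
    sessionLanes.set(sessionId, lane);
  }
  const sessionLane = lane;
  // Double enqueue: take the session slot first, then a global slot.
  return sessionLane.enqueue(() => globalLane.enqueue(task));
}
```

Taking the session slot first means a queued turn does not occupy a global slot while it waits for an earlier turn in the same session.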
