Principle:Ollama Ollama ThinkingSupport

Knowledge Sources	Ollama
Domains	Reasoning, Chain-of-Thought
Last Updated	2025-02-15 00:00 GMT

Overview

Thinking Support enables chain-of-thought reasoning in LLM output by detecting, parsing, and structuring "thinking" segments within the model's generation stream, separating internal reasoning from the final user-facing response.

Core Concepts

Chain-of-Thought Paradigm

Chain-of-thought (CoT) prompting encourages models to articulate intermediate reasoning steps before producing a final answer. Models trained with thinking capabilities emit their reasoning within designated tags (e.g., <think>...</think>) followed by the actual response. This separation allows the system to expose or hide the reasoning trace depending on user preference and application requirements.

Thinking Tag Detection

The thinking parser monitors the token stream for opening and closing thinking tags. Different model families use different tag formats (e.g., <think>, <|thinking|>). The parser must handle these variants and correctly identify the boundaries of thinking segments even when tokens are emitted incrementally and tag boundaries may span multiple tokens.

Stream Segmentation

Once thinking tags are detected, the output stream is segmented into thinking content and response content. The thinking content captures the model's intermediate reasoning and is typically delivered in a separate field of the API response. The response content contains only the final answer intended for the end user. This segmentation happens in real-time as tokens are generated, maintaining streaming capability.

Template Integration

Thinking support is integrated with the prompt template system. Templates for thinking-capable models include instructions or system prompts that activate the thinking mode. The template can also control whether thinking is enabled or disabled for a given request, and how the thinking tags are formatted in the prompt context.

User Control

The API exposes a think parameter that allows users to enable or disable thinking mode per request. When disabled, thinking tags are still parsed but the thinking content is suppressed from the output. When enabled, both thinking and response segments are delivered, allowing applications to display the reasoning process alongside the answer.

Implementation Notes

The thinking parser is implemented in thinking/parser.go, which provides a state machine that processes the token stream and identifies thinking boundaries. The template integration is in thinking/template.go, which defines how thinking tags are rendered in prompt templates for different model families. The parser integrates with the runner's output processing to split the generation stream into thinking and content segments before they reach the API response layer.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment