Principle:Ollama Ollama ThinkingSupport
| Knowledge Sources | |
|---|---|
| Domains | Reasoning, Chain-of-Thought |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Thinking Support enables chain-of-thought reasoning in LLM output by detecting, parsing, and structuring "thinking" segments within the model's generation stream, separating internal reasoning from the final user-facing response.
Core Concepts
Chain-of-Thought Paradigm
Chain-of-thought (CoT) prompting encourages models to articulate intermediate reasoning steps before producing a final answer. Models trained with thinking capabilities emit their reasoning within designated tags (e.g., <think>...</think>) followed by the actual response. This separation allows the system to expose or hide the reasoning trace depending on user preference and application requirements.
Thinking Tag Detection
The thinking parser monitors the token stream for opening and closing thinking tags. Different model families use different tag formats (e.g., <think>, <|thinking|>). The parser must handle these variants and correctly identify the boundaries of thinking segments even when tokens are emitted incrementally and tag boundaries may span multiple tokens.
Stream Segmentation
Once thinking tags are detected, the output stream is segmented into thinking content and response content. The thinking content captures the model's intermediate reasoning and is typically delivered in a separate field of the API response. The response content contains only the final answer intended for the end user. This segmentation happens in real-time as tokens are generated, maintaining streaming capability.
Template Integration
Thinking support is integrated with the prompt template system. Templates for thinking-capable models include instructions or system prompts that activate the thinking mode. The template can also control whether thinking is enabled or disabled for a given request, and how the thinking tags are formatted in the prompt context.
User Control
The API exposes a think parameter that allows users to enable or disable thinking mode per request. When disabled, thinking tags are still parsed but the thinking content is suppressed from the output. When enabled, both thinking and response segments are delivered, allowing applications to display the reasoning process alongside the answer.
Implementation Notes
The thinking parser is implemented in thinking/parser.go, which provides a state machine that processes the token stream and identifies thinking boundaries. The template integration is in thinking/template.go, which defines how thinking tags are rendered in prompt templates for different model families. The parser integrates with the runner's output processing to split the generation stream into thinking and content segments before they reach the API response layer.