Workflow:Ollama Ollama Model Pull And Run

Knowledge Sources	Ollama Ollama Docs Ollama API Reference
Domains	LLMs, Model_Management, Inference
Last Updated	2026-02-14 22:00 GMT

Overview

End-to-end process for pulling a pre-built model from the Ollama registry and running local inference via the CLI or REST API.

Description

This workflow covers the primary user journey in Ollama: starting the server, pulling a model from the registry, loading it into memory with GPU-aware scheduling, and generating text responses. The process handles content-addressable blob downloads with resume support, automatic GPU detection and layer distribution, KV cache allocation, and both streaming and non-streaming inference modes. It supports text generation, chat completion, and embedding generation.

Usage

Execute this workflow when you want to run a publicly available LLM locally. You have Ollama installed and want to interact with a model (e.g., Llama 3, Gemma 3, Qwen) either through the interactive CLI terminal or programmatically via the REST API. This is the most common entry point for all Ollama users.

Execution Steps

Step 1: Server Initialization

Start the Ollama server process, which initializes the HTTP API router (using the Gin framework), the model runner scheduler, GPU discovery, and the content-addressable blob storage system. The server binds to a configurable host and port (default: localhost:11434) and begins accepting API requests.

Key considerations:

Environment variables (OLLAMA_HOST, OLLAMA_MODELS, OLLAMA_NUM_PARALLEL) configure server behavior
GPU discovery runs at startup, detecting CUDA, ROCm, Metal, or Vulkan devices
The scheduler initializes with awareness of available GPU memory for model placement decisions

Step 2: Model Resolution and Download

Resolve the requested model name to a manifest in the Ollama registry. The registry follows a Docker/OCI-inspired content-addressable storage model where each model consists of a manifest pointing to blob layers (weights, tokenizer, template, parameters). Download any missing blobs with parallel chunked transfers, resume support, and digest verification.

Key considerations:

Model names follow the format library/model:tag (e.g., llama3:latest)
Each blob is verified against its SHA-256 digest after download
Downloads support resume from partial transfers and parallel chunk fetching
The manifest and blobs are stored in the local content-addressable store

Step 3: Model Loading and GPU Scheduling

The scheduler receives an inference request and determines whether the model is already loaded or needs loading. It evaluates available GPU memory across all detected devices, calculates how many transformer layers can be offloaded to each GPU, and launches the appropriate runner process (llama.cpp-based or Go-native) with the computed layer distribution.

Key considerations:

The scheduler supports concurrent model loading across multiple GPUs
Models can be partially offloaded (some layers on GPU, rest on CPU)
The keep_alive parameter controls how long a loaded model stays in memory (default: 5 minutes)
If GPU memory is insufficient, the scheduler may evict a less recently used model

Step 4: Prompt Construction and Tokenization

Construct the input prompt by applying the model-specific chat template to the user's messages. The template engine matches the model's metadata to one of 20+ built-in template configurations (ChatML, Llama 2, Llama 3, Gemma, Mistral, etc.) and renders the conversation history with appropriate role markers, special tokens, and system prompts. The rendered text is then tokenized using the model's vocabulary (BPE, SentencePiece, or WordPiece).

Key considerations:

Templates are auto-detected from GGUF metadata or can be overridden in the Modelfile
Conversation history is truncated to fit within the model's context window
System prompts can be set at model creation time or per-request

Step 5: Inference and Token Generation

Execute autoregressive token generation using the loaded model. The process runs a prefill pass over the input tokens (computing KV cache entries for the full prompt), then iteratively generates output tokens one at a time. Each generated token goes through the sampling pipeline (temperature scaling, top-k, top-p, min-p filtering) before being selected and appended to the sequence.

Key considerations:

The KV cache stores attention state and supports reuse across requests sharing a prefix
Sampling parameters (temperature, top_k, top_p, min_p) control output diversity
Grammar-constrained decoding can enforce JSON schema compliance
Streaming mode sends each token as it is generated; non-streaming waits for completion

Step 6: Response Delivery

Stream or batch the generated tokens back to the client as the API response. Each streaming chunk includes the generated token text, and the final response includes performance metrics (total duration, prompt evaluation count, generation token count, tokens per second). For chat endpoints, responses are structured with role and content fields matching the conversation format.

Key considerations:

Streaming responses use newline-delimited JSON
The final chunk includes done=true and timing statistics
Tool calls detected in the output are parsed and returned as structured objects
Thinking/reasoning content is separated into a dedicated field when the model supports it

Execution Diagram

GitHub URL

Workflow Repository