Workflow:Ollama Ollama Model Pull And Run
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Model_Management, Inference |
| Last Updated | 2026-02-14 22:00 GMT |
Overview
End-to-end process for pulling a pre-built model from the Ollama registry and running local inference via the CLI or REST API.
Description
This workflow covers the primary user journey in Ollama: starting the server, pulling a model from the registry, loading it into memory with GPU-aware scheduling, and generating text responses. The process handles content-addressable blob downloads with resume support, automatic GPU detection and layer distribution, KV cache allocation, and both streaming and non-streaming inference modes. It supports text generation, chat completion, and embedding generation.
Usage
Execute this workflow when you want to run a publicly available LLM locally. You have Ollama installed and want to interact with a model (e.g., Llama 3, Gemma 3, Qwen) either through the interactive CLI terminal or programmatically via the REST API. This is the most common entry point for all Ollama users.
Execution Steps
Step 1: Server Initialization
Start the Ollama server process, which initializes the HTTP API router (using the Gin framework), the model runner scheduler, GPU discovery, and the content-addressable blob storage system. The server binds to a configurable host and port (default: localhost:11434) and begins accepting API requests.
Key considerations:
- Environment variables (OLLAMA_HOST, OLLAMA_MODELS, OLLAMA_NUM_PARALLEL) configure server behavior
- GPU discovery runs at startup, detecting CUDA, ROCm, Metal, or Vulkan devices
- The scheduler initializes with awareness of available GPU memory for model placement decisions
Step 2: Model Resolution and Download
Resolve the requested model name to a manifest in the Ollama registry. The registry follows a Docker/OCI-inspired content-addressable storage model where each model consists of a manifest pointing to blob layers (weights, tokenizer, template, parameters). Download any missing blobs with parallel chunked transfers, resume support, and digest verification.
Key considerations:
- Model names follow the format library/model:tag (e.g., llama3:latest)
- Each blob is verified against its SHA-256 digest after download
- Downloads support resume from partial transfers and parallel chunk fetching
- The manifest and blobs are stored in the local content-addressable store
Step 3: Model Loading and GPU Scheduling
The scheduler receives an inference request and determines whether the model is already loaded or needs loading. It evaluates available GPU memory across all detected devices, calculates how many transformer layers can be offloaded to each GPU, and launches the appropriate runner process (llama.cpp-based or Go-native) with the computed layer distribution.
Key considerations:
- The scheduler supports concurrent model loading across multiple GPUs
- Models can be partially offloaded (some layers on GPU, rest on CPU)
- The keep_alive parameter controls how long a loaded model stays in memory (default: 5 minutes)
- If GPU memory is insufficient, the scheduler may evict a less recently used model
Step 4: Prompt Construction and Tokenization
Construct the input prompt by applying the model-specific chat template to the user's messages. The template engine matches the model's metadata to one of 20+ built-in template configurations (ChatML, Llama 2, Llama 3, Gemma, Mistral, etc.) and renders the conversation history with appropriate role markers, special tokens, and system prompts. The rendered text is then tokenized using the model's vocabulary (BPE, SentencePiece, or WordPiece).
Key considerations:
- Templates are auto-detected from GGUF metadata or can be overridden in the Modelfile
- Conversation history is truncated to fit within the model's context window
- System prompts can be set at model creation time or per-request
Step 5: Inference and Token Generation
Execute autoregressive token generation using the loaded model. The process runs a prefill pass over the input tokens (computing KV cache entries for the full prompt), then iteratively generates output tokens one at a time. Each generated token goes through the sampling pipeline (temperature scaling, top-k, top-p, min-p filtering) before being selected and appended to the sequence.
Key considerations:
- The KV cache stores attention state and supports reuse across requests sharing a prefix
- Sampling parameters (temperature, top_k, top_p, min_p) control output diversity
- Grammar-constrained decoding can enforce JSON schema compliance
- Streaming mode sends each token as it is generated; non-streaming waits for completion
Step 6: Response Delivery
Stream or batch the generated tokens back to the client as the API response. Each streaming chunk includes the generated token text, and the final response includes performance metrics (total duration, prompt evaluation count, generation token count, tokens per second). For chat endpoints, responses are structured with role and content fields matching the conversation format.
Key considerations:
- Streaming responses use newline-delimited JSON
- The final chunk includes done=true and timing statistics
- Tool calls detected in the output are parsed and returned as structured objects
- Thinking/reasoning content is separated into a dedicated field when the model supports it