Principle:Ollama Ollama HardwareDiscovery
| Knowledge Sources | |
|---|---|
| Domains | Hardware Detection, Resource Management |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Hardware Discovery is the principle of dynamically detecting available computational resources (GPUs, CPUs, accelerators) at runtime, querying their capabilities (memory, compute capacity, driver versions), and making informed decisions about resource allocation for inference workloads. This enables a single application binary to adapt to diverse hardware environments without manual configuration.
Core Concepts
GPU Enumeration
GPU enumeration involves querying the operating system and GPU driver APIs to discover all available graphics processing units. On NVIDIA systems, this uses the NVML (NVIDIA Management Library) or CUDA runtime APIs. On AMD systems, it uses ROCm/HIP APIs or sysfs interfaces. On Apple Silicon, it uses Metal device queries. On Intel systems, it uses oneAPI/Level Zero. Each discovered GPU is identified by attributes such as its device index, name, PCI bus ID (where the device sits on a PCI bus), and compute capability or the vendor's equivalent architecture identifier. These attributes form the basis for subsequent capability queries and workload assignment.
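As a rough illustration of enumeration on NVIDIA systems, the sketch below shells out to the `nvidia-smi` command-line tool (which reports NVML data) instead of calling NVML directly. The `GpuInfo` type and the function names are hypothetical, chosen for this example only, and a missing or failing tool is treated as "no GPUs found".

```python
import shutil
import subprocess
from dataclasses import dataclass

# Fields requested from nvidia-smi; memory values are reported in MiB.
QUERY_FIELDS = "index,name,pci.bus_id,memory.total,memory.free"


@dataclass
class GpuInfo:  # hypothetical record type for this sketch
    index: int
    name: str
    pci_bus_id: str
    total_mib: int
    free_mib: int


def parse_gpu_csv(text: str) -> list[GpuInfo]:
    """Parse `--format=csv,noheader,nounits` output, skipping malformed rows."""
    gpus = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 5:
            continue
        index, name, bus_id, total, free = parts
        try:
            gpus.append(GpuInfo(int(index), name, bus_id, int(total), int(free)))
        except ValueError:
            continue  # e.g. a field reported as "[N/A]"
    return gpus


def enumerate_nvidia_gpus() -> list[GpuInfo]:
    """Return the visible NVIDIA GPUs, or an empty list if none can be queried."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, timeout=10, check=True,
        )
    except (subprocess.SubprocessError, OSError):
        return []
    return parse_gpu_csv(result.stdout)
```

Returning an empty list rather than raising keeps the caller's fallback path (other vendors, or CPU-only) simple.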
Capability Assessment
Once GPUs are enumerated, the system assesses each device's capabilities: total VRAM, available (free) VRAM, compute capability version, supported data types (FP16, BF16, INT8), maximum thread block dimensions, shared memory size, and memory bandwidth. For CPUs, this includes core count, SIMD instruction set support (AVX, AVX2, AVX-512, NEON), cache sizes, and available system RAM. This information determines which models can fit on which devices and what quantization formats and batch sizes are feasible.
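The CPU side of capability assessment can be sketched on Linux by reading the flag list in `/proc/cpuinfo`. This is a minimal, Linux-only illustration; the function names and the flag-to-feature mapping are assumptions made for the example, and other platforms would need a different source of CPU feature data.

```python
# Map kernel-reported flag names to the SIMD families named in the text.
# "avx512f" is the AVX-512 foundation flag; "asimd" is how aarch64 reports NEON.
SIMD_FLAGS = {
    "avx": "AVX",
    "avx2": "AVX2",
    "avx512f": "AVX-512",
    "neon": "NEON",
    "asimd": "NEON",
}


def parse_simd_features(cpuinfo_text: str) -> set[str]:
    """Extract supported SIMD families from /proc/cpuinfo-style text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        # x86 kernels use a "flags" line, ARM kernels a "Features" line.
        if key.strip().lower() in ("flags", "features"):
            for flag in value.split():
                if flag in SIMD_FLAGS:
                    found.add(SIMD_FLAGS[flag])
    return found


def detect_simd_features(path: str = "/proc/cpuinfo") -> set[str]:
    """Return detected SIMD families, or an empty set if the file is unreadable."""
    try:
        with open(path) as f:
            return parse_simd_features(f.read())
    except OSError:
        return set()
```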
Driver and Library Compatibility
Hardware discovery must verify that the installed driver versions and compute libraries are compatible with the application's requirements. A detected CUDA GPU is usable only if the installed driver is recent enough to support the CUDA runtime version the application was built against, and the necessary shared libraries (libcudart, libcublas, libnccl) are present and loadable. The discovery process must gracefully handle cases where hardware is present but drivers are missing, outdated, or misconfigured.
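One way to approximate this check is to try loading the CUDA driver library and ask it for its version, treating every failure as "GPU not usable" rather than an error. The sketch below uses `ctypes` with the CUDA driver API functions `cuInit` and `cuDriverGetVersion`; the function names `probe_cuda_driver` and `decode_cuda_version` and the candidate library names are illustrative assumptions.

```python
import ctypes


def decode_cuda_version(raw: int) -> tuple[int, int]:
    """CUDA encodes versions as 1000 * major + 10 * minor (e.g. 12020 is 12.2)."""
    return raw // 1000, (raw % 1000) // 10


def probe_cuda_driver(candidates=("libcuda.so.1", "nvcuda.dll")):
    """Return the driver's supported CUDA version as (major, minor), or None.

    None covers every unusable case: library missing, not loadable,
    missing symbols, or initialization failing (for example, no device).
    """
    for name in candidates:
        try:
            lib = ctypes.CDLL(name)
        except OSError:
            continue  # not present or not loadable; try the next candidate
        try:
            if lib.cuInit(0) != 0:
                return None
            version = ctypes.c_int(0)
            if lib.cuDriverGetVersion(ctypes.byref(version)) != 0:
                return None
        except AttributeError:
            return None  # library loaded but lacks the expected symbols
        return decode_cuda_version(version.value)
    return None
```

A caller could compare the returned tuple against the minimum version its binaries need and fall back to CPU inference when the result is `None` or too old.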
Resource Allocation Strategy
Based on discovery results, the system determines how to distribute model layers across available devices. This involves calculating the memory footprint of each model layer, determining how many layers fit in each GPU's available VRAM, deciding which layers (if any) must remain on CPU, and configuring the tensor split ratios for multi-GPU inference. The strategy must account for memory overhead from KV caches, activation buffers, and CUDA/ROCm context memory that reduces the VRAM available for model weights.
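The layer-placement calculation can be illustrated with a simplified greedy sketch. It assumes every layer has the same size and folds KV cache, activation buffers, and context memory into a single per-GPU `overhead_bytes` reservation; a real scheduler would estimate these separately. The function name and parameters are hypothetical.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def plan_layer_split(num_layers, layer_bytes, gpu_free_bytes, overhead_bytes):
    """Greedily assign layers to GPUs in order; leftovers stay on the CPU.

    Returns (layers_per_gpu, cpu_layers, tensor_split), where tensor_split
    is each GPU's share of the offloaded layers (empty if nothing fits).
    """
    remaining = num_layers
    layers_per_gpu = []
    for free in gpu_free_bytes:
        # Reserve overhead first; a GPU with too little free VRAM gets zero layers.
        usable = max(0, free - overhead_bytes)
        fit = min(remaining, usable // layer_bytes)
        layers_per_gpu.append(fit)
        remaining -= fit
    offloaded = sum(layers_per_gpu)
    tensor_split = [n / offloaded for n in layers_per_gpu] if offloaded else []
    return layers_per_gpu, remaining, tensor_split


# Example: 32 layers of 500 MiB, GPUs with 8 GiB and 4 GiB free, 1 GiB reserved each.
per_gpu, cpu_layers, split = plan_layer_split(
    32, 500 * MIB, [8 * GIB, 4 * GIB], 1 * GIB
)
print(per_gpu, cpu_layers, split)
```

In this example the first GPU takes 14 layers, the second takes 6, and the remaining 12 stay on the CPU, which corresponds to partial offload.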
Dynamic Adaptation
Hardware availability can change at runtime: GPUs may become available or unavailable due to other processes, VRAM pressure may change as models are loaded and unloaded, and thermal throttling may affect performance. A robust discovery system supports re-enumeration and re-assessment, allowing the application to adapt its resource allocation strategy in response to changing conditions rather than being locked into decisions made at startup.
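Re-enumeration can be sketched as a small cache around a probe function that re-runs the probe once its snapshot is older than a threshold, or on demand. The class name, the `max_age_s` parameter, and the injectable `clock` are assumptions for illustration; they do not describe any particular implementation.

```python
import time


class RefreshingDiscovery:
    """Caches a hardware snapshot and re-probes when it becomes stale."""

    def __init__(self, probe, max_age_s=5.0, clock=time.monotonic):
        self._probe = probe          # callable returning the current device list
        self._max_age_s = max_age_s  # how long a snapshot is trusted
        self._clock = clock          # injectable for testing
        self._snapshot = None
        self._taken_at = None

    def get(self, force=False):
        """Return the snapshot, re-probing if forced, missing, or too old."""
        now = self._clock()
        stale = self._taken_at is None or now - self._taken_at >= self._max_age_s
        if force or stale:
            self._snapshot = self._probe()
            self._taken_at = now
        return self._snapshot
```

A caller would typically pass `force=True` right after loading or unloading a model, since free VRAM is known to have changed at that point.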
Implementation Notes
In the Ollama codebase, hardware discovery is implemented through a dedicated discovery module that probes for NVIDIA GPUs (via NVML/CUDA), AMD GPUs (via ROCm/HIP and sysfs), Apple Silicon GPUs (via Metal), and Intel GPUs (via oneAPI). The module queries each discovered device for total and free VRAM, compute capability, and driver version. Discovery results are used by the scheduler to determine optimal layer placement across available devices, choosing between full GPU offload, partial offload (split across GPU and CPU), or CPU-only inference. The system also detects CPU features (AVX, AVX2, AVX-512) to select optimized CPU inference kernels. Library path resolution handles platform-specific library locations and version compatibility checking.