Principle:Ggml org Ggml CPU Tensor Operations
| Attribute | Value |
|---|---|
| Page Type | Principle |
| Full Name | Ggml_org_Ggml_CPU_Tensor_Operations |
| Short Name | CPU_Tensor_Operations |
| Domain Tags | Tensor_Operations, CPU |
| Knowledge Source | GGML |
| Last Updated | 2026-02-10 |
Overview
The CPU backend implements the full set of tensor operations (normalization, attention, convolution, pooling, activation functions, and more) as multithreaded kernels that serve as both the reference implementation and the universal fallback backend.
Description
CPU Tensor Operations is the principle of providing complete, correct, and multithreaded CPU implementations for every tensor operation defined in the GGML computation graph. The CPU backend serves a dual role: it is the reference implementation that defines the semantics of each operation, and it is the universal fallback that guarantees every operation can be executed regardless of what accelerator backends are available.
The CPU tensor operations are organized into two categories in GGML's codebase:
Multi-input operations (defined in ops.h / ops.cpp) encompass the majority of the operation set:
- Normalization: ggml_compute_forward_norm, ggml_compute_forward_rms_norm, ggml_compute_forward_group_norm, ggml_compute_forward_l2_norm
- Attention: ggml_compute_forward_flash_attn_ext, ggml_compute_forward_flash_attn_back
- Convolution: ggml_compute_forward_conv_2d, ggml_compute_forward_conv_3d, ggml_compute_forward_conv_transpose_1d, ggml_compute_forward_conv_transpose_2d, ggml_compute_forward_conv_2d_dw
- Pooling: ggml_compute_forward_pool_1d, ggml_compute_forward_pool_2d
- Matrix operations: ggml_compute_forward_mul_mat, ggml_compute_forward_out_prod
- Recurrent: ggml_compute_forward_ssm_conv, ggml_compute_forward_ssm_scan, ggml_compute_forward_rwkv_wkv6, ggml_compute_forward_rwkv_wkv7, ggml_compute_forward_gla
- Positional encoding: ggml_compute_forward_rope, ggml_compute_forward_rope_back
- Data manipulation: ggml_compute_forward_get_rows, ggml_compute_forward_set_rows, ggml_compute_forward_concat, ggml_compute_forward_pad
- Training: ggml_compute_forward_cross_entropy_loss, ggml_compute_forward_opt_step_adamw, ggml_compute_forward_opt_step_sgd
Unary operations (defined in unary-ops.h / unary-ops.cpp) implement element-wise functions:
- Activation functions: relu, sigmoid, tanh, elu, hardsigmoid, hardswish, softplus, xielu
- Mathematical functions: abs, sgn, neg, sqr, sqrt, sin, cos, log, exp, expm1
- Rounding functions: floor, ceil, round, trunc
Each operation accepts a ggml_compute_params structure that specifies the thread index and total thread count, enabling work partitioning across threads.
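As a rough illustration of the element-wise category, the sketch below applies a unary function to one thread's share of a contiguous float buffer. The names `unary_params_sketch`, `apply_unary_sketch`, and the `op_*` helpers are hypothetical and are not GGML's API; only the ith/nth fields mirror what the text describes. The hardsigmoid and hardswish formulas are the commonly used definitions, assumed here rather than copied from unary-ops.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the thread-related fields of ggml_compute_params.
struct unary_params_sketch {
    int ith; // index of the calling thread
    int nth; // total number of threads
};

// Commonly used activation definitions (assumed, not copied from GGML).
inline float op_relu(float x)        { return std::max(x, 0.0f); }
inline float op_hardsigmoid(float x) { return std::min(1.0f, std::max(0.0f, (x + 3.0f) / 6.0f)); }
inline float op_hardswish(float x)   { return x * op_hardsigmoid(x); }

// Apply `op` to the slice of `src` owned by thread `ith`, writing into `dst`.
// Elements outside that slice are left untouched for the other threads.
template <typename Op>
void apply_unary_sketch(const unary_params_sketch & p, Op op,
                        const std::vector<float> & src, std::vector<float> & dst) {
    assert(p.nth > 0 && p.ith >= 0 && p.ith < p.nth);
    assert(src.size() == dst.size());
    const std::size_t n     = src.size();
    const std::size_t chunk = (n + p.nth - 1) / p.nth; // ceil(n / nth)
    const std::size_t begin = std::min(n, chunk * p.ith);
    const std::size_t end   = std::min(n, begin + chunk);
    for (std::size_t i = begin; i < end; ++i) {
        dst[i] = op(src[i]);
    }
}
```

Calling `apply_unary_sketch` once per thread index, with the same `nth`, covers every element exactly once, so the threads never write to the same location.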
Usage
CPU tensor operations are used in virtually every GGML workload:
- Universal fallback: The CPU backend is always available and supports every operation. When a GPU backend does not implement a particular operation (or the operation's parameters fall outside what the GPU kernel supports), the backend scheduler automatically routes it to the CPU.
- CPU-only inference: On machines without GPUs or accelerators, the CPU backend handles the entire computation graph, making GGML functional on any platform with a C/C++ compiler.
- Mixed CPU-GPU execution: In multi-backend setups, the CPU handles operations that are not worth offloading to GPU (e.g., small element-wise operations, data rearrangement) while GPUs handle large matrix multiplications.
- Training workflows: Training-specific operations like cross-entropy loss computation and optimizer steps (AdamW, SGD) are implemented as CPU tensor operations.
Theoretical Basis
Complete Operation Coverage
A tensor computation framework must guarantee that every operation in its computation graph language can be executed. GGML achieves this by mandating that the CPU backend implements every defined operation. This design follows the universal backend pattern: one backend is designated as the fallback that can execute any operation, while specialized backends (GPU, accelerator) may implement subsets for better performance. The backend scheduler relies on this guarantee to always find at least one capable backend for each node.
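The fallback guarantee can be pictured with a minimal sketch. Everything here is hypothetical (`backend_sketch`, `pick_backend_sketch`, the string op names, and the example GPU coverage); GGML's real backend scheduler is considerably more involved. The sketch only shows why a backend that accepts every operation guarantees that selection always succeeds.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical backend description: a name plus a "can you run this op?" predicate.
struct backend_sketch {
    std::string name;
    std::function<bool(const std::string & op)> supports_op;
};

// Return the index of the first backend, in priority order, that supports `op`.
// Assumes the last entry is the CPU backend, which accepts every operation,
// so the function always returns a valid index.
inline std::size_t pick_backend_sketch(const std::vector<backend_sketch> & backends,
                                       const std::string & op) {
    assert(!backends.empty());
    for (std::size_t i = 0; i + 1 < backends.size(); ++i) {
        if (backends[i].supports_op(op)) {
            return i;
        }
    }
    return backends.size() - 1; // CPU fallback
}

// Example priority list: a made-up GPU backend with partial coverage, then CPU.
inline std::vector<backend_sketch> make_backends_sketch() {
    return {
        { "gpu", [](const std::string & op) { return op == "MUL_MAT" || op == "RMS_NORM"; } },
        { "cpu", [](const std::string &)    { return true; } },
    };
}
```

With this list, an operation the made-up GPU backend covers is assigned to it, and anything else lands on the CPU entry.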
Thread-Parallel Execution
Each CPU tensor operation is designed for parallel execution across multiple threads. The ggml_compute_params structure provides ith (current thread index) and nth (total thread count), and each kernel partitions its work accordingly, typically by dividing the outermost loop dimension among threads. This follows the fork-join parallelism model, in which the thread pool is managed externally and each operation kernel only needs to compute its own slice.
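A minimal sketch of this partitioning, assuming a row-major matrix whose rows are the outermost dimension. The struct and function names are hypothetical; only ith and nth come from the text. Each call computes a half-open row range for one thread and processes just those rows.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical reduced view of the per-thread parameters.
struct compute_params_sketch {
    int ith; // current thread index
    int nth; // total thread count
};

struct row_range {
    int64_t begin; // first row owned by the thread
    int64_t end;   // one past the last row owned by the thread
};

// Split nr rows into nth contiguous slices and return the slice for thread ith.
// Trailing threads may receive an empty range when nr < nth.
inline row_range thread_rows_sketch(const compute_params_sketch & p, int64_t nr) {
    assert(p.nth > 0 && p.ith >= 0 && p.ith < p.nth);
    const int64_t dr  = (nr + p.nth - 1) / p.nth; // rows per thread, rounded up
    const int64_t ir0 = std::min(nr, dr * p.ith);
    const int64_t ir1 = std::min(nr, ir0 + dr);
    return { ir0, ir1 };
}

// Example kernel: scale a row-major [nr x nc] matrix in place.
// Each call touches only the rows that belong to the calling thread.
inline void scale_rows_sketch(const compute_params_sketch & p, std::vector<float> & data,
                              int64_t nr, int64_t nc, float s) {
    const row_range r = thread_rows_sketch(p, nr);
    for (int64_t ir = r.begin; ir < r.end; ++ir) {
        for (int64_t ic = 0; ic < nc; ++ic) {
            data[ir * nc + ic] *= s;
        }
    }
}
```

Because the ranges are disjoint and together cover all rows, the kernel needs no locking; the external thread pool only has to wait for every thread to return before the next graph node starts.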
Cache-Aware Design
The CPU operations header defines CACHE_LINE_SIZE (64 bytes for most architectures, 128 for POWER9, 256 for s390x VXE) and CACHE_LINE_SIZE_F32 to enable cache-line-aligned data access patterns. Operations that accumulate partial results across threads use cache-line-aligned partitioning to avoid false sharing. The im2col work buffer (GGML_IM2COL_WORK_SIZE of 16 MiB) is pre-allocated to avoid repeated allocation during convolution operations.
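To show why spacing per-thread results one cache line apart avoids false sharing, here is a small sketch. The `_SKETCH` constants and all function names are hypothetical, and 64 bytes is assumed for illustration only. Each thread writes its partial sum to its own slot, and the slots are far enough apart that no two of them can share a cache line.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed value for illustration; the real constant varies by architecture.
constexpr std::size_t CACHE_LINE_SIZE_SKETCH     = 64;
constexpr std::size_t CACHE_LINE_SIZE_F32_SKETCH = CACHE_LINE_SIZE_SKETCH / sizeof(float);

// Scratch buffer with one float slot per thread, slots spaced one cache line apart.
inline std::vector<float> make_partials_sketch(int nth) {
    assert(nth > 0);
    return std::vector<float>(static_cast<std::size_t>(nth) * CACHE_LINE_SIZE_F32_SKETCH, 0.0f);
}

// Thread `ith` sums its contiguous slice of x and writes only to its own slot.
inline void partial_sum_sketch(int ith, int nth, const std::vector<float> & x,
                               std::vector<float> & partials) {
    assert(nth > 0 && ith >= 0 && ith < nth);
    assert(partials.size() >= static_cast<std::size_t>(nth) * CACHE_LINE_SIZE_F32_SKETCH);
    const std::size_t n     = x.size();
    const std::size_t chunk = (n + nth - 1) / nth;
    const std::size_t begin = std::min(n, chunk * ith);
    const std::size_t end   = std::min(n, begin + chunk);
    float acc = 0.0f; // accumulate locally, store once
    for (std::size_t i = begin; i < end; ++i) {
        acc += x[i];
    }
    partials[static_cast<std::size_t>(ith) * CACHE_LINE_SIZE_F32_SKETCH] = acc;
}

// After all threads have finished, a single thread combines the slots.
inline float reduce_partials_sketch(int nth, const std::vector<float> & partials) {
    float total = 0.0f;
    for (int ith = 0; ith < nth; ++ith) {
        total += partials[static_cast<std::size_t>(ith) * CACHE_LINE_SIZE_F32_SKETCH];
    }
    return total;
}
```

If the slots were adjacent floats instead, several threads would repeatedly write to the same cache line and force it to bounce between cores, even though no two threads touch the same element.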
Operation Composition
Many complex operations are decomposed into simpler primitives internally. For example, grouped convolutions use im2col (image-to-column) transformations to convert convolution into matrix multiplication, and softmax is built on vectorized exponential and summation primitives. This compositional approach reduces code duplication while leveraging the highly optimized lower-level primitives.
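The im2col idea can be sketched in a few lines. This is a simplified illustration, not GGML's kernel: it assumes a single input channel, stride 1, no padding, and the cross-correlation convention (no kernel flip), and every function name is hypothetical. The image is unfolded so that each output position becomes one row of patch values, after which the convolution is an ordinary matrix product with the filter bank.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unfold a single-channel H x W image into a row-major (OH*OW) x (KH*KW) matrix,
// one row per output position. Stride 1 and no padding are assumed.
inline std::vector<float> im2col_sketch(const std::vector<float> & img,
                                        int H, int W, int KH, int KW) {
    assert(H >= KH && W >= KW);
    assert(img.size() == static_cast<std::size_t>(H) * W);
    const int OH = H - KH + 1;
    const int OW = W - KW + 1;
    std::vector<float> cols(static_cast<std::size_t>(OH) * OW * KH * KW);
    std::size_t idx = 0;
    for (int oy = 0; oy < OH; ++oy)
        for (int ox = 0; ox < OW; ++ox)
            for (int ky = 0; ky < KH; ++ky)
                for (int kx = 0; kx < KW; ++kx)
                    cols[idx++] = img[(oy + ky) * W + (ox + kx)];
    return cols;
}

// C[M x N] = A[M x K] * B[N x K]^T: every row of A is dotted with every row of B.
inline std::vector<float> mul_mat_sketch(const std::vector<float> & a,
                                         const std::vector<float> & b,
                                         int M, int N, int K) {
    std::vector<float> c(static_cast<std::size_t>(M) * N, 0.0f);
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n)
            for (int k = 0; k < K; ++k)
                c[m * N + n] += a[m * K + k] * b[n * K + k];
    return c;
}

// Convolution as im2col followed by a matrix product.
// kernels is [OC x KH*KW]; the result is [OC x OH*OW].
inline std::vector<float> conv2d_im2col_sketch(const std::vector<float> & img, int H, int W,
                                               const std::vector<float> & kernels,
                                               int OC, int KH, int KW) {
    const int P = (H - KH + 1) * (W - KW + 1);
    const std::vector<float> cols = im2col_sketch(img, H, W, KH, KW);
    return mul_mat_sketch(kernels, cols, OC, P, KH * KW);
}

// Direct nested-loop convolution, used only to check the im2col path.
inline std::vector<float> conv2d_direct_sketch(const std::vector<float> & img, int H, int W,
                                               const std::vector<float> & kernels,
                                               int OC, int KH, int KW) {
    const int OH = H - KH + 1;
    const int OW = W - KW + 1;
    std::vector<float> out(static_cast<std::size_t>(OC) * OH * OW, 0.0f);
    for (int oc = 0; oc < OC; ++oc)
        for (int oy = 0; oy < OH; ++oy)
            for (int ox = 0; ox < OW; ++ox)
                for (int ky = 0; ky < KH; ++ky)
                    for (int kx = 0; kx < KW; ++kx)
                        out[(oc * OH + oy) * OW + ox] +=
                            kernels[(oc * KH + ky) * KW + kx] * img[(oy + ky) * W + (ox + kx)];
    return out;
}
```

The unfolded matrix duplicates overlapping patch values, which costs memory, but it lets the convolution reuse whatever optimized matrix-multiplication routine already exists.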