Heuristic: LaurentMazare tch-rs Hidden Dimension Alignment
| Knowledge Sources | |
|---|---|
| Domains | Optimization, LLMs |
| Last Updated | 2026-02-08 13:00 GMT |
## Overview
Aligning MLP hidden dimensions to multiples of 256 improves GPU memory access patterns and compute throughput for transformer feed-forward layers.
## Description
In the LLaMA transformer implementation, the MLP hidden dimension is computed as `8/3 * n_embd` (following the SwiGLU architecture) and then rounded up to the next multiple of 256. This alignment ensures that matrix multiplications in the feed-forward layers operate on tensor dimensions that are friendly to GPU memory coalescing and CUDA kernel tile sizes. Misaligned dimensions can cause partial warp utilization and memory bank conflicts, reducing throughput.
## Usage
Apply this heuristic when implementing transformer feed-forward layers (MLP blocks), especially for LLM architectures that use SwiGLU or gated activations. The alignment is most impactful on NVIDIA GPUs where CUDA kernels are optimized for dimensions that are multiples of 64, 128, or 256.
## The Insight (Rule of Thumb)
- Action: Round MLP hidden dimensions up to the next multiple of 256 using `(n_hidden - 1) / 256 * 256 + 256` (integer division; exact multiples are left unchanged).
- Value: 256 (alignment boundary). The base hidden size is `8 * n_embd / 3`.
- Trade-off: Slightly increases parameter count (up to 255 extra neurons) in exchange for better GPU utilization.
- Compatibility: Beneficial on all NVIDIA GPUs. Marginal impact on CPU.
## Reasoning
GPU matrix multiplication (GEMM) kernels divide the work into tiles. When matrix dimensions align with the tile sizes (commonly 64, 128, or 256), every tile is fully utilized and no computation is wasted on padding. The LLaMA architecture scales the conventional 4x FFN width by 2/3 for its gated MLP (the SwiGLU pattern), giving a hidden size of `8/3 * n_embd`, which is rarely a round number; the alignment step corrects this. For a 4096-dimensional model, the raw hidden size is 10922, which aligns up to 11008 (43 * 256). This is the standard hidden dimension used in LLaMA-7B.
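As a sanity check on the arithmetic above, the rounding can be reproduced in a few lines of plain Rust. The helper name `align_up` is introduced here for illustration; the body is the exact expression from the example code, and integer division truncates toward zero just as in the original:

```rust
/// Round `n_hidden` up to the next multiple of 256.
/// Exact multiples of 256 are left unchanged.
fn align_up(n_hidden: i64) -> i64 {
    (n_hidden - 1) / 256 * 256 + 256
}

fn main() {
    let n_embd: i64 = 4096; // LLaMA-7B embedding size
    let raw = 8 * n_embd / 3; // 10922: the 8/3 SwiGLU scaling
    let aligned = align_up(raw); // 11008: the published LLaMA-7B hidden dim
    println!("raw = {raw}, aligned = {aligned}");
    // An exact multiple passes through untouched.
    assert_eq!(align_up(10752), 10752);
}
```

Note that the `n_hidden - 1` term is what makes exact multiples stable: a plain `n_hidden / 256 * 256 + 256` would inflate an already-aligned dimension by a full 256.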
## Code Evidence
Dimension alignment from `examples/llama/main.rs:132-133`:
```rust
let n_hidden = 8 * n_embd / 3;
let n_hidden = (n_hidden - 1) / 256 * 256 + 256;
```
The aligned dimension is used for the gated MLP (SwiGLU pattern) from `examples/llama/main.rs:134-138`:
```rust
let c = nn::LinearConfig { bias: false, ..Default::default() };
let c_fc1 = nn::linear(&vs / "c_fc1", n_embd, n_hidden, c);
let c_fc2 = nn::linear(&vs / "c_fc2", n_embd, n_hidden, c);
let c_proj = nn::linear(&vs / "c_proj", n_hidden, n_embd, c);
```
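For context on why there are two `n_embd -> n_hidden` projections: in the SwiGLU pattern the `c_fc1` path is passed through SiLU and gates the `c_fc2` path element-wise, and `c_proj` maps the gated hidden vector back to `n_embd`. A minimal scalar sketch in plain Rust (the function names here are illustrative, not from the tch-rs example; SiLU is assumed to be the standard `x * sigmoid(x)`):

```rust
/// SiLU activation: x * sigmoid(x), the activation used by SwiGLU.
fn silu(x: f64) -> f64 {
    x / (1.0 + (-x).exp())
}

/// Per-element SwiGLU gating for one hidden unit: the SiLU-activated
/// c_fc1 output gates the linear c_fc2 output. The full layer applies
/// this across all n_hidden units before the c_proj down-projection.
fn swiglu_gate(fc1_out: f64, fc2_out: f64) -> f64 {
    silu(fc1_out) * fc2_out
}

fn main() {
    // A zero gate suppresses the corresponding c_fc2 channel entirely.
    println!("{}", swiglu_gate(0.0, 5.0));
}
```

This is why the alignment applies to `n_hidden` in both up-projections at once: `c_fc1` and `c_fc2` must share the same (aligned) output width for the element-wise gate to be well-defined.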