Heuristic: Hugging Face alignment-handbook Liger Kernel Memory
| Knowledge Sources | Details |
|---|---|
| Domains | Optimization, Deep_Learning |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Enable Liger Kernel for fused Triton operations that reduce GPU memory usage and improve training throughput on long sequences.
Description
Liger Kernel provides fused Triton kernels for common transformer operations (cross-entropy, RMS norm, SwiGLU, etc.) that are more memory-efficient than standard PyTorch implementations. The alignment-handbook uses Liger Kernel in all SmolLM3 recipes (mid-training, SFT, and DPO) to enable training with very long sequences (up to 65536 tokens) on limited GPU memory.
Usage
Apply this when training with long sequences (8k+ tokens) or when GPU memory is constrained. Particularly beneficial for large-scale SFT and DPO training with models like SmolLM3.
The Insight (Rule of Thumb)
- Action: Set `use_liger_kernel: true` in the training config.
- Value: Reduces peak memory usage, enabling longer sequences or larger effective batch sizes.
- Trade-off: Requires the `liger-kernel` package (>= 0.6.0). The first run may incur minor overhead from Triton kernel compilation.
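The version requirement can be checked at runtime before enabling the flag. A minimal sketch (the helper names here are ours, not part of the handbook or liger-kernel):

```python
from importlib import metadata


def meets_requirement(installed: str, required: str = "0.6.0") -> bool:
    """Compare dotted version strings numerically (ignores pre-release tags)."""
    def to_tuple(v: str) -> tuple:
        return tuple(int(p) for p in v.split(".")[:3])
    return to_tuple(installed) >= to_tuple(required)


def liger_available() -> bool:
    """True if liger-kernel is installed and satisfies the >=0.6.0 pin."""
    try:
        return meets_requirement(metadata.version("liger-kernel"))
    except metadata.PackageNotFoundError:
        return False
```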
Reasoning
Standard PyTorch operations materialize intermediate tensors (e.g., the full logit matrix for cross-entropy), which is the dominant memory cost for long sequences. Liger Kernel fuses these operations in Triton, computing results without materializing the full intermediate tensor. This is critical for the SmolLM3 SFT recipe, which trains with `max_length=65536`.
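A back-of-envelope calculation makes the logit-materialization cost concrete. The vocabulary size of 128,000 below is an illustrative round number, not taken from the SmolLM3 config:

```python
def logits_bytes(seq_len: int, vocab_size: int, bytes_per_elem: int = 2) -> int:
    """Size of a [seq_len, vocab_size] logit matrix (bf16 = 2 bytes/element)."""
    return seq_len * vocab_size * bytes_per_elem


# Forward pass only, per sequence, at the SFT recipe's max_length.
# At fp32 (4 bytes/element) this figure doubles.
gib = logits_bytes(65536, 128_000) / 2**30
```

At bf16 this single intermediate tensor is over 15 GiB per sequence, which is why a fused cross-entropy kernel that never materializes it matters at this sequence length.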
SmolLM3 SFT config from `recipes/smollm3/sft/sft.yaml:225`:

```yaml
use_liger_kernel: true
```

SmolLM3 mid-training config from `recipes/smollm3/sft/mid.yaml:61`:

```yaml
use_liger_kernel: true
```

SmolLM3 APO-Zero config from `recipes/smollm3/dpo/apo.yaml:65`:

```yaml
use_liger_kernel: true
```

Liger Kernel version requirement from `setup.py:55`:

```python
"liger-kernel>=0.6.0",
```
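Taken together, a minimal memory-lean SFT config might look like the sketch below. Only `use_liger_kernel` is taken from the handbook recipes; the other fields are common TRL `SFTConfig` options with purely illustrative values:

```yaml
# Hypothetical fragment — values illustrative, not copied from the handbook
model_name_or_path: HuggingFaceTB/SmolLM3-3B   # assumed model id
max_length: 65536              # long-sequence SFT, as in the SmolLM3 recipe
use_liger_kernel: true         # fused Triton kernels (requires liger-kernel>=0.6.0)
gradient_checkpointing: true   # trade compute for activation memory
bf16: true
packing: true                  # pack short samples into full-length sequences
```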