Heuristic:Intel Ipex llm Use Cache Training Vs Inference
| Knowledge Sources | |
|---|---|
| Domains | LLM_Finetuning, Debugging |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
Disable the KV cache (`use_cache=False`) during training, where it gives no speed-up, holds extra memory, and conflicts with gradient checkpointing; re-enable it for inference to accelerate generation.
Description
The KV (key-value) cache stores previously computed attention key and value states so that autoregressive generation does not recompute them for every new token. It is essential for fast inference but serves no purpose during training: the full sequence is processed in one forward pass, the returned cache only consumes memory, and cache-handling code paths are written for inference and are not guaranteed to behave correctly under gradient computation. The IPEX-LLM examples explicitly disable the cache before training and include comments reminding users to re-enable it for inference.
Usage
Use this heuristic before every training run (QLoRA, LoRA, or DPO). Disable `use_cache` after loading the model and before calling `trainer.train()`, then re-enable it when switching to inference. It matters most when gradient checkpointing is enabled: Hugging Face Transformers models treat `use_cache=True` as incompatible with gradient checkpointing and typically log a warning before forcing the flag off.
The Insight (Rule of Thumb)
- Action: Set `model.config.use_cache = False` before training begins.
- Action: Re-enable `model.config.use_cache = True` before inference/generation.
- Value: Boolean flag on model config.
- Trade-off: Disabling the cache costs nothing during training, because the full sequence is processed in one forward pass and nothing is reused. Leaving it disabled for inference makes generation significantly slower (no KV reuse).
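The two actions above can be wrapped in a small helper so that the flag is always restored once training finishes. This is a minimal sketch, not code from the IPEX-LLM examples: `cache_disabled` is a hypothetical helper name, and `SimpleNamespace` stands in for a loaded model so the snippet runs without any weights. It assumes only that the model exposes a mutable `config.use_cache` attribute.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def cache_disabled(model):
    """Temporarily set model.config.use_cache to False (hypothetical helper)."""
    previous = getattr(model.config, "use_cache", True)
    model.config.use_cache = False
    try:
        yield model
    finally:
        # Restore the earlier value even if training raises an exception.
        model.config.use_cache = previous


# Stand-in for a loaded model (assumption); a real model object would be used in practice.
model = SimpleNamespace(config=SimpleNamespace(use_cache=True))

with cache_disabled(model):
    # trainer.train() would be called here.
    flag_during_training = model.config.use_cache

flag_after_training = model.config.use_cache
print(flag_during_training, flag_after_training)  # prints: False True
```

Restoring the previous value, rather than hard-coding `True`, keeps the helper safe for models that were loaded with the cache already disabled.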
Reasoning
During training, the model processes the full sequence in a single forward pass (teacher forcing), so there are no earlier decoding steps whose keys and values could be reused. Returning the cache anyway only keeps extra tensors in memory, and cache-handling code is written for inference, so it is not guaranteed to interact correctly with autograd or with gradient checkpointing. Setting `use_cache=False` removes these risks at no cost. For inference, the cache avoids recomputing keys and values for the whole prefix at every step, which reduces the attention cost of each new token from O(n^2) to O(n) in the sequence length n.
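The inference-side saving can be illustrated with a toy decoder. The sketch below is an assumption-laden illustration in pure Python, not IPEX-LLM or Transformers code: tokens are scalars, `project`, `attend`, and `decode` are hypothetical names, and the "cost" counts only how many token positions have their key and value projected. It shows that the cached and uncached paths produce the same outputs while the uncached path repeats work for the whole prefix at every step.

```python
import math


def project(x, w):
    """Toy key/value projection: scale a scalar token embedding by a weight."""
    return x * w


def attend(query, keys, values):
    """Single-head softmax attention over scalar keys and values."""
    scores = [query * k for k in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def decode(tokens, use_cache, w_k=0.5, w_v=2.0):
    """Run one attention step per token; return outputs and projection count."""
    outputs, projections = [], 0
    cached_keys, cached_values = [], []
    for step in range(1, len(tokens) + 1):
        if use_cache:
            # Project only the newest token and append it to the cache.
            newest = tokens[step - 1]
            cached_keys.append(project(newest, w_k))
            cached_values.append(project(newest, w_v))
            projections += 1
            keys, values = cached_keys, cached_values
        else:
            # Recompute keys and values for the whole prefix at every step.
            prefix = tokens[:step]
            keys = [project(t, w_k) for t in prefix]
            values = [project(t, w_v) for t in prefix]
            projections += step
        outputs.append(attend(tokens[step - 1], keys, values))
    return outputs, projections


tokens = [0.1, 0.4, -0.3, 0.8, 0.2, -0.6]
out_cached, cost_cached = decode(tokens, use_cache=True)
out_plain, cost_plain = decode(tokens, use_cache=False)
same_outputs = all(math.isclose(a, b) for a, b in zip(out_cached, out_plain))
print(same_outputs, cost_cached, cost_plain)  # prints: True 6 21
```

For six tokens the cached path projects 6 positions and the uncached path 1 + 2 + ... + 6 = 21, which is the linear versus quadratic growth described above. A training step corresponds to a single pass over all positions, so there is nothing for a cache to save.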
Code Evidence
Disabling cache before training from `alpaca_qlora_finetuning.py:276`:
model.config.use_cache = False
Same in LoRA training from `alpaca_lora_finetuning.py:263`:
model.config.use_cache = False
Same in DPO training from `dpo_finetuning.py:134`:
model.config.use_cache = False
Enabling cache for pipeline parallel inference from `generate.py:51`:
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_low_bit=low_bit,
                                             optimize_model=True,
                                             trust_remote_code=True,
                                             use_cache=True,
                                             torch_dtype=torch.float16,
                                             pipeline_parallel_stages=args.gpu_num)
Missing keys warning (benign after LoRA training) from `alpaca_qlora_finetuning.py:282-284`:
print(
    "\n If there's a warning about missing keys above, please disregard :)"
)
Related Pages
- Implementation:Intel_Ipex_llm_Transformers_Trainer_QLoRA
- Implementation:Intel_Ipex_llm_Transformers_Trainer_LoRA
- Implementation:Intel_Ipex_llm_DPOTrainer_Usage
- Implementation:Intel_Ipex_llm_Model_Generate_PP
- Principle:Intel_Ipex_llm_Training_With_HF_Trainer_QLoRA
- Principle:Intel_Ipex_llm_Training_With_HF_Trainer_LoRA
- Principle:Intel_Ipex_llm_Pipeline_Parallel_Generation