Heuristic:Intel Ipex llm Use Cache Training Vs Inference

From Leeroopedia



Knowledge Sources
Domains LLM_Finetuning, Debugging
Last Updated 2026-02-09 12:00 GMT

Overview

Disable KV cache (`use_cache=False`) during training to prevent gradient flow issues; re-enable for inference to accelerate generation.

Description

The KV (key-value) cache stores previously computed attention key and value states to avoid redundant computation during autoregressive generation. It is essential for fast inference but has no role in training: each training step processes the whole sequence in one forward pass, so cached states are never reused, and keeping them only holds extra tensors in memory. Key/value states carried over from an earlier forward pass are also not part of the current step's autograd graph, so relying on them during training would give incorrect gradients. The IPEX-LLM examples explicitly disable the cache before training and include comments reminding users to re-enable it for inference.
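
To make the mechanism concrete, the following is a minimal pure-Python sketch of a single attention head, not the IPEX-LLM or Transformers implementation. The `project` function is a hypothetical stand-in for learned key/value projections, and each token vector doubles as its own query for simplicity. It shows that appending one key/value pair per step gives the same output as recomputing the whole prefix.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Softmax attention of one query over a list of key/value vectors."""
    scores = [dot(query, k) / math.sqrt(len(query)) for k in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def project(token):
    """Hypothetical stand-in for the learned K/V projections of one token."""
    key = [0.5 * x for x in token]
    value = [x + 1.0 for x in token]
    return key, value


def last_output_full(tokens):
    """No cache: re-project every token in the prefix, then attend."""
    keys, values = zip(*(project(t) for t in tokens))
    return attend(tokens[-1], list(keys), list(values))


class KVCache:
    """Keeps the keys and values of all tokens seen so far."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, token):
        # Only the newest token is projected; older entries are reused.
        key, value = project(token)
        self.keys.append(key)
        self.values.append(value)
        return attend(token, self.keys, self.values)


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = KVCache()
cached_outputs = [cache.step(t) for t in tokens]
full_outputs = [last_output_full(tokens[: i + 1]) for i in range(len(tokens))]
```

Both lists hold the same per-step outputs; the cached path simply avoids re-projecting earlier tokens.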

Usage

Use this heuristic before every training run (QLoRA, LoRA, or DPO). Disable `use_cache` after loading the model and before calling `trainer.train()`, then re-enable it when switching to inference. The rule matters most when gradient checkpointing is enabled: Hugging Face Transformers treats `use_cache=True` as incompatible with gradient checkpointing and overrides it with a warning.

The Insight (Rule of Thumb)

  • Action: Set `model.config.use_cache = False` before training begins.
  • Action: Re-enable `model.config.use_cache = True` before inference/generation.
  • Value: Boolean flag on model config.
  • Trade-off: Disabling the cache costs nothing in training step speed, because the full sequence is processed in a single forward pass. Inference without the cache is significantly slower (no KV reuse).
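
One way to avoid forgetting the second action is to scope the change. The sketch below is an illustration, not part of the IPEX-LLM examples: `use_cache_disabled` is a hypothetical helper, and a `SimpleNamespace` stands in for a loaded model so the snippet runs without any model weights. It assumes only that the model exposes `model.config.use_cache`.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def use_cache_disabled(model):
    """Temporarily set model.config.use_cache to False, then restore it."""
    previous = model.config.use_cache
    model.config.use_cache = False
    try:
        yield model
    finally:
        # Restore the original value even if training raises.
        model.config.use_cache = previous


# Stand-in object (assumption) so the sketch runs without a real model.
model = SimpleNamespace(config=SimpleNamespace(use_cache=True))

with use_cache_disabled(model):
    # A call such as trainer.train() would go here.
    flag_during_training = model.config.use_cache

flag_after_training = model.config.use_cache
```

With a real model, the `with` block would wrap the training call, and the flag returns to its previous value before generation.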

Reasoning

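During autoregressive training, the model processes the full sequence at once (teacher forcing), so there is nothing to gain from caching previous positions; every key and value is computed exactly once per step anyway. Keeping `use_cache=True` only makes the model build and return key/value states that are never reused, and any states carried over from an earlier forward pass would sit outside the current step's autograd graph, so gradients would not flow through them correctly. For inference the situation is reversed: without the cache each new token requires re-processing the whole prefix, while with the cache only the new token is processed, reducing the attention cost per generated token from O(n^2) to O(n) in the sequence length n.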
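
The asymmetry can be illustrated by counting how many per-token key/value projections each mode performs. This is a simplified cost model written for this page, not a measurement: it ignores prompt prefill, the attention score computation itself, and everything outside the K/V projections. All function names are hypothetical.

```python
def kv_projections_without_cache(num_tokens):
    """Each decoding step re-projects every token in the current prefix."""
    return sum(step for step in range(1, num_tokens + 1))


def kv_projections_with_cache(num_tokens):
    """Each decoding step projects only the newest token and reuses the rest."""
    return num_tokens


def training_forward_projections(seq_len):
    """Teacher forcing: one forward pass projects each position exactly once."""
    return seq_len


n = 128
without_cache = kv_projections_without_cache(n)  # 8256, grows as n * (n + 1) / 2
with_cache = kv_projections_with_cache(n)        # 128, grows linearly
training = training_forward_projections(n)       # 128, the cache cannot reduce this
```

Under this model, generation benefits from the cache by a factor that grows with sequence length, while a training forward pass already does the minimum amount of projection work.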

Code Evidence

Disabling cache before training from `alpaca_qlora_finetuning.py:276`:

model.config.use_cache = False

Same in LoRA training from `alpaca_lora_finetuning.py:263`:

model.config.use_cache = False

Same in DPO training from `dpo_finetuning.py:134`:

model.config.use_cache = False

Enabling cache for pipeline parallel inference from `generate.py:51`:

model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_low_bit=low_bit,
                                             optimize_model=True,
                                             trust_remote_code=True,
                                             use_cache=True,
                                             torch_dtype=torch.float16,
                                             pipeline_parallel_stages=args.gpu_num)

Missing keys warning (benign after LoRA training) from `alpaca_qlora_finetuning.py:282-284`:

print(
    "\n If there's a warning about missing keys above, please disregard :)"
)
