Heuristic:Intel Ipex llm Llama Padding Token Workaround
| Knowledge Sources | |
|---|---|
| Domains | LLM_Finetuning, Debugging |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
Llama-family models lack a native padding token; set `pad_token = eos_token` to prevent batched training failures, and `group_by_length=False` to keep loss curves interpretable.
Description
Meta's Llama family of models (Llama 2, Llama 3, Code Llama) does not define a dedicated padding token in its tokenizer vocabulary. This causes errors when batched training has to pad sequences to equal length. The standard workaround used across IPEX-LLM examples is to assign the end-of-sequence (EOS) token as the padding token. Additionally, `group_by_length` (which batches sequences of similar length together to reduce padding) is disabled by default because it produces irregular loss curves that make training problems harder to diagnose.
Usage
Use this heuristic when finetuning any Llama-family model (Llama 2, Llama 3, Code Llama, etc.) with IPEX-LLM. Apply the pad token fix immediately after tokenizer loading. Consider keeping `group_by_length=False` unless training speed is the top priority.
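As a minimal sketch of the fix described above, the check can be wrapped in a small helper and called right after the tokenizer is loaded. `ensure_pad_token` is a hypothetical name, not an IPEX-LLM or `transformers` function, and the `SimpleNamespace` objects are stand-ins for a real tokenizer so that the example runs without downloading a model.

```python
from types import SimpleNamespace


def ensure_pad_token(tokenizer, padding_side=None):
    """Assign EOS as the pad token when the tokenizer defines none.

    Hypothetical helper for illustration; it only touches the
    `pad_token`, `eos_token`, and `padding_side` attributes.
    """
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    if padding_side is not None:
        tokenizer.padding_side = padding_side
    return tokenizer


# Stand-in for a freshly loaded Llama tokenizer with no pad token.
sft_tokenizer = SimpleNamespace(pad_token=None, eos_token="</s>", padding_side="right")
ensure_pad_token(sft_tokenizer)

# DPO variant: same fix, plus left-side padding.
dpo_tokenizer = SimpleNamespace(pad_token=None, eos_token="</s>", padding_side="right")
ensure_pad_token(dpo_tokenizer, padding_side="left")

# A tokenizer that already has a pad token is left unchanged.
other_tokenizer = SimpleNamespace(pad_token="<pad>", eos_token="</s>", padding_side="right")
ensure_pad_token(other_tokenizer)
```

With a real Hugging Face tokenizer, the same helper would be called on the object returned by the tokenizer's `from_pretrained` call, before any dataset tokenization or collator setup.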
The Insight (Rule of Thumb)
- Action: After loading the tokenizer, check for a missing pad token and set it: `if tokenizer.pad_token is None: tokenizer.pad_token = tokenizer.eos_token`
- Action: For DPO, also set `tokenizer.padding_side = "left"` to align with causal LM generation direction.
- Action: Keep `group_by_length=False` (default) for interpretable loss curves.
- Trade-off: `group_by_length=True` is faster (less padding waste) but produces irregular loss curves that make debugging harder.
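The efficiency side of this trade-off can be illustrated with a small calculation. The sketch below counts how many pad tokens are needed when each batch is padded to its longest sequence, first with lengths in arbitrary order and then with similar lengths batched together. The lengths are made up for illustration, `count_padding` is a hypothetical helper, and a plain sort is a simplification of what a length-grouping sampler actually does.

```python
def count_padding(lengths, batch_size):
    """Total pad tokens when every batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)
        total += sum(longest - n for n in batch)
    return total


# Illustrative token counts: a mix of short and long examples.
lengths = [12, 480, 15, 512, 20, 450, 18, 500]

mixed_padding = count_padding(lengths, batch_size=2)
grouped_padding = count_padding(sorted(lengths), batch_size=2)

print(mixed_padding)    # 1877
print(grouped_padding)  # 47
```

Grouping removes most of the wasted padding in this toy case, which is the speed-up `group_by_length=True` is after. The cost noted above is that batches stop being a uniform mix of lengths, so the per-step loss becomes harder to read.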
Reasoning
Llama tokenizers were designed for single-sequence inference and do not include a pad token. During batch training, the data collator must pad shorter sequences to the length of the longest one in the batch, which requires a pad token ID. Using the EOS token as padding is a widely accepted workaround that works because: (1) the attention mask zeros out padded positions, and (2) EOS tokens in padded positions do not affect loss computation, provided the collator sets the labels at those positions to the ignore index (`-100`). For DPO, left-side padding ensures the model's generation outputs are right-aligned, which is necessary for correct loss computation on preference pairs.
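The two conditions above can be made concrete with a hand-rolled padding routine. This is a sketch of the idea, not the collator used by the IPEX-LLM examples: `pad_batch` is a hypothetical function, the token ids are invented, and `EOS_ID = 2` is an assumed value chosen for illustration.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's CrossEntropyLoss by default
EOS_ID = 2           # assumed EOS id, reused here as the pad id


def pad_batch(sequences, pad_id, padding_side="right"):
    """Pad token-id lists to equal length and build mask and labels."""
    width = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        n_pad = width - len(seq)
        pad_ids = [pad_id] * n_pad
        pad_mask = [0] * n_pad                 # padded positions are not attended to
        pad_labels = [IGNORE_INDEX] * n_pad    # padded positions are excluded from the loss
        real_mask = [1] * len(seq)
        if padding_side == "left":
            ids = pad_ids + list(seq)
            mask = pad_mask + real_mask
            labels = pad_labels + list(seq)
        else:
            ids = list(seq) + pad_ids
            mask = real_mask + pad_mask
            labels = list(seq) + pad_labels
        batch["input_ids"].append(ids)
        batch["attention_mask"].append(mask)
        batch["labels"].append(labels)
    return batch


sequences = [[5, 6, 7, EOS_ID], [8, EOS_ID]]
right_padded = pad_batch(sequences, pad_id=EOS_ID)
left_padded = pad_batch(sequences, pad_id=EOS_ID, padding_side="left")

print(right_padded["input_ids"][1])       # [8, 2, 2, 2]
print(right_padded["attention_mask"][1])  # [1, 1, 0, 0]
print(right_padded["labels"][1])          # [8, 2, -100, -100]
print(left_padded["input_ids"][1])        # [2, 2, 8, 2]
```

Note that the genuine EOS at the end of each sequence keeps its label here, while the padded copies are ignored. A collator that instead masks every label equal to the pad id would also hide the genuine EOS from the loss, which is worth checking when adopting this workaround.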
Code Evidence
Pad token fix from `alpaca_qlora_finetuning.py:204-206`:
# For Llama family
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
DPO padding setup from `dpo_finetuning.py:95-96`:
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
group_by_length default from `alpaca_qlora_finetuning.py:99`:
group_by_length: bool = False, # faster, but produces an odd training loss curve