Heuristic:Intel Ipex llm Llama Padding Token Workaround

From Leeroopedia




Knowledge Sources
Domains: LLM_Finetuning, Debugging
Last Updated: 2026-02-09 12:00 GMT

Overview

Llama-family tokenizers ship without a padding token. Set `pad_token = eos_token` to prevent batched training failures, and keep `group_by_length=False` so that loss curves stay interpretable.

Description

The tokenizers of Meta's Llama family (Llama 2, Llama 3, Code Llama) do not define a dedicated padding token. Batched training then fails, because sequences cannot be padded to equal length without one. The standard workaround used across IPEX-LLM examples is to assign the end-of-sequence (EOS) token as the padding token. Separately, `group_by_length` (which batches sequences of similar length together to reduce padding) is left at its default of `False` in the example scripts, because enabling it produces irregular loss curves that make training problems harder to diagnose.

Usage

Use this heuristic when finetuning any Llama-family model (Llama 2, Llama 3, Code Llama, etc.) with IPEX-LLM. Apply the pad token fix immediately after tokenizer loading. Consider keeping `group_by_length=False` unless training speed is the top priority.
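The fix can be wrapped in a small helper so that it is applied once, right after the tokenizer is created. The following is a minimal sketch, not code from IPEX-LLM: `ensure_pad_token` is a hypothetical name, and a `SimpleNamespace` stands in for a real tokenizer so the example runs without downloading a model.

```python
from types import SimpleNamespace


def ensure_pad_token(tokenizer, padding_side=None):
    """Reuse EOS as the pad token when the tokenizer defines none.

    Hypothetical helper (not part of IPEX-LLM or transformers). It only
    reads and writes the `pad_token`, `eos_token`, and `padding_side`
    attributes of the object it is given.
    """
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    if padding_side is not None:
        tokenizer.padding_side = padding_side
    return tokenizer


# Stand-in for a freshly loaded Llama tokenizer (assumption: no pad token set).
llama_like = SimpleNamespace(pad_token=None, eos_token="</s>", padding_side="right")

ensure_pad_token(llama_like)
print(llama_like.pad_token)      # </s>

# DPO-style setup: same fix plus left padding.
ensure_pad_token(llama_like, padding_side="left")
print(llama_like.padding_side)   # left
```

With a real tokenizer, the same helper would presumably be called on the object returned by `AutoTokenizer.from_pretrained(...)`, before any dataset tokenization or collator construction.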

The Insight (Rule of Thumb)

  • Action: After loading tokenizer, check and set: `if tokenizer.pad_token is None: tokenizer.pad_token = tokenizer.eos_token`
  • Action: For DPO, also set `tokenizer.padding_side = "left"` so that padding precedes the real tokens and each row ends with real content, as decoder-only generation expects.
  • Action: Keep `group_by_length=False` (default) for interpretable loss curves.
  • Trade-off: `group_by_length=True` is faster (less padding waste) but produces irregular loss curves that make debugging harder.
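The padding side of this trade-off can be made concrete with a toy calculation. The sketch below is an illustration under simplifying assumptions: the sequence lengths are invented, `padding_waste` is a hypothetical helper, and grouping is modeled as a plain sort, which is cruder than the sampler a real trainer uses.

```python
def padding_waste(lengths, batch_size, group_by_length):
    """Count pad tokens needed when each batch is padded to its longest sequence."""
    # Simplification: "grouping" is modeled as a full sort by length.
    order = sorted(lengths) if group_by_length else list(lengths)
    wasted = 0
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        wasted += sum(max(batch) - n for n in batch)
    return wasted


# Hypothetical sequence lengths, in dataset order.
lengths = [12, 480, 15, 512, 20, 450, 18, 500]

print(padding_waste(lengths, batch_size=2, group_by_length=False))  # 1877
print(padding_waste(lengths, batch_size=2, group_by_length=True))   # 47
```

Grouping removes most of the padding here, but each batch then holds only short or only long sequences. That homogeneity is one plausible reason the per-step loss becomes harder to read when the option is enabled.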

Reasoning

Llama tokenizers do not include a pad token. During batch training, the data collator must pad shorter sequences, which requires a pad token ID. Using the EOS token as padding is a widely accepted workaround that works because: (1) the attention mask zeros out padded positions, and (2) padded positions do not affect the loss, provided the collator sets their labels to the ignore index (`-100`). The second condition should be applied by position rather than by token ID. If every label equal to the pad ID is masked, the genuine EOS at the end of each sequence is masked too, and the model may not learn when to stop. For DPO, left-side padding right-aligns the real tokens so that each row ends with real content, which is the layout the DPO example relies on.
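The two conditions above can be shown with a hand-written padding routine. This is a minimal sketch in plain Python, not the collator used by the IPEX-LLM examples: `pad_batch` is a hypothetical function, and the EOS ID of `2` is an assumed value (the real one comes from `tokenizer.eos_token_id`).

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def pad_batch(sequences, pad_id, padding_side="right"):
    """Pad token-id lists to equal length; return input_ids, attention_mask, labels."""
    width = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        seq = list(seq)
        n_pad = width - len(seq)
        pad_ids = [pad_id] * n_pad
        pad_mask = [0] * n_pad                # padded positions are not attended to
        pad_labels = [IGNORE_INDEX] * n_pad   # padded positions are excluded from the loss
        real_mask = [1] * len(seq)
        if padding_side == "left":
            batch["input_ids"].append(pad_ids + seq)
            batch["attention_mask"].append(pad_mask + real_mask)
            batch["labels"].append(pad_labels + seq)
        else:
            batch["input_ids"].append(seq + pad_ids)
            batch["attention_mask"].append(real_mask + pad_mask)
            batch["labels"].append(seq + pad_labels)
    return batch


EOS_ID = 2  # assumed EOS id, used here as the pad id as well
batch = pad_batch([[5, 6, 7, EOS_ID], [8, EOS_ID]], pad_id=EOS_ID)
print(batch["input_ids"][1])       # [8, 2, 2, 2]
print(batch["attention_mask"][1])  # [1, 1, 0, 0]
print(batch["labels"][1])          # [8, 2, -100, -100]
```

Because masking is done by position, the genuine EOS at index 1 keeps its label while the two padding copies are ignored. Passing `padding_side="left"` moves the padding in front of the real tokens, matching the DPO setup.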

Code Evidence

Pad token fix from `alpaca_qlora_finetuning.py:204-206`:

# For Llama family
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

DPO padding setup from `dpo_finetuning.py:95-96`:

tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

group_by_length default from `alpaca_qlora_finetuning.py:99`:

group_by_length: bool = False,  # faster, but produces an odd training loss curve
