
Heuristic:Microsoft BIPIA LLAMA Pad Token Workaround

From Leeroopedia
Knowledge Sources
Domains LLMs, Debugging
Last Updated 2026-02-14 15:00 GMT

Overview

A workaround for LLAMA models that lack a pad token: assign the unknown token (`<unk>`, id=0) as the pad token and use left-side padding for batched inference.

Description

LLAMA-family models (Llama, Alpaca, Vicuna, Baize, Guanaco, etc.) do not define a pad token in their tokenizer configuration. This causes errors during batched inference when sequences of different lengths need to be padded. The BIPIA codebase addresses this by explicitly setting `pad_token = "<unk>"` with `pad_token_id = 0` for LLAMA models, and using a generic `<|padding|>` token with id=1 for other model families (Dolly, StableLM, MPT, Mistral, OASST, RWKV). All models use left-side padding (`padding_side = "left"`) to ensure the generation tokens are properly aligned at the end of the sequence.

Usage

This heuristic is critical when adding new models to the BIPIA benchmark or debugging tokenization errors. If a new model raises errors about missing pad tokens during batched inference, check whether its tokenizer defines a pad token and add the appropriate workaround. The choice of pad token ID matters: using the EOS token as pad token can cause premature generation termination.
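The two failure modes described above (no pad token at all, and pad colliding with EOS) can be checked up front. The helper below is a hypothetical illustration of those checks, not part of the BIPIA codebase:

```python
def check_pad_config(pad_token_id, eos_token_id):
    """Flag pad-token configurations that can break batched generation.

    Hypothetical validation helper; returns a list of problem descriptions
    (empty when the configuration looks safe).
    """
    problems = []
    if pad_token_id is None:
        # No pad token at all: batched inference cannot pad sequences.
        problems.append("no pad token: batched inference will fail to pad")
    elif pad_token_id == eos_token_id:
        # Pad reusing EOS: generation may terminate prematurely.
        problems.append("pad == eos: generation may terminate prematurely")
    return problems
```

Running it against a tokenizer's `pad_token_id` and `eos_token_id` before batched inference surfaces both issues early, instead of as opaque padding errors at generation time.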

The Insight (Rule of Thumb)

  • Action: For LLAMA models, set `pad_token = "<unk>"` and `pad_token_id = 0`. For non-LLAMA models missing a pad token, set `pad_token = "<|padding|>"` and `pad_token_id = 1`.
  • Value: `pad_token_id = 0` (unk) is chosen specifically to be different from the EOS token, preventing generation issues.
  • Trade-off: Using the unk token as padding means the model sees unk tokens in padded positions, but since attention masks exclude these positions, this has no effect on generation quality.
  • Additional: Always set `padding_side = "left"` for causal LM inference to keep generated tokens properly aligned.
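The rule of thumb above can be sketched as a single helper. The `Tok` stand-in class and `apply_pad_workaround` function are illustrative assumptions, not BIPIA API; the real code mutates a HuggingFace tokenizer instead:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tok:
    """Minimal stand-in for a HuggingFace tokenizer's pad fields."""
    pad_token: Optional[str] = None
    pad_token_id: Optional[int] = None
    padding_side: str = "right"


def apply_pad_workaround(tok: Tok, is_llama_family: bool) -> Tok:
    """Apply the heuristic: fill in a missing pad token, then force left padding."""
    if tok.pad_token is None:
        if is_llama_family:
            # LLAMA: reuse <unk> (id 0), deliberately distinct from EOS.
            tok.pad_token, tok.pad_token_id = "<unk>", 0
        else:
            # Other families: generic padding token with id 1.
            tok.pad_token, tok.pad_token_id = "<|padding|>", 1
    tok.padding_side = "left"  # required for causal-LM batched inference
    return tok
```

Note that an existing pad token is left untouched; only the padding side is always overridden.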

Reasoning

As documented in HuggingFace Transformers issue #22312, LLAMA tokenizers deliberately omit a pad token because the original LLAMA training did not use padding. For inference with batching, a pad token is essential. The unk token (id=0) is the safest choice because: (1) it is never generated during normal inference, (2) it is distinct from the EOS token (which would cause premature stopping), and (3) it exists in all LLAMA tokenizer vocabularies. Left-side padding is required for causal language models because generation must start from the last non-padded token.
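Why left padding matters can be seen with plain token-id lists, no `transformers` dependency needed (a pure-Python sketch; `pad_batch` is a hypothetical helper, with pad id 0 as in the LLAMA workaround):

```python
PAD = 0  # <unk> reused as the pad id, per the workaround above


def pad_batch(seqs, side="left"):
    """Pad variable-length token-id lists to a common length."""
    width = max(len(s) for s in seqs)
    if side == "left":
        return [[PAD] * (width - len(s)) + s for s in seqs]
    return [s + [PAD] * (width - len(s)) for s in seqs]


batch = [[5, 6, 7], [8, 9]]

left = pad_batch(batch, side="left")
# With left padding every sequence ends at the same index, so batched
# decoding can append the next token after position -1 for every row.
assert [row[-1] for row in left] == [7, 9]

right = pad_batch(batch, side="right")
# With right padding the shorter sequence's last real token is followed
# by pad ids, so naive continuation would generate after a pad token.
assert right[1][-1] == PAD
```

This is the alignment property the BIPIA code secures with `padding_side = "left"`.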

Code Evidence

LLAMA pad token workaround from `bipia/model/llama.py:27-41`:

class LLAMAModel(vLLMModel):
    def load_tokenizer(self):
        self.tokenizer = AutoTokenizer.from_pretrained(
            self.config["model_name"], use_fast=False
        )
        if self.tokenizer.pad_token is None:
            # LLAMA does not have a pad token (https://github.com/huggingface/transformers/issues/22312)
            self.tokenizer.pad_token = "<unk>"
            self.tokenizer.pad_token_id = (
                0  # unk. we want this to be different from the eos token
            )
        self.tokenizer.padding_side = "left"  # Allow batched inference
        return self.tokenizer

Generic pad token workaround for other models from `bipia/model/vllm_worker.py:112-114`:

if self.tokenizer.pad_token is None:
    self.tokenizer.pad_token = "<|padding|>"
    self.tokenizer.pad_token_id = 1
