# Heuristic: LMSYS FastChat Tokenizer Offset Correction
| Knowledge Sources | |
|---|---|
| Domains | NLP, Debugging |
| Last Updated | 2026-02-07 04:00 GMT |
## Overview
A tokenizer-specific offset correction that subtracts 2 from instruction token lengths for LLaMA tokenizers, plus an extra 1-token adjustment that handles the difference between legacy and non-legacy tokenizer modes.
## Description
When creating training labels for SFT, FastChat must precisely mask user instructions (so loss is only computed on assistant responses). The LLaMA tokenizer adds special tokens (BOS, etc.) that create a consistent 2-token offset between the tokenized length of a substring and its actual position in the full tokenized sequence. An additional 1-token adjustment is needed for non-legacy mode tokenizers starting from the second conversation turn. If these offsets are wrong, the label mask misaligns and training data is silently corrupted.
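The discrepancy can be reproduced with a toy stand-in. This is not the real LLaMA tokenizer; `mock_tokenize`, its special tokens, and the example text are invented purely to illustrate why tokenizing a substring independently overcounts its span by 2:

```python
# Toy stand-in for the LLaMA tokenizer: a whitespace tokenizer that
# prepends a BOS token plus one artifact token whenever a string is
# tokenized on its own. The same text is therefore 2 tokens longer
# standalone than the span it occupies inside the full sequence.
BOS, ARTIFACT = "<s>", "_"

def mock_tokenize(text, add_special=True):
    tokens = text.split()
    if add_special:
        tokens = [BOS, ARTIFACT] + tokens  # the 2 extra tokens
    return tokens

instruction = "USER: What is 1+1? ASSISTANT:"

span_len = len(mock_tokenize(instruction, add_special=False))  # span in full sequence
standalone_len = len(mock_tokenize(instruction))               # what FastChat measures

offset = standalone_len - span_len
print(offset)  # 2 -> hence instruction_len = standalone_len - 2
```

With the real LLaMA tokenizer the extra tokens come from BOS and special-token handling rather than a hand-built list, but the arithmetic the code compensates for is the same.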
## Usage
Use this heuristic when debugging tokenization mismatches in conversation preprocessing or when adapting the training code for new models. The `-2` offset is hardcoded for the LLaMA tokenizer and may need adjustment for other tokenizer families.
## The Insight (Rule of Thumb)
- Action: Subtract 2 from `len(tokenizer(instruction_part).input_ids)` when computing instruction token lengths for label masking in SFT training.
- Value: `-2` for LLaMA tokenizer; additional `-1` for non-legacy tokenizers after the first turn.
- Trade-off: Hardcoded offset is fragile — if the tokenizer changes behavior (e.g., new special tokens), the offset breaks. A safety check discards samples where the computed length doesn't match the actual token count.
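A minimal sketch of how these values plug into label masking. The helper, the toy lengths, and the turn structure are invented for illustration; `instr_lens[i]` stands in for `len(tokenizer(instruction).input_ids)` measured standalone:

```python
IGNORE_TOKEN_ID = -100  # standard Hugging Face ignore index

def mask_instruction_labels(labels, turn_lens, instr_lens, legacy=True):
    """Mask instruction tokens turn by turn. `turn_lens[i]` is the turn's
    token count inside the full sequence; `instr_lens[i]` is the
    standalone tokenized length of its instruction part.
    Illustrative sketch, not FastChat's actual code."""
    cur_len = 1                                   # skip the BOS token
    labels[:cur_len] = [IGNORE_TOKEN_ID] * cur_len
    for i, (turn_len, instr_len) in enumerate(zip(turn_lens, instr_lens)):
        instruction_len = instr_len - 2           # LLaMA-specific offset
        if i != 0 and not legacy:
            instruction_len -= 1                  # non-legacy adjustment
        labels[cur_len:cur_len + instruction_len] = (
            [IGNORE_TOKEN_ID] * instruction_len
        )
        cur_len += turn_len
        if i != 0 and not legacy:
            cur_len -= 1
    return labels, cur_len

# One turn of 10 tokens; the instruction measures 8 standalone,
# so 8 - 2 = 6 positions get masked after the BOS.
labels, cur_len = mask_instruction_labels(list(range(11)), [10], [8])
```

Only the assistant-response positions keep their original label values; everything else is set to `IGNORE_TOKEN_ID` so the loss ignores it.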
## Reasoning
LLaMA's tokenizer prepends a BOS token and may add other special tokens when tokenizing a substring independently versus within a larger sequence. When you tokenize `"Human: What is 1+1?"` separately, you get 2 extra tokens compared to its span within the full conversation. The code compensates:
- Tokenize the instruction part independently
- Subtract 2 to account for BOS and tokenizer artifacts
- For non-legacy tokenizers on turns after the first, subtract an additional 1
The safety net at the end catches any remaining mismatches: if `cur_len != total_len`, the entire sample is masked out (all labels set to `IGNORE_TOKEN_ID`) with a printed warning. This prevents silently training on misaligned labels, which would degrade model quality.
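The discard-on-mismatch behavior can be sketched as a small guard. The helper name is invented; it mirrors the check FastChat performs at the end of preprocessing:

```python
IGNORE_TOKEN_ID = -100

def apply_safety_net(labels, cur_len, total_len):
    """If the running length disagrees with the true token count, discard
    the sample by masking every label rather than training on misaligned
    data. Illustrative helper mirroring FastChat's final check."""
    if cur_len != total_len:
        print(f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}. (ignored)")
        return [IGNORE_TOKEN_ID] * len(labels)
    return labels

kept = apply_safety_net([1, 2, 3], cur_len=3, total_len=3)       # lengths agree: kept
discarded = apply_safety_net([1, 2, 3], cur_len=2, total_len=3)  # mismatch: fully masked
```

Masking the whole sample costs one training example; training on a misaligned one would quietly teach the model to predict instruction text.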
## Code Evidence
Hardcoded `-2` offset from `fastchat/train/train.py:142-143`:

```python
# "-2" is hardcoded for the Llama tokenizer to make the offset correct.
instruction_len = len(tokenizer(parts[0]).input_ids) - 2
```
Non-legacy mode additional offset from `fastchat/train/train.py:145-155`:

```python
if i != 0 and not tokenizer.legacy:
    # The legacy and non-legacy modes handle special tokens differently
    instruction_len -= 1
...
if i != 0 and not tokenizer.legacy:
    # The legacy and non-legacy modes handle special tokens differently
    cur_len -= 1
```
Safety check for mismatches from `fastchat/train/train.py:165-171`:

```python
if cur_len < tokenizer.model_max_length:
    if cur_len != total_len:
        target[:] = IGNORE_TOKEN_ID
        rank0_print(
            f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}."
            f" #turn = {len(turns) - 1}. (ignored)"
        )
```
Debug inspection block from `fastchat/train/train.py:159-163`:

```python
if False:  # Inspect and check the correctness of masking
    z = target.clone()
    z = torch.where(z == IGNORE_TOKEN_ID, tokenizer.unk_token_id, z)
    rank0_print(tokenizer.decode(z))
    exit()
```
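The inspection trick is easy to mimic without torch. Below is a pure-Python analogue with an invented five-entry vocabulary: masked positions are swapped for the unknown-token id before decoding, so only the unmasked (loss-bearing) text reads back as real words:

```python
IGNORE_TOKEN_ID = -100
UNK_ID = 0
# Tiny invented vocabulary for illustration only.
vocab = {0: "<unk>", 1: "USER:", 2: "hi", 3: "ASSISTANT:", 4: "hello"}

# A label vector where the instruction positions have been masked.
target = [IGNORE_TOKEN_ID, IGNORE_TOKEN_ID, IGNORE_TOKEN_ID, 3, 4]

# Same idea as torch.where(z == IGNORE_TOKEN_ID, unk_token_id, z):
z = [UNK_ID if t == IGNORE_TOKEN_ID else t for t in target]
decoded = " ".join(vocab[t] for t in z)
print(decoded)  # masked positions print as <unk>; unmasked text is readable
```

If the offsets are wrong, the decoded string shows response words replaced by `<unk>` (over-masking) or instruction words surviving (under-masking), which is exactly what the debug block is meant to catch.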