# Heuristic: LMSYS FastChat Tokenizer Offset Correction
| Knowledge Sources | |
|---|---|
| Domains | NLP, Debugging |
| Last Updated | 2026-02-07 04:00 GMT |
## Overview
A tokenizer-specific offset correction that subtracts 2 from instruction token lengths for LLaMA tokenizers, plus an extra 1-token adjustment that handles the difference between legacy and non-legacy tokenizer modes.
## Description
When creating training labels for SFT, FastChat must precisely mask user instructions (so loss is only computed on assistant responses). The LLaMA tokenizer adds special tokens (BOS, etc.) that create a consistent 2-token offset between the tokenized length of a substring and its actual position in the full tokenized sequence. An additional 1-token adjustment is needed for non-legacy mode tokenizers starting from the second conversation turn. If these offsets are wrong, the label mask misaligns and training data is silently corrupted.
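The discrepancy can be reproduced with a toy stand-in. This is not the real LLaMA tokenizer; `mock_tokenize`, its special tokens, and the example text are invented purely to illustrate why tokenizing a substring independently overcounts its span by 2:

```python
# Toy stand-in for the LLaMA tokenizer: a whitespace tokenizer that
# prepends a BOS token plus one artifact token whenever a string is
# tokenized on its own. The same text is therefore 2 tokens longer
# standalone than the span it occupies inside the full sequence.
BOS, ARTIFACT = "<s>", "_"

def mock_tokenize(text, add_special=True):
    tokens = text.split()
    if add_special:
        tokens = [BOS, ARTIFACT] + tokens  # the 2 extra tokens
    return tokens

instruction = "USER: What is 1+1? ASSISTANT:"

span_len = len(mock_tokenize(instruction, add_special=False))  # span in full sequence
standalone_len = len(mock_tokenize(instruction))               # what FastChat measures

offset = standalone_len - span_len
print(offset)  # 2 -> hence instruction_len = standalone_len - 2
```

With the real LLaMA tokenizer the extra tokens come from BOS and special-token handling rather than a hand-built list, but the arithmetic the code compensates for is the same.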
## Usage
Use this heuristic when debugging tokenization mismatches in conversation preprocessing or when adapting the training code for new models. The `-2` offset is hardcoded for the LLaMA tokenizer and may need adjustment for other tokenizer families.
## The Insight (Rule of Thumb)
- Action: Subtract 2 from `len(tokenizer(instruction_part).input_ids)` when computing instruction token lengths for label masking in SFT training.
- Value: `-2` for LLaMA tokenizer; additional `-1` for non-legacy tokenizers after the first turn.
- Trade-off: Hardcoded offset is fragile — if the tokenizer changes behavior (e.g., new special tokens), the offset breaks. A safety check discards samples where the computed length doesn't match the actual token count.
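A minimal sketch of how these values plug into label masking. The helper, the toy lengths, and the turn structure are invented for illustration; `instr_lens[i]` stands in for `len(tokenizer(instruction).input_ids)` measured standalone:

```python
IGNORE_TOKEN_ID = -100  # standard Hugging Face ignore index

def mask_instruction_labels(labels, turn_lens, instr_lens, legacy=True):
    """Mask instruction tokens turn by turn. `turn_lens[i]` is the turn's
    token count inside the full sequence; `instr_lens[i]` is the
    standalone tokenized length of its instruction part.
    Illustrative sketch, not FastChat's actual code."""
    cur_len = 1                                   # skip the BOS token
    labels[:cur_len] = [IGNORE_TOKEN_ID] * cur_len
    for i, (turn_len, instr_len) in enumerate(zip(turn_lens, instr_lens)):
        instruction_len = instr_len - 2           # LLaMA-specific offset
        if i != 0 and not legacy:
            instruction_len -= 1                  # non-legacy adjustment
        labels[cur_len:cur_len + instruction_len] = (
            [IGNORE_TOKEN_ID] * instruction_len
        )
        cur_len += turn_len
        if i != 0 and not legacy:
            cur_len -= 1
    return labels, cur_len

# One turn of 10 tokens; the instruction measures 8 standalone,
# so 8 - 2 = 6 positions get masked after the BOS.
labels, cur_len = mask_instruction_labels(list(range(11)), [10], [8])
```

Only the assistant-response positions keep their original label values; everything else is set to `IGNORE_TOKEN_ID` so the loss ignores it.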
## Reasoning
LLaMA's tokenizer prepends a BOS token and may add other special tokens when tokenizing a substring independently versus within a larger sequence. When you tokenize `"Human: What is 1+1?"` separately, you get 2 extra tokens compared to its span within the full conversation. The code compensates:
- Tokenize the instruction part independently
- Subtract 2 to account for BOS and tokenizer artifacts
- For non-legacy tokenizers on turns after the first, subtract an additional 1
The safety net at the end catches any remaining mismatches: if `cur_len != total_len`, the entire sample is masked out (all labels set to `IGNORE_TOKEN_ID`) with a printed warning. This prevents silently training on misaligned labels, which would degrade model quality.
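The discard-on-mismatch behavior can be sketched as a small guard. The helper name is invented; it mirrors the check FastChat performs at the end of preprocessing:

```python
IGNORE_TOKEN_ID = -100

def apply_safety_net(labels, cur_len, total_len):
    """If the running length disagrees with the true token count, discard
    the sample by masking every label rather than training on misaligned
    data. Illustrative helper mirroring FastChat's final check."""
    if cur_len != total_len:
        print(f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}. (ignored)")
        return [IGNORE_TOKEN_ID] * len(labels)
    return labels

kept = apply_safety_net([1, 2, 3], cur_len=3, total_len=3)       # lengths agree: kept
discarded = apply_safety_net([1, 2, 3], cur_len=2, total_len=3)  # mismatch: fully masked
```

Masking the whole sample costs one training example; training on a misaligned one would quietly teach the model to predict instruction text.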
## Code Evidence
Hardcoded `-2` offset from `fastchat/train/train.py:142-143`:

```python
# "-2" is hardcoded for the Llama tokenizer to make the offset correct.
instruction_len = len(tokenizer(parts[0]).input_ids) - 2
```
Non-legacy mode additional offset from `fastchat/train/train.py:145-155`:

```python
if i != 0 and not tokenizer.legacy:
    # The legacy and non-legacy modes handle special tokens differently
    instruction_len -= 1
...
if i != 0 and not tokenizer.legacy:
    # The legacy and non-legacy modes handle special tokens differently
    cur_len -= 1
```
Safety check for mismatches from `fastchat/train/train.py:165-171`:

```python
if cur_len < tokenizer.model_max_length:
    if cur_len != total_len:
        target[:] = IGNORE_TOKEN_ID
        rank0_print(
            f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}."
            f" #turn = {len(turns) - 1}. (ignored)"
        )
```
Debug inspection block from `fastchat/train/train.py:159-163`:

```python
if False:  # Inspect and check the correctness of masking
    z = target.clone()
    z = torch.where(z == IGNORE_TOKEN_ID, tokenizer.unk_token_id, z)
    rank0_print(tokenizer.decode(z))
    exit()
```
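The inspection trick is easy to mimic without torch. Below is a pure-Python analogue with an invented five-entry vocabulary: masked positions are swapped for the unknown-token id before decoding, so only the unmasked (loss-bearing) text reads back as real words:

```python
IGNORE_TOKEN_ID = -100
UNK_ID = 0
# Tiny invented vocabulary for illustration only.
vocab = {0: "<unk>", 1: "USER:", 2: "hi", 3: "ASSISTANT:", 4: "hello"}

# A label vector where the instruction positions have been masked.
target = [IGNORE_TOKEN_ID, IGNORE_TOKEN_ID, IGNORE_TOKEN_ID, 3, 4]

# Same idea as torch.where(z == IGNORE_TOKEN_ID, unk_token_id, z):
z = [UNK_ID if t == IGNORE_TOKEN_ID else t for t in target]
decoded = " ".join(vocab[t] for t in z)
print(decoded)  # masked positions print as <unk>; unmasked text is readable
```

If the offsets are wrong, the decoded string shows response words replaced by `<unk>` (over-masking) or instruction words surviving (under-masking), which is exactly what the debug block is meant to catch.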