# Heuristic: Haotian Liu LLaVA Tokenizer Version Offset Correction
| Knowledge Sources | |
|---|---|
| Domains | NLP, Debugging |
| Last Updated | 2026-02-13 23:00 GMT |
## Overview
Apply a token offset correction of -1 when using `tokenizers >= 0.14` with non-legacy tokenizers to prevent tokenization mismatch warnings and silent label corruption during training.
## Description
The HuggingFace `tokenizers` library changed its behavior starting in version 0.14: non-legacy tokenizers add an extra token at the beginning of tokenized sequences. LLaVA explicitly detects this version boundary with `IS_TOKENIZER_GREATER_THAN_0_14` and applies a -1 offset correction to `round_len` and `instruction_len` during conversation tokenization for the v1 and MPT preprocessing paths. Without this correction, the label masking logic would be off by one token, causing either silent training degradation or explicit "tokenization mismatch" warnings.
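The off-by-one can be seen in a minimal, self-contained simulation. The `fake_tokenize` helper below is hypothetical, standing in for a non-legacy tokenizer on `tokenizers >= 0.14` that prepends a BOS-like token to every call:

```python
# Hypothetical stand-in for a non-legacy tokenizer (tokenizers >= 0.14):
# every call prepends one BOS-like token to the result.
def fake_tokenize(text):
    return ["<bos>"] + text.split()

rounds = ["USER: hi ASSISTANT: hello", "USER: bye ASSISTANT: goodbye"]
full = " ".join(rounds)

total_len = len(fake_tokenize(full))                  # one BOS for the whole text
round_lens = [len(fake_tokenize(r)) for r in rounds]  # one BOS per round

# Summing per-round lengths overshoots by (number of rounds - 1) ...
assert sum(round_lens) == total_len + len(rounds) - 1
# ... so the correction subtracts 1 from every round after the first.
corrected = [n - 1 if i != 0 else n for i, n in enumerate(round_lens)]
assert sum(corrected) == total_len
```

Because the full conversation is tokenized once (one extra token total) while each round is tokenized separately (one extra token per round), the surplus is exactly one token per non-first round.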
## Usage
Apply this heuristic when modifying any tokenization or label masking code in LLaVA. The version check is automatic in the existing code, but custom tokenization code must also account for this offset. The correction applies specifically to multi-round conversations where individual rounds are tokenized separately.
## The Insight (Rule of Thumb)
- Action: Check `IS_TOKENIZER_GREATER_THAN_0_14` before computing round-level token lengths. Apply -1 offset to both `round_len` and `instruction_len` for non-first rounds when using `tokenizers >= 0.14` with non-legacy tokenizers.
- Value: Prevents off-by-one errors in label masking, ensuring correct loss computation.
- Trade-off: None — this is a correctness fix, not a trade-off.
- Symptom: Without correction, training will print `WARNING: tokenization mismatch: {cur_len} vs. {total_len}. (ignored)` and set all labels to `IGNORE_INDEX`, effectively discarding that training sample.
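The rule above can be sketched as a small helper. The function name `corrected_lengths` is illustrative, not LLaVA's API; the legacy flag and version flag mirror `tokenizer.legacy` and `IS_TOKENIZER_GREATER_THAN_0_14` from `train.py`:

```python
def corrected_lengths(round_len, instruction_len, round_index,
                      tokenizer_legacy, is_tokenizer_ge_0_14):
    """Apply the v1-path offset: -1 to both lengths for every non-first
    round when using a non-legacy tokenizer with tokenizers >= 0.14."""
    if round_index != 0 and not tokenizer_legacy and is_tokenizer_ge_0_14:
        round_len -= 1
        instruction_len -= 1
    return round_len, instruction_len

# First round: no correction.
assert corrected_lengths(10, 6, 0, False, True) == (10, 6)
# Later rounds, non-legacy tokenizer, tokenizers >= 0.14: -1 each.
assert corrected_lengths(10, 6, 1, False, True) == (9, 5)
# Legacy tokenizer (or tokenizers < 0.14): unchanged.
assert corrected_lengths(10, 6, 1, True, True) == (10, 6)
```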
## Reasoning
The tokenizer behavior change in v0.14 means that when tokenizing substrings individually (as done for multi-round conversations), each tokenized round gets an extra BOS-like token. When these are concatenated during label masking, the offsets accumulate incorrectly. The fix accounts for this by subtracting 1 from the computed lengths for rounds after the first one. The code also handles the inverse case for MPT models where `tokenizer.legacy=True` causes the opposite behavior.
The fallback warning mechanism (`target[:] = IGNORE_INDEX` when `cur_len != total_len`) is a safety net that discards samples with mismatched tokenization rather than training on corrupted labels.
## Code Evidence
Version detection from `train.py:49-50`:
```python
from packaging import version
IS_TOKENIZER_GREATER_THAN_0_14 = version.parse(tokenizers.__version__) >= version.parse('0.14')
```
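For context, `version.parse` comes from the third-party `packaging` library. For plain `x.y.z` version strings, the same boundary check can be approximated with a stdlib-only tuple comparison (a rough sketch, not PEP 440-compliant and not LLaVA's code):

```python
def version_tuple(v):
    # Keep only the leading numeric components of the version string.
    main = v.split("+")[0].split("rc")[0]
    return tuple(int(p) for p in main.split(".") if p.isdigit())

# Mirrors IS_TOKENIZER_GREATER_THAN_0_14 for simple release versions.
assert version_tuple("0.14.1") >= version_tuple("0.14")
assert not version_tuple("0.13.3") >= version_tuple("0.14")
```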
v1 preprocessing offset correction from `train.py:477-479`:
```python
if i != 0 and not tokenizer.legacy and IS_TOKENIZER_GREATER_THAN_0_14:
    round_len -= 1
    instruction_len -= 1
```
MPT preprocessing inverse correction from `train.py:565-567`:
```python
if i != 0 and getattr(tokenizer, 'legacy', False) and IS_TOKENIZER_GREATER_THAN_0_14:
    round_len += 1
    instruction_len += 1
```
Mismatch safety warning from `train.py:486-492`:
```python
if cur_len < tokenizer.model_max_length:
    if cur_len != total_len:
        target[:] = IGNORE_INDEX
        print(
            f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}."
            f" (ignored)"
        )
```
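Putting the pieces together, the masking flow can be sketched end to end. This is a simplified stand-in for LLaVA's v1 preprocessing, not the actual implementation; `mask_labels` and the toy lengths are hypothetical, while `IGNORE_INDEX` and the legacy/version flags mirror `train.py`:

```python
IGNORE_INDEX = -100

def mask_labels(target, round_lens, instruction_lens, total_len,
                legacy, is_ge_0_14, model_max_length=2048):
    """Mask instruction tokens round by round; discard the whole sample
    (all IGNORE_INDEX) if the accumulated length mismatches."""
    cur_len = 0
    for i, (round_len, instruction_len) in enumerate(
            zip(round_lens, instruction_lens)):
        if i != 0 and not legacy and is_ge_0_14:
            round_len -= 1
            instruction_len -= 1
        # Mask the instruction part of this round; keep the answer tokens.
        target[cur_len:cur_len + instruction_len] = \
            [IGNORE_INDEX] * instruction_len
        cur_len += round_len
    if cur_len < model_max_length and cur_len != total_len:
        # Safety net: discard the sample rather than train on bad labels.
        target[:] = [IGNORE_INDEX] * len(target)
    return target

# Toy example: two rounds over a 9-token sequence; the per-round lengths
# include the extra BOS-like token that the correction removes for round 2.
labels = list(range(9))
out = mask_labels(labels, round_lens=[5, 5], instruction_lens=[3, 3],
                  total_len=9, legacy=False, is_ge_0_14=True)
```

With the correction applied, `cur_len` lands exactly on `total_len` and only the instruction spans are masked; without it (`is_ge_0_14=False` here), the lengths overshoot and the safety net wipes the entire sample.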