
Heuristic: Haotian Liu LLaVA Tokenizer Version Offset Correction

From Leeroopedia
Domains NLP, Debugging
Last Updated 2026-02-13 23:00 GMT

Overview

Apply a token offset correction of -1 when using `tokenizers >= 0.14` with non-legacy tokenizers to prevent tokenization mismatch warnings and silent label corruption during training.

Description

The HuggingFace `tokenizers` library changed its behavior starting in version 0.14: non-legacy tokenizers add an extra token at the beginning of tokenized sequences. LLaVA explicitly detects this version boundary with `IS_TOKENIZER_GREATER_THAN_0_14` and applies a -1 offset correction to `round_len` and `instruction_len` during conversation tokenization for the v1 and MPT preprocessing paths. Without this correction, the label masking logic would be off by one token, causing either silent training degradation or explicit "tokenization mismatch" warnings.
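LLaVA's version gate reduces to a single comparison with `packaging.version`. A minimal sketch of that gate, with the version string passed in explicitly as a stand-in for `tokenizers.__version__` (the function name here is hypothetical; LLaVA stores the result in the module-level constant `IS_TOKENIZER_GREATER_THAN_0_14`):

```python
from packaging import version

def is_tokenizer_greater_than_0_14(tokenizers_version: str) -> bool:
    """Return True when the installed tokenizers library is >= 0.14,
    i.e. when the -1 offset correction must be applied."""
    return version.parse(tokenizers_version) >= version.parse("0.14")

print(is_tokenizer_greater_than_0_14("0.13.3"))  # False: no correction needed
print(is_tokenizer_greater_than_0_14("0.14.1"))  # True: apply -1 offset
```

`version.parse` handles pre-release and patch suffixes correctly, which a naive string comparison would not (e.g. `"0.9" > "0.14"` lexicographically).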

Usage

Apply this heuristic when modifying any tokenization or label masking code in LLaVA. The version check is automatic in the existing code, but custom tokenization code must also account for this offset. The correction applies specifically to multi-round conversations where individual rounds are tokenized separately.
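Custom tokenization code can mirror LLaVA's correction with a small helper. This is a sketch under the source's stated conditions (non-first round, non-legacy tokenizer, `tokenizers >= 0.14`); the function name and boolean parameters are illustrative, not part of LLaVA's API:

```python
def corrected_lengths(round_len: int, instruction_len: int, round_idx: int,
                      legacy: bool, tokenizer_gte_0_14: bool) -> tuple:
    """Apply LLaVA's -1 offset to per-round token lengths.

    The first round (round_idx == 0) is left untouched; later rounds,
    tokenized separately, each report one spurious extra token on
    non-legacy tokenizers >= 0.14."""
    if round_idx != 0 and not legacy and tokenizer_gte_0_14:
        round_len -= 1
        instruction_len -= 1
    return round_len, instruction_len

print(corrected_lengths(10, 4, 0, legacy=False, tokenizer_gte_0_14=True))  # (10, 4)
print(corrected_lengths(10, 4, 1, legacy=False, tokenizer_gte_0_14=True))  # (9, 3)
```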

The Insight (Rule of Thumb)

  • Action: Check `IS_TOKENIZER_GREATER_THAN_0_14` before computing round-level token lengths. Apply -1 offset to both `round_len` and `instruction_len` for non-first rounds when using `tokenizers >= 0.14` with non-legacy tokenizers.
  • Value: Prevents off-by-one errors in label masking, ensuring correct loss computation.
  • Trade-off: None — this is a correctness fix, not a trade-off.
  • Symptom: Without correction, training will print `WARNING: tokenization mismatch: {cur_len} vs. {total_len}. (ignored)` and set all labels to `IGNORE_INDEX`, effectively discarding that training sample.

Reasoning

The tokenizer behavior change in v0.14 means that when tokenizing substrings individually (as done for multi-round conversations), each tokenized round gets an extra BOS-like token. When these are concatenated during label masking, the offsets accumulate incorrectly. The fix accounts for this by subtracting 1 from the computed lengths for rounds after the first one. The code also handles the inverse case for MPT models where `tokenizer.legacy=True` causes the opposite behavior.
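The accumulation is easy to see in a toy calculation, assuming each separately tokenized round after the first reports one token more than it actually occupies in the concatenated sequence (the lengths below are made-up values for illustration):

```python
# True per-round lengths within the concatenated token ids.
true_round_lens = [12, 9, 11]

# What per-round tokenization reports: rounds after the first each
# gain one spurious BOS-like token under tokenizers >= 0.14.
reported_lens = [true_round_lens[0]] + [n + 1 for n in true_round_lens[1:]]

drift = sum(reported_lens) - sum(true_round_lens)
print(drift)  # 2: one extra token per non-first round
```

Without the -1 correction, this drift is exactly the gap that trips the `cur_len != total_len` check, and it grows with the number of conversation rounds.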

The fallback warning mechanism (`target[:] = IGNORE_INDEX` when `cur_len != total_len`) is a safety net that discards samples with mismatched tokenization rather than training on corrupted labels.
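The masking-plus-safety-net flow can be sketched end to end. This is a simplified model of the logic, not LLaVA's actual implementation: labels are plain Python lists, and only the instruction span of each round is masked (LLaVA operates on tensors and also masks separators):

```python
IGNORE_INDEX = -100  # same sentinel LLaVA uses for unsupervised positions

def mask_labels(target, round_lens, instruction_lens, total_len, model_max_length):
    """Mask instruction tokens round by round; discard the sample on mismatch."""
    cur_len = 0
    for r_len, i_len in zip(round_lens, instruction_lens):
        # Supervise only the answer portion of each round.
        for j in range(cur_len, cur_len + i_len):
            target[j] = IGNORE_INDEX
        cur_len += r_len
    if cur_len < model_max_length and cur_len != total_len:
        # Safety net: off-by-one drift detected, so ignore the whole
        # sample rather than train on shifted labels.
        target[:] = [IGNORE_INDEX] * len(target)
        print(f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}. (ignored)")
    return target

# Matching lengths: only instruction positions are masked.
print(mask_labels(list(range(6)), [3, 3], [1, 1], total_len=6, model_max_length=100))
# Mismatched lengths: the entire sample is discarded.
print(mask_labels(list(range(6)), [3, 3], [1, 1], total_len=7, model_max_length=100))
```

The `cur_len < model_max_length` guard matters: when a sample is truncated at the context limit, `cur_len != total_len` is expected and should not trigger the discard.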

Code Evidence

Version detection from `train.py:49-50`:

from packaging import version
IS_TOKENIZER_GREATER_THAN_0_14 = version.parse(tokenizers.__version__) >= version.parse('0.14')

v1 preprocessing offset correction from `train.py:477-479`:

if i != 0 and not tokenizer.legacy and IS_TOKENIZER_GREATER_THAN_0_14:
    round_len -= 1
    instruction_len -= 1

MPT preprocessing inverse correction from `train.py:565-567`:

if i != 0 and getattr(tokenizer, 'legacy', False) and IS_TOKENIZER_GREATER_THAN_0_14:
    round_len += 1
    instruction_len += 1

Mismatch safety warning from `train.py:486-492`:

if cur_len < tokenizer.model_max_length:
    if cur_len != total_len:
        target[:] = IGNORE_INDEX
        print(
            f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}."
            f" (ignored)"
        )
