# Heuristic: OpenGVLab InternVL Loss Reduction Strategy
| Knowledge Sources | |
|---|---|
| Domains | Optimization, LLMs, Training |
| Last Updated | 2026-02-07 14:00 GMT |
## Overview
Three loss reduction strategies (`token`, `sample`, `square`) for balancing training loss across variable-length responses, with `square` (1/sqrt) recommended for balanced learning.
## Description
InternVL provides three strategies for weighting the loss contribution of each training sample based on the number of active (non-ignored) tokens. This addresses the problem that longer responses naturally dominate the loss when using standard token-level averaging, which can bias the model toward verbose outputs. The `len2weight` function maps sample length to a loss weight using one of three reduction modes.
## Usage
Apply this heuristic when fine-tuning InternVL models and the model produces overly verbose or overly terse responses. Use the `--loss_reduction` training argument to select the strategy. The `square` strategy is recommended as the default for balanced training across response lengths.
## The Insight (Rule of Thumb)
- Action: Set `--loss_reduction square` in training arguments.
- Value: Three options:
- `token`: Weight = 1 (standard per-token averaging, default)
- `sample`: Weight = 1/length (equal per-sample, short samples dominate)
- `square`: Weight = 1/sqrt(length) (balanced compromise)
- Trade-off: `token` favors long responses; `sample` favors short responses; `square` provides a balanced middle ground.
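The trade-off becomes concrete if you look at a sample's total contribution to the loss, weight × length. This sketch assumes the weight scales the sample's summed token loss, which is a simplification for illustration:

```python
# Effective total contribution of one sample of n active tokens under each
# mode, assuming the weight multiplies the sample's summed token loss.
def contribution(n, mode):
    weight = {'token': 1, 'sample': 1 / n, 'square': 1 / n ** 0.5}[mode]
    return weight * n  # token: n, sample: 1, square: sqrt(n)

for n in (1, 4, 100):
    print(n, [contribution(n, m) for m in ('token', 'sample', 'square')])
```

Under `token`, a 100-token response pulls on the gradient 100× harder than a 1-token answer; under `sample`, both pull equally; under `square`, the long response pulls 10× harder.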
## Reasoning
In multimodal training, response lengths vary dramatically (from single-word VQA answers to multi-paragraph descriptions). Standard token-level loss averaging (`token` mode) means long responses contribute disproportionately to the gradient, potentially biasing the model. The `square` mode uses 1/sqrt(length) weighting, which gives moderate preference to shorter samples while not completely ignoring the signal from longer ones. This weight is precisely the geometric mean of the two extremes: sqrt(1 · 1/length) = 1/sqrt(length).
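One plausible way such weights enter the aggregate loss is sketched below. The function name and the normalization are illustrative assumptions, not necessarily InternVL's exact implementation:

```python
import math

def reduce_loss(samples, mode):
    """Combine per-token losses across samples with length-dependent weights.

    Illustrative aggregation only; InternVL's actual normalization may differ.
    """
    total, norm = 0.0, 0.0
    for token_losses in samples:  # one list of per-token losses per sample
        n = len(token_losses)
        if n == 0:
            continue
        w = {'token': 1.0, 'sample': 1.0 / n, 'square': 1.0 / math.sqrt(n)}[mode]
        total += w * sum(token_losses)
        norm += w * n
    return total / norm

# 'token' recovers the global per-token mean; 'sample' recovers the mean of
# per-sample means; 'square' lands between the two.
```

For example, given one 4-token sample with per-token loss 1.0 and one 1-token sample with loss 3.0, `token` yields 1.4, `sample` yields 2.0, and `square` yields 5/3, showing how `square` interpolates between the extremes.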
## Code Evidence
From `internvl_chat_finetune.py:786-795`:
```python
def len2weight(x, loss_reduction):
    if x == 0:
        return x
    if loss_reduction == 'token':
        return 1
    if loss_reduction == 'sample':
        return 1 / x
    if loss_reduction == 'square':
        return 1 / (x ** 0.5)
    raise NotImplementedError(loss_reduction)
```
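Plugging in a few lengths shows the scaling directly. The function is reproduced here so the snippet runs standalone:

```python
def len2weight(x, loss_reduction):
    # Copied from internvl_chat_finetune.py for a self-contained demo.
    if x == 0:
        return x  # zero active tokens -> zero weight
    if loss_reduction == 'token':
        return 1
    if loss_reduction == 'sample':
        return 1 / x
    if loss_reduction == 'square':
        return 1 / (x ** 0.5)
    raise NotImplementedError(loss_reduction)

for n in (0, 1, 16, 256):
    # e.g. n=16 -> [1, 0.0625, 0.25]
    print(n, [len2weight(n, m) for m in ('token', 'sample', 'square')])
```

Note the `x == 0` guard: a sample with no active tokens gets weight 0 regardless of the mode, so fully ignored samples never contribute to the loss.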
Default value from `internvl_chat_finetune.py` ModelArguments dataclass:
```python
loss_reduction: str = field(
    default='token',
    metadata={'help': 'Loss reduction method: token, sample, or square.'}
)
```