
Heuristic:OpenGVLab InternVL Loss Reduction Strategy

From Leeroopedia



Knowledge Sources
Domains Optimization, LLMs, Training
Last Updated 2026-02-07 14:00 GMT

Overview

Three loss reduction strategies (`token`, `sample`, `square`) for balancing training loss across variable-length responses, with `square` (1/sqrt) recommended for balanced learning.

Description

InternVL provides three strategies for weighting the loss contribution of each training sample based on the number of active (non-ignored) tokens. This addresses the problem that longer responses naturally dominate the loss when using standard token-level averaging, which can bias the model toward verbose outputs. The `len2weight` function maps sample length to a loss weight using one of three reduction modes.

Usage

Apply this heuristic when fine-tuning InternVL models if the model produces overly verbose or overly terse responses. Use the `--loss_reduction` training argument to select the strategy. The `square` strategy is recommended as the default choice for balanced training across response lengths.

The Insight (Rule of Thumb)

  • Action: Set `--loss_reduction square` in training arguments.
  • Value: Three options:
    • `token`: Weight = 1 (standard per-token averaging, default)
    • `sample`: Weight = 1/length (equal per-sample, short samples dominate)
    • `square`: Weight = 1/sqrt(length) (balanced compromise)
  • Trade-off: `token` favors long responses; `sample` favors short responses; `square` provides a balanced middle ground.
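The trade-off can be seen by comparing each sample's total gradient contribution, which is roughly weight × number of tokens. The snippet below is illustrative (the `weight` helper is ours, not from the InternVL source), showing how each strategy scales with response length:

```python
# Illustrative comparison (helper names are ours, not InternVL's):
# a sample's total loss contribution scales as weight * num_tokens.
def weight(n, mode):
    return {"token": 1.0, "sample": 1.0 / n, "square": n ** -0.5}[mode]

for n in (4, 64, 1024):
    contrib = {m: round(weight(n, m) * n, 3) for m in ("token", "sample", "square")}
    print(n, contrib)
# 'token' contribution grows linearly with length, 'sample' stays
# constant regardless of length, and 'square' grows as sqrt(length).
```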

Reasoning

In multimodal training, response lengths vary dramatically (from single-word VQA answers to multi-paragraph descriptions). Standard token-level loss averaging (`token` mode) means long responses contribute disproportionately to the gradient, potentially biasing the model. The `square` mode uses 1/sqrt(length) weighting, which gives moderate preference to shorter samples while not completely ignoring the signal from longer ones. The 1/sqrt(length) weight is the geometric mean of the `token` weight (1) and the `sample` weight (1/length), so it sits exactly between the two extremes.
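The geometric-mean relationship is a simple identity: sqrt(1 × 1/x) = 1/sqrt(x). A one-line check:

```python
import math

# The 'square' weight 1/sqrt(x) is the geometric mean of the
# 'token' weight (1) and the 'sample' weight (1/x).
x = 100
assert math.isclose(math.sqrt(1.0 * (1.0 / x)), x ** -0.5)
```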

Code Evidence

From `internvl_chat_finetune.py:786-795`:

def len2weight(x, loss_reduction):
    if x == 0:
        return x
    if loss_reduction == 'token':
        return 1
    if loss_reduction == 'sample':
        return 1 / x
    if loss_reduction == 'square':
        return 1 / (x ** 0.5)
    raise NotImplementedError(loss_reduction)

Default value from `internvl_chat_finetune.py` ModelArguments dataclass:

loss_reduction: str = field(
    default='token',
    metadata={'help': 'Loss reduction method: token, sample, or square.'}
)
