
Principle:LLMBook zh LLMBook zh github io Batch Data Collation

From Leeroopedia


Knowledge Sources
Domains NLP, Data_Engineering
Last Updated 2026-02-08 00:00 GMT

Overview

A batching technique that pads variable-length sequences to equal length within a batch, using appropriate padding values for inputs and labels.

Description

Batch Data Collation addresses the fundamental requirement of batched neural network training: all sequences in a batch must have the same length. For supervised fine-tuning, this means padding input_ids with the tokenizer's pad_token_id and padding labels with IGNORE_INDEX (-100) so that padding positions do not contribute to the loss.

Usage

Use this principle when training on variable-length sequence data, such as supervised fine-tuning examples of differing token counts. Pass the collator to the Trainer's data_collator argument so it is applied to every batch.
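As a minimal sketch of such a collator (names like `IGNORE_INDEX` and `collate_batch`, and the pad id of 0, are illustrative assumptions, not taken from a specific library):

```python
# Sketch of a batch collator: pads input_ids with the tokenizer's pad id
# and pads labels with IGNORE_INDEX so padded positions are excluded from the loss.
# Returns nested lists; a real implementation would return stacked tensors.
IGNORE_INDEX = -100  # label value ignored by cross-entropy loss


def collate_batch(features, pad_token_id=0):
    """Pad every example in `features` to the batch's maximum length."""
    max_len = max(len(f["input_ids"]) for f in features)
    input_ids, labels = [], []
    for f in features:
        pad = max_len - len(f["input_ids"])
        input_ids.append(f["input_ids"] + [pad_token_id] * pad)
        labels.append(f["labels"] + [IGNORE_INDEX] * pad)
    return {"input_ids": input_ids, "labels": labels}


# With a Hugging Face Trainer, a collator like this would be passed as:
# Trainer(model=..., args=..., train_dataset=..., data_collator=collate_batch)
```

The key design choice is the two distinct padding values: pad_token_id keeps input tensors well-formed for the model, while IGNORE_INDEX tells the loss function to skip those positions entirely.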

Theoretical Basis

Given a batch of sequences with varying lengths:

  1. Find the maximum length in the batch.
  2. Pad all input_ids to max length with pad_token_id.
  3. Pad all labels to max length with IGNORE_INDEX (-100).
  4. Stack into batch tensors.
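The four steps above can be traced on a toy batch (the token ids and pad id are illustrative assumptions):

```python
# Trace of the four collation steps on a toy batch of token-id sequences.
PAD_TOKEN_ID = 0      # assumed tokenizer pad id
IGNORE_INDEX = -100   # standard label-padding value ignored by the loss

batch = [
    {"input_ids": [5, 6, 7, 8], "labels": [5, 6, 7, 8]},
    {"input_ids": [9, 10],      "labels": [9, 10]},
]

# 1. Find the maximum length in the batch.
max_len = max(len(ex["input_ids"]) for ex in batch)

# 2. Pad all input_ids to max length with pad_token_id.
padded_inputs = [
    ex["input_ids"] + [PAD_TOKEN_ID] * (max_len - len(ex["input_ids"]))
    for ex in batch
]

# 3. Pad all labels to max length with IGNORE_INDEX.
padded_labels = [
    ex["labels"] + [IGNORE_INDEX] * (max_len - len(ex["labels"]))
    for ex in batch
]

# 4. Stack into batch tensors (kept as nested lists here;
#    torch.tensor(padded_inputs) would do the stacking in practice).
print(padded_inputs)  # [[5, 6, 7, 8], [9, 10, 0, 0]]
print(padded_labels)  # [[5, 6, 7, 8], [9, 10, -100, -100]]
```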

Related Pages

Implemented By

Uses Heuristic
