Principle: Batch Data Collation (LLMBook-zh, llmbook-zh.github.io)
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Engineering |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
A batching technique that pads variable-length sequences to equal length within a batch, using appropriate padding values for inputs and labels.
Description
Batch Data Collation addresses a fundamental requirement of batched neural network training: all sequences in a batch must have the same length. For supervised fine-tuning, this means padding input_ids with the tokenizer's pad_token_id and padding labels with IGNORE_INDEX (-100, the default ignore_index of PyTorch's cross-entropy loss), so that padded positions contribute nothing to the loss.
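A minimal sketch of such a collator, assuming each example is a dict with "input_ids" and "labels" lists; the function name and the placeholder pad token id are illustrative, not from the source. Real implementations return torch tensors (e.g. via torch.tensor), but plain nested lists keep the sketch dependency-free:

```python
IGNORE_INDEX = -100  # positions with this label are ignored by the loss
PAD_TOKEN_ID = 0     # placeholder; use tokenizer.pad_token_id in practice

def collate_batch(examples):
    """Pad a list of {"input_ids": [...], "labels": [...]} dicts to equal length."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    input_ids, labels = [], []
    for ex in examples:
        pad = max_len - len(ex["input_ids"])
        # Inputs are padded with the tokenizer's pad token...
        input_ids.append(ex["input_ids"] + [PAD_TOKEN_ID] * pad)
        # ...while labels are padded with IGNORE_INDEX so padding adds no loss.
        labels.append(ex["labels"] + [IGNORE_INDEX] * pad)
    return {"input_ids": input_ids, "labels": labels}
```

An instance of this collator (wrapped as a callable) is what would be passed as the Trainer's data_collator.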
Usage
Use this principle when training on variable-length sequence data. Pass the collator to the Trainer's data_collator argument.
Theoretical Basis
Given a batch of sequences with varying lengths:
- Find the maximum length in the batch.
- Pad all input_ids to max length with pad_token_id.
- Pad all labels to max length with IGNORE_INDEX (-100).
- Stack into batch tensors.
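The four steps above can be traced on a concrete batch; the token ids (101, 102, etc.) are made up for illustration, and pad_token_id = 0 is an assumption:

```python
IGNORE_INDEX = -100
pad_token_id = 0  # assumed; use the tokenizer's actual pad token id

# A batch of two sequences with lengths 4 and 3.
input_ids = [[101, 7, 8, 102], [101, 9, 102]]
labels = [[-100, 7, 8, 102], [-100, 9, 102]]

# Step 1: find the maximum length in the batch.
max_len = max(len(seq) for seq in input_ids)

# Step 2: pad all input_ids to max_len with pad_token_id.
padded_ids = [seq + [pad_token_id] * (max_len - len(seq)) for seq in input_ids]

# Step 3: pad all labels to max_len with IGNORE_INDEX.
padded_labels = [seq + [IGNORE_INDEX] * (max_len - len(seq)) for seq in labels]

# Step 4: stack into batch tensors; in a real pipeline this would be
# torch.tensor(padded_ids) and torch.tensor(padded_labels).
```

After padding, both rows have length 4, and the extra positions in padded_labels hold -100 so they are excluded from the loss.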