Principle: Batch Data Collation (LLMBook-zh, llmbook-zh.github.io)
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Engineering |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
A batching technique that pads variable-length sequences to equal length within a batch, using appropriate padding values for inputs and labels.
Description
Batch Data Collation addresses a fundamental requirement of batched neural network training: all sequences in a batch must have the same length. For supervised fine-tuning, this means padding input_ids with the tokenizer's pad_token_id and padding labels with IGNORE_INDEX (-100, the default ignore_index of PyTorch's cross-entropy loss), so that padded positions contribute nothing to the loss.
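A minimal sketch of such a collator, assuming each example is a dict with "input_ids" and "labels" lists; the function name and the placeholder pad token id are illustrative, not from the source. Real implementations return torch tensors (e.g. via torch.tensor), but plain nested lists keep the sketch dependency-free:

```python
IGNORE_INDEX = -100  # positions with this label are ignored by the loss
PAD_TOKEN_ID = 0     # placeholder; use tokenizer.pad_token_id in practice

def collate_batch(examples):
    """Pad a list of {"input_ids": [...], "labels": [...]} dicts to equal length."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    input_ids, labels = [], []
    for ex in examples:
        pad = max_len - len(ex["input_ids"])
        # Inputs are padded with the tokenizer's pad token...
        input_ids.append(ex["input_ids"] + [PAD_TOKEN_ID] * pad)
        # ...while labels are padded with IGNORE_INDEX so padding adds no loss.
        labels.append(ex["labels"] + [IGNORE_INDEX] * pad)
    return {"input_ids": input_ids, "labels": labels}
```

An instance of this collator (wrapped as a callable) is what would be passed as the Trainer's data_collator.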
Usage
Use this principle when training on variable-length sequence data. Pass the collator to the Trainer's data_collator argument.
Theoretical Basis
Given a batch of sequences with varying lengths:
- Find the maximum length in the batch.
- Pad all input_ids to max length with pad_token_id.
- Pad all labels to max length with IGNORE_INDEX (-100).
- Stack into batch tensors.
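The four steps above can be traced on a concrete batch; the token ids (101, 102, etc.) are made up for illustration, and pad_token_id = 0 is an assumption:

```python
IGNORE_INDEX = -100
pad_token_id = 0  # assumed; use the tokenizer's actual pad token id

# A batch of two sequences with lengths 4 and 3.
input_ids = [[101, 7, 8, 102], [101, 9, 102]]
labels = [[-100, 7, 8, 102], [-100, 9, 102]]

# Step 1: find the maximum length in the batch.
max_len = max(len(seq) for seq in input_ids)

# Step 2: pad all input_ids to max_len with pad_token_id.
padded_ids = [seq + [pad_token_id] * (max_len - len(seq)) for seq in input_ids]

# Step 3: pad all labels to max_len with IGNORE_INDEX.
padded_labels = [seq + [IGNORE_INDEX] * (max_len - len(seq)) for seq in labels]

# Step 4: stack into batch tensors; in a real pipeline this would be
# torch.tensor(padded_ids) and torch.tensor(padded_labels).
```

After padding, both rows have length 4, and the extra positions in padded_labels hold -100 so they are excluded from the loss.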