Principle: Hugging Face PEFT Causal LM Dataset Preparation
Metadata
- Sources: Hugging Face Data Collator Documentation, Hugging Face Datasets Processing Guide
- Domains: NLP, Data_Preprocessing
Overview
Causal LM Dataset Preparation covers the principles and techniques for transforming raw text or conversational data into the tokenized, batched format required by causal language model fine-tuning. This process involves tokenization, sequence length management (truncation and padding), label construction, and efficient data collation. Correct dataset preparation is essential for stable training and proper loss computation.
Theoretical Foundation
Causal Language Modeling Objective
In causal (autoregressive) language modeling, the model predicts each token given all preceding tokens. The training objective is to minimize the cross-entropy loss:
L = -sum_{t=1}^{T} log P(x_t | x_1, ..., x_{t-1})
For this objective, the labels are identical to the input_ids. Most Hugging Face causal LM implementations (e.g., GPT2LMHeadModel, LlamaForCausalLM) handle the shift internally -- the model receives `input_ids` and `labels` as the same sequence and, before computing the loss, aligns the logits at each position with the label at the next position (equivalently, `logits[..., :-1, :]` are scored against `labels[..., 1:]`). Therefore, when preparing data, labels should simply be set equal to input_ids.
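The internal alignment can be sketched in plain Python (no Transformers dependency; the token IDs below are illustrative, not from a real vocabulary):

```python
# Sketch of the label alignment performed inside HF causal LM forward passes.
input_ids = [15496, 11, 995, 0]   # e.g. "Hello , world !" (made-up IDs)
labels = list(input_ids)          # labels start out identical to input_ids

# Internally the model drops the last logit position and the first label,
# so the logits at position t are scored against the token at position t+1:
shift_logit_positions = list(range(len(input_ids) - 1))  # positions 0..T-2
shift_labels = labels[1:]                                # tokens 1..T-1

pairs = list(zip(shift_logit_positions, shift_labels))
print(pairs)   # position 0 predicts 11, position 1 predicts 995, ...
```

This is why no manual shifting is needed during dataset preparation: copying `input_ids` into `labels` is sufficient.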
Tokenization
Tokenization converts raw text into integer token IDs using a model-specific vocabulary. Key considerations:
- Maximum sequence length: Models have a fixed context window. Sequences exceeding this length must be truncated. The `max_length` parameter controls this cutoff.
- Truncation: Sequences longer than `max_length` are clipped to fit. For causal LM, truncation typically occurs at the end of the sequence.
- Padding: Sequences shorter than `max_length` are padded with a special `pad_token` to enable batching. Padding can be applied to a fixed length (`padding="max_length"`) during tokenization or dynamically during collation.
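The truncation and fixed-length padding behavior above can be sketched without a real tokenizer (a Hugging Face tokenizer with `truncation=True, padding="max_length"` does this internally; `pad_id=0` is an illustrative pad token ID):

```python
# Minimal sketch of end-truncation plus fixed-length padding.
def pad_and_truncate(token_ids, max_length, pad_id=0):
    ids = token_ids[:max_length]          # truncate at the end of the sequence
    attention_mask = [1] * len(ids)       # real tokens attend normally
    while len(ids) < max_length:          # pad up to the fixed length
        ids.append(pad_id)
        attention_mask.append(0)          # padding positions are masked out
    return ids, attention_mask

ids, mask = pad_and_truncate([5, 6, 7], max_length=5)
print(ids, mask)   # [5, 6, 7, 0, 0] [1, 1, 1, 0, 0]
```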
Data Collation
A data collator is responsible for assembling individual tokenized examples into batches. For causal language modeling:
- `DataCollatorForLanguageModeling` with `mlm=False` handles batching for causal LM
- It dynamically pads sequences to the longest length in the batch (when padding is not applied during tokenization)
- It creates the `labels` tensor from `input_ids`, replacing padding tokens with `-100` so they are ignored by the cross-entropy loss
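A pure-Python sketch of this collation logic (the real `DataCollatorForLanguageModeling` returns PyTorch tensors; here plain lists keep the example self-contained):

```python
# Sketch of causal-LM collation: pad each example to the longest sequence
# in the batch and build labels from input_ids, with -100 at padded positions.
def collate_causal_lm(batch, pad_id):
    max_len = max(len(ex) for ex in batch)
    input_ids, attention_mask, labels = [], [], []
    for ex in batch:
        n_pad = max_len - len(ex)
        input_ids.append(ex + [pad_id] * n_pad)
        attention_mask.append([1] * len(ex) + [0] * n_pad)
        labels.append(ex + [-100] * n_pad)   # -100 = ignored by the loss
    return {"input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": labels}

batch = collate_causal_lm([[1, 2, 3], [4, 5]], pad_id=0)
print(batch["labels"])   # [[1, 2, 3], [4, 5, -100]]
```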
Chat Template Application
For instruction-following or conversational data, chat templates transform structured message lists into flat text sequences before tokenization:
```python
# Structured conversation
messages = [
    {"role": "user", "content": "What is PEFT?"},
    {"role": "assistant", "content": "PEFT stands for Parameter-Efficient Fine-Tuning..."},
]

# After template application (ChatML format)
text = "<|im_start|>user\nWhat is PEFT?<|im_end|>\n<|im_start|>assistant\nPEFT stands for..."
```
The template application step must occur before tokenization and may require adding special tokens to the tokenizer vocabulary and resizing model embeddings.
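A hand-rolled ChatML formatter illustrates what `tokenizer.apply_chat_template(messages, tokenize=False)` produces for ChatML-style models; real chat templates are model-specific Jinja templates, so this is only a sketch of the general pattern:

```python
# Illustrative ChatML formatter (real templates vary per model).
def apply_chatml(messages, add_generation_prompt=False):
    text = ""
    for m in messages:
        text += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    if add_generation_prompt:            # open an assistant turn for inference
        text += "<|im_start|>assistant\n"
    return text

messages = [{"role": "user", "content": "What is PEFT?"}]
print(apply_chatml(messages, add_generation_prompt=True))
```

For training data, the assistant reply is included in the messages and `add_generation_prompt` is left off; the generation prompt is used at inference time.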
Key Concepts
- Label Masking: Padding tokens in labels are replaced with `-100`, which is the default ignore index in PyTorch's `CrossEntropyLoss`. This ensures the model is not penalized for predicting padding tokens.
- Column Removal: After tokenization, the original text columns are removed from the dataset via `remove_columns` to keep only the numeric tensor columns (`input_ids`, `attention_mask`, `labels`).
- Batched Processing: The `dataset.map()` function with `batched=True` processes multiple examples simultaneously, which is significantly faster than processing one example at a time.
- EOS Token as Pad Token: Many causal LM models do not have a dedicated pad token. A common convention is to set the pad token equal to the EOS (end-of-sequence) token: `tokenizer.pad_token = tokenizer.eos_token`.
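The batched-processing contract can be sketched without the `datasets` library: a `batched=True` map function receives a dict of column name to list of values and returns new columns for the whole batch (`fake_tokenize` below is a stand-in for a real tokenizer):

```python
# Sketch of the function signature used with dataset.map(..., batched=True).
def fake_tokenize(text):
    return [len(w) for w in text.split()]   # stand-in: one "ID" per word

def tokenize_batch(batch):
    # batch is {"text": [ex1, ex2, ...]}; return new columns for all examples.
    return {"input_ids": [fake_tokenize(t) for t in batch["text"]]}

batch = {"text": ["hello world", "peft is efficient"]}
out = tokenize_batch(batch)
print(out["input_ids"])   # [[5, 5], [4, 2, 9]]
```

With the real library, passing `remove_columns=["text"]` to `map()` drops the raw text column, leaving only the numeric columns.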
Practical Implications
- Always set `mlm=False` in `DataCollatorForLanguageModeling` for causal LM -- the `mlm=True` setting is for masked language models (BERT-style)
- Choose `max_length` carefully: too short truncates important context, too long wastes memory on padding
- For PEFT fine-tuning, dataset preparation is identical to full fine-tuning -- the adapter only affects model architecture, not data processing
- When using chat templates, ensure special tokens are added to the tokenizer before tokenization and that model embeddings are resized to match
- Dynamic padding (via the data collator) is generally more memory-efficient than fixed-length padding during tokenization
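The memory difference between the two padding strategies is easy to quantify with a back-of-the-envelope count of padded positions (the sequence lengths below are illustrative):

```python
# Rough comparison: fixed-length padding pads every example to max_length,
# while dynamic padding only pads to the batch's longest example.
lengths = [12, 48, 20, 31]   # illustrative tokenized lengths in one batch
max_length = 512

fixed_total = max_length * len(lengths)      # every row is 512 tokens wide
dynamic_total = max(lengths) * len(lengths)  # rows are only 48 tokens wide
print(fixed_total, dynamic_total)   # 2048 192
```

Here dynamic padding shrinks the batch by more than 10x; the savings grow when typical sequences are much shorter than the model's context window.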