Implementation: Hugging Face PEFT Causal LM Data Pipeline
Metadata
- Source: examples/dora_finetuning/dora_finetuning.py:L92-103, examples/sft/utils.py:L48-83
- Repository: huggingface/peft
- Type: Pattern Doc
- Domains: NLP, Data_Preprocessing
Overview
This implementation documents the dataset preparation pattern used across PEFT causal language model examples. Two distinct patterns are demonstrated: (1) a direct tokenization pattern for simple text datasets, and (2) a chat template pattern for conversational instruction-following datasets. Both patterns produce tokenized datasets compatible with DataCollatorForLanguageModeling or SFTTrainer.
Pattern 1: Direct Tokenization (dora_finetuning example)
This pattern is used for plain text datasets where each example has a "text" field.
Source: examples/dora_finetuning/dora_finetuning.py:L92-103
```python
from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling

# Load dataset from Hugging Face Hub or local path
dataset = load_dataset(data_path)

def tokenize_function(examples):
    inputs = tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=cutoff_len,
    )
    inputs["labels"] = inputs["input_ids"].copy()  # labels = input_ids for causal LM
    return inputs

# Tokenize the dataset with batched processing
tokenized_datasets = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=dataset["train"].column_names,
)

# Data collator for dynamic batching (mlm=False for causal LM)
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
```
Step-by-Step Breakdown
- Load dataset: `load_dataset(data_path)` loads from the Hugging Face Hub or a local path, returning a `DatasetDict` with splits.
- Define tokenize function: The function receives batched examples (a dict of lists), tokenizes the `"text"` field with fixed-length padding and truncation, and copies `input_ids` to `labels`.
- Apply tokenization: `dataset.map()` with `batched=True` processes examples in batches for efficiency. `remove_columns` drops the original text columns, leaving only the token-id fields.
- Create data collator: `DataCollatorForLanguageModeling` with `mlm=False` handles batch assembly for causal language modeling. It replaces padding tokens in labels with `-100` so they are ignored by the loss.
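The collator's label masking can be sketched in plain Python. This is a simplified stand-in for `DataCollatorForLanguageModeling(mlm=False)`; `PAD_ID` and `collate_causal_lm` are hypothetical names for illustration, not PEFT or transformers APIs:

```python
PAD_ID = 0  # hypothetical pad token id

def collate_causal_lm(batch):
    """Pad input_ids to the longest sequence in the batch; labels mirror
    input_ids, with padding positions replaced by -100 (ignored by the loss)."""
    max_len = max(len(ex["input_ids"]) for ex in batch)
    input_ids, attention_mask, labels = [], [], []
    for ex in batch:
        ids = ex["input_ids"]
        pad = max_len - len(ids)
        input_ids.append(ids + [PAD_ID] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)
        labels.append(ids + [-100] * pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

batch = [{"input_ids": [5, 6, 7]}, {"input_ids": [8, 9]}]
out = collate_causal_lm(batch)
# out["labels"][1] == [8, 9, -100]: the padded position is masked out of the loss
```

Because padding positions carry `-100` in the labels, the model never pays a loss penalty for predicting (or failing to predict) pad tokens.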
Pattern 2: Chat Template Processing (sft/utils.py example)
This pattern is used for conversational datasets with structured message lists.
Source: examples/sft/utils.py:L48-83
```python
import os

from datasets import DatasetDict, load_dataset, load_from_disk
from datasets.builder import DatasetGenerationError

def create_datasets(tokenizer, data_args, training_args, apply_chat_template=False):
    def preprocess(samples):
        batch = []
        for conversation in samples["messages"]:
            batch.append(tokenizer.apply_chat_template(conversation, tokenize=False))
        return {"content": batch}

    raw_datasets = DatasetDict()
    for split in data_args.splits.split(","):
        try:
            # Try loading from Hugging Face Hub
            dataset = load_dataset(data_args.dataset_name, split=split)
        except DatasetGenerationError:
            # Fall back to local disk
            dataset = load_from_disk(os.path.join(data_args.dataset_name, split))

        if "train" in split:
            raw_datasets["train"] = dataset
        elif "test" in split:
            raw_datasets["test"] = dataset

    if apply_chat_template:
        raw_datasets = raw_datasets.map(
            preprocess,
            batched=True,
            remove_columns=raw_datasets["train"].column_names,
        )

    train_data = raw_datasets["train"]
    valid_data = raw_datasets["test"]
    return train_data, valid_data
```
Step-by-Step Breakdown
- Define preprocessing function: For each conversation in the batch, `tokenizer.apply_chat_template()` converts the structured message list into a formatted string according to the tokenizer's configured chat template (ChatML, Zephyr, etc.).
- Load datasets with fallback: Attempts to load from the Hub first; if that raises `DatasetGenerationError`, falls back to loading from local disk. Handles both `"train"` and `"test"` splits.
- Apply chat template: When `apply_chat_template=True`, the structured messages are converted to flat text. The original columns are removed, leaving a single `"content"` column.
- Return split datasets: The function returns separate train and validation datasets ready for SFTTrainer consumption.
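To make the flattening step concrete, here is a minimal stand-in for what `apply_chat_template(conversation, tokenize=False)` returns under a ChatML-style template. `render_chatml` is a hypothetical helper, not the transformers implementation:

```python
def render_chatml(messages, add_generation_prompt=False):
    # Mirrors the ChatML format: <|im_start|>role\ncontent<|im_end|>\n per message
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        text += "<|im_start|>assistant\n"  # cue the model to produce a reply
    return text

conversation = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there."},
]
flat = render_chatml(conversation)
# flat == "<|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\nHi there.<|im_end|>\n"
```

The resulting flat string is what lands in the `"content"` column; tokenization of that string is left to the SFTTrainer.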
Chat Template Configuration
The chat template is configured on the tokenizer before dataset creation. From examples/sft/utils.py:
```python
# ChatML template
DEFAULT_CHATML_CHAT_TEMPLATE = (
    "{% for message in messages %}\n"
    "{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}"
    "{% if loop.last and add_generation_prompt %}"
    "{{'<|im_start|>assistant\n'}}"
    "{% endif %}{% endfor %}"
)

# Zephyr template
DEFAULT_ZEPHYR_CHAT_TEMPLATE = (
    "{% for message in messages %}\n"
    "{% if message['role'] == 'user' %}\n"
    "{{ '<|user|>\n' + message['content'] + eos_token }}\n"
    "{% elif message['role'] == 'system' %}\n"
    "{{ '<|system|>\n' + message['content'] + eos_token }}\n"
    "{% elif message['role'] == 'assistant' %}\n"
    "{{ '<|assistant|>\n' + message['content'] + eos_token }}\n"
    "{% endif %}\n{% endfor %}"
)

# Apply to tokenizer
tokenizer.chat_template = chat_template
```
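These templates are Jinja2 strings; transformers renders them when `apply_chat_template` is called. As a sketch, the user branch of the Zephyr template can be rendered standalone with `jinja2` (assuming `jinja2` is available; the `eos_token` value is supplied by the real tokenizer in practice):

```python
from jinja2 import Template

# User branch of the Zephyr template, isolated for illustration
ZEPHYR_USER_TEMPLATE = (
    "{% for message in messages %}"
    "{% if message['role'] == 'user' %}"
    "{{ '<|user|>\n' + message['content'] + eos_token }}"
    "{% endif %}"
    "{% endfor %}"
)

rendered = Template(ZEPHYR_USER_TEMPLATE).render(
    messages=[{"role": "user", "content": "Hi"}],
    eos_token="</s>",  # assumed value; each tokenizer defines its own
)
# rendered == "<|user|>\nHi</s>"
```

Template variables such as `messages`, `eos_token`, and `add_generation_prompt` are injected by the tokenizer at render time.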
Key Parameters
| Parameter | Description | Typical Values |
|---|---|---|
| `max_length` / `cutoff_len` | Maximum sequence length for truncation | 512, 1024, 2048, 4096 |
| `padding` | Padding strategy | `"max_length"` (fixed), `True` (dynamic) |
| `truncation` | Whether to truncate long sequences | `True` |
| `batched` | Process examples in batches during `map` | `True` |
| `remove_columns` | Columns to drop after tokenization | Original text column names |
| `mlm` | Masked language modeling flag | `False` for causal LM |
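The fixed vs. dynamic padding strategies in the table can be contrasted with a small sketch (the pad id of 0 and the helper names are assumptions for illustration):

```python
def pad_fixed(ids, max_length, pad_id=0):
    """padding="max_length": truncate, then pad every sequence to max_length."""
    ids = ids[:max_length]  # truncation=True
    return ids + [pad_id] * (max_length - len(ids))

def pad_dynamic(batch, pad_id=0):
    """padding=True: pad only to the longest sequence in the batch."""
    longest = max(len(ids) for ids in batch)
    return [ids + [pad_id] * (longest - len(ids)) for ids in batch]

# Fixed padding wastes space when sequences are short relative to max_length
fixed = [pad_fixed(ids, 8) for ids in [[1, 2, 3], [4, 5]]]
dynamic = pad_dynamic([[1, 2, 3], [4, 5]])
# fixed rows all have length 8; dynamic rows only have length 3
```

Dynamic padding is why delegating batching to a collator or trainer tends to be more memory-efficient than padding everything to a global maximum up front.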
Design Decisions
- Labels equal input_ids: For causal LM, labels are simply a copy of `input_ids`. The model handles the one-position shift internally during loss computation, so each position is scored on predicting the next token.
- Fixed vs. dynamic padding: The dora_finetuning example uses fixed-length padding (`padding="max_length"`) during tokenization for simplicity. The SFT example delegates padding to the SFTTrainer/data collator, which pads each batch only to its longest sequence for better memory efficiency.
- Hub-first loading with fallback: The SFT utils attempt Hub loading first and fall back to local disk, enabling seamless use with both remote and local datasets.
- Chat template as preprocessing: The chat template is applied as a text transformation before tokenization, producing a flat text string that is then tokenized normally by the SFTTrainer.
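The internal one-position shift behind "labels equal input_ids" can be sketched as follows (`shifted_pairs` is a hypothetical helper showing which position/target pairs the loss actually covers):

```python
def shifted_pairs(input_ids):
    """With labels = input_ids, the logits at position t are scored against
    the label token at position t + 1; the last position has no target."""
    labels = input_ids[:]  # labels are a straight copy of input_ids
    return list(zip(range(len(input_ids) - 1), labels[1:]))

pairs = shifted_pairs([10, 11, 12, 13])
# position 0 predicts token 11, position 1 predicts 12, position 2 predicts 13
```

This is why no manual shifting is needed in the tokenize function: copying `input_ids` to `labels` is sufficient, and the model's loss computation performs the offset.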