
Implementation:Huggingface Peft Causal LM Data Pipeline

From Leeroopedia



Overview

This implementation documents the dataset preparation pattern used across PEFT causal language model examples. Two distinct patterns are demonstrated: (1) a direct tokenization pattern for simple text datasets, and (2) a chat template pattern for conversational instruction-following datasets. Both patterns produce tokenized datasets compatible with DataCollatorForLanguageModeling or SFTTrainer.

Pattern 1: Direct Tokenization (dora_finetuning example)

This pattern is used for plain text datasets where each example has a "text" field.

Source: examples/dora_finetuning/dora_finetuning.py:L92-103

from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling

# Load dataset from Hugging Face Hub or local path
dataset = load_dataset(data_path)

def tokenize_function(examples):
    inputs = tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=cutoff_len,
    )
    inputs["labels"] = inputs["input_ids"].copy()  # labels = input_ids for causal LM
    return inputs

# Tokenize the dataset with batched processing
tokenized_datasets = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=dataset["train"].column_names,
)

# Data collator for dynamic batching (mlm=False for causal LM)
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

Step-by-Step Breakdown

  1. Load dataset: load_dataset(data_path) loads from the Hugging Face Hub or a local path, returning a DatasetDict with splits.
  2. Define tokenize function: The function receives batched examples (a dict of lists), tokenizes the "text" field with fixed-length padding and truncation, and copies input_ids to labels.
  3. Apply tokenization: dataset.map() with batched=True processes examples in batches for efficiency. remove_columns drops the original text columns, leaving only the tokenized fields (input_ids, attention_mask, labels).
  4. Create data collator: DataCollatorForLanguageModeling with mlm=False handles batch assembly for causal language modeling. It replaces padding tokens in labels with -100.
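
The label-masking behavior in step 4 can be sketched in plain Python (this is an illustration of the effect, not the transformers implementation; pad_token_id=0 is an assumption for the example):

```python
# Illustrative sketch of how padding positions are excluded from the loss.
# pad_token_id = 0 is an assumption for this example.
pad_token_id = 0

batch_input_ids = [
    [5, 6, 7, 0, 0],    # padded to length 5
    [8, 9, 10, 11, 0],
]

# The collator sets labels to input_ids with pad positions replaced by -100,
# the target index that PyTorch's cross-entropy loss ignores.
labels = [
    [tok if tok != pad_token_id else -100 for tok in seq]
    for seq in batch_input_ids
]
# labels -> [[5, 6, 7, -100, -100], [8, 9, 10, 11, -100]]
```

Without this masking, the model would be trained to predict padding tokens, diluting the loss signal.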

Pattern 2: Chat Template Processing (sft/utils.py example)

This pattern is used for conversational datasets with structured message lists.

Source: examples/sft/utils.py:L48-83

import os

from datasets import DatasetDict, load_dataset, load_from_disk
from datasets.builder import DatasetGenerationError

def create_datasets(tokenizer, data_args, training_args, apply_chat_template=False):
    def preprocess(samples):
        batch = []
        for conversation in samples["messages"]:
            batch.append(tokenizer.apply_chat_template(conversation, tokenize=False))
        return {"content": batch}

    raw_datasets = DatasetDict()
    for split in data_args.splits.split(","):
        try:
            # Try loading from Hugging Face Hub
            dataset = load_dataset(data_args.dataset_name, split=split)
        except DatasetGenerationError:
            # Fall back to local disk
            dataset = load_from_disk(os.path.join(data_args.dataset_name, split))

        if "train" in split:
            raw_datasets["train"] = dataset
        elif "test" in split:
            raw_datasets["test"] = dataset

    if apply_chat_template:
        raw_datasets = raw_datasets.map(
            preprocess,
            batched=True,
            remove_columns=raw_datasets["train"].column_names,
        )

    train_data = raw_datasets["train"]
    valid_data = raw_datasets["test"]
    return train_data, valid_data

Step-by-Step Breakdown

  1. Define preprocessing function: For each conversation in the batch, tokenizer.apply_chat_template() converts the structured message list into a formatted string according to the tokenizer's configured chat template (ChatML, Zephyr, etc.).
  2. Load datasets with fallback: Attempts to load from Hub first; if that fails, falls back to loading from local disk. Handles both "train" and "test" splits.
  3. Apply chat template: When apply_chat_template=True, the structured messages are converted to flat text. The original columns are removed, leaving a single "content" column.
  4. Return split datasets: The function returns separate train and validation datasets ready for SFTTrainer consumption.
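
The batched map() contract used by preprocess() can be shown with a toy example. Here format_message is a hypothetical stand-in for tokenizer.apply_chat_template, used only to keep the sketch self-contained:

```python
# format_message is a hypothetical stand-in for tokenizer.apply_chat_template.
def format_message(conversation):
    return "".join(f"[{m['role']}] {m['content']}\n" for m in conversation)

def preprocess(samples):
    # With batched=True, samples is a dict of lists: one entry per example.
    batch = [format_message(conv) for conv in samples["messages"]]
    return {"content": batch}

samples = {
    "messages": [
        [{"role": "user", "content": "Hi"}],
        [{"role": "user", "content": "Hello"},
         {"role": "assistant", "content": "Hey"}],
    ]
}
out = preprocess(samples)
# out["content"] holds one flat formatted string per conversation
```

The key point is the shape contract: a batch of structured message lists goes in, and an equal-length list of flat strings comes out under a single "content" key.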

Chat Template Configuration

The chat template is configured on the tokenizer before dataset creation. From examples/sft/utils.py:

# ChatML template
DEFAULT_CHATML_CHAT_TEMPLATE = (
    "{% for message in messages %}\n"
    "{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}"
    "{% if loop.last and add_generation_prompt %}"
    "{{'<|im_start|>assistant\n'}}"
    "{% endif %}{% endfor %}"
)

# Zephyr template
DEFAULT_ZEPHYR_CHAT_TEMPLATE = (
    "{% for message in messages %}\n"
    "{% if message['role'] == 'user' %}\n"
    "{{ '<|user|>\n' + message['content'] + eos_token }}\n"
    "{% elif message['role'] == 'system' %}\n"
    "{{ '<|system|>\n' + message['content'] + eos_token }}\n"
    "{% elif message['role'] == 'assistant' %}\n"
    "{{ '<|assistant|>\n' + message['content'] + eos_token }}\n"
    "{% endif %}\n{% endfor %}"
)

# Apply to tokenizer
tokenizer.chat_template = chat_template
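
To make the ChatML template's effect concrete, here is a hand-rolled Python equivalent (a sketch, not the library's renderer; transformers compiles the Jinja template with trim_blocks/lstrip_blocks, so the literal newlines after block tags do not appear in the output):

```python
def render_chatml(messages, add_generation_prompt=False):
    # Hand-rolled equivalent of DEFAULT_CHATML_CHAT_TEMPLATE (sketch only).
    # Each turn renders as: <|im_start|>{role}\n{content}<|im_end|>\n
    out = ""
    for m in messages:
        out += "<|im_start|>" + m["role"] + "\n" + m["content"] + "<|im_end|>" + "\n"
    if add_generation_prompt:
        # Appended only at inference time to cue the assistant's reply.
        out += "<|im_start|>assistant\n"
    return out

messages = [
    {"role": "user", "content": "What is PEFT?"},
    {"role": "assistant", "content": "Parameter-efficient fine-tuning."},
]
text = render_chatml(messages)
```

For training data, add_generation_prompt is left False so each conversation ends with the assistant's completed turn.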

Key Parameters

  • max_length / cutoff_len — maximum sequence length for truncation. Typical values: 512, 1024, 2048, 4096.
  • padding — padding strategy. "max_length" pads every sequence to the fixed cutoff; True pads dynamically to the longest sequence in the batch.
  • truncation — whether to truncate sequences longer than max_length. Typically True.
  • batched — whether map() processes examples in batches. Typically True.
  • remove_columns — columns to drop after tokenization; usually the original text column names.
  • mlm — masked language modeling flag on DataCollatorForLanguageModeling. False for causal LM.
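
The two padding strategies from the table can be contrasted in a short sketch (pad_token_id=0 and the batch contents are assumptions for illustration):

```python
# Sketch contrasting fixed vs. dynamic padding. pad_token_id = 0 is assumed.
batch = [[5, 6, 7], [8, 9]]
max_length = 6  # the fixed cutoff, e.g. cutoff_len

# padding="max_length": every sequence is padded to the fixed cutoff
fixed = [seq + [0] * (max_length - len(seq)) for seq in batch]

# padding=True (dynamic): pad only to the longest sequence in the batch
longest = max(len(seq) for seq in batch)
dynamic = [seq + [0] * (longest - len(seq)) for seq in batch]
# fixed   -> all sequences length 6
# dynamic -> all sequences length 3
```

Dynamic padding wastes less memory and compute when sequence lengths vary widely, which is why delegating padding to the collator is generally preferred.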

Design Decisions

  • Labels equal input_ids: For causal LM, labels are simply a copy of input_ids. The model handles the one-position shift internally during loss computation.
  • Fixed vs. dynamic padding: The dora_finetuning example uses fixed-length padding (padding="max_length") during tokenization for simplicity. The SFT example delegates padding to the SFTTrainer/data collator for better memory efficiency.
  • Hub-first loading with fallback: The SFT utils attempt Hub loading first and fall back to local disk, enabling seamless use with both remote and local datasets.
  • Chat template as preprocessing: The chat template is applied as a text transformation before tokenization, producing a flat text string that is then tokenized normally by the SFTTrainer.
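
The one-position shift mentioned in the first bullet can be made explicit with a small sketch of what happens inside the model's loss computation:

```python
# labels start as an exact copy of input_ids; internally, the model pairs
# the prediction at position t with the token at position t + 1.
input_ids = [10, 11, 12, 13]
labels = input_ids.copy()

contexts = input_ids[:-1]  # positions whose predictions are scored
targets = labels[1:]       # the next tokens those positions must predict
# contexts -> [10, 11, 12]; targets -> [11, 12, 13]
```

This is why the data pipeline never needs to shift labels itself: copying input_ids verbatim is sufficient.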
