
Implementation:PacktPublishing LLM Engineers Handbook HuggingFace Load Dataset

From Leeroopedia


Field Value
Implementation Name HuggingFace Load Dataset
Type Wrapper Doc (HuggingFace datasets)
Source File llm_engineering/model/finetuning/finetune.py:L91-165
Workflow LLM_Finetuning
Repo PacktPublishing/LLM-Engineers-Handbook
Implements Principle:PacktPublishing_LLM_Engineers_Handbook_Finetuning_Dataset_Preparation

Function Signatures

# Load a single dataset from HuggingFace Hub
load_dataset(path: str, split: str) -> Dataset

# Concatenate multiple datasets into one
concatenate_datasets(datasets: list[Dataset]) -> Dataset

# Apply a transformation function to dataset rows
dataset.map(function: Callable, batched: bool = False) -> Dataset

# Split dataset into train and test partitions
dataset.train_test_split(test_size: float) -> DatasetDict

Import

from datasets import load_dataset, concatenate_datasets

Description

The dataset preparation pipeline in the fine-tuning workflow loads datasets from HuggingFace Hub, concatenates multiple sources, applies format-specific templates, and creates train/test splits. The pipeline handles two distinct data formats depending on the fine-tuning objective: Alpaca format for SFT and chosen/rejected format for DPO.

SFT Dataset Preparation

Key Code

# From llm_engineering/model/finetuning/finetune.py

# Load primary dataset
dataset = load_dataset(f"{workspace}/llmtwin", split="train")

# Load supplementary dataset (first 10,000 examples)
extra_dataset = load_dataset("mlabonne/FineTome-Alpaca-100k", split="train[:10000]")

# Concatenate both datasets
dataset = concatenate_datasets([dataset, extra_dataset])

# Create train/test split
dataset = dataset.train_test_split(test_size=0.05)

# Apply Alpaca format template via dataset.map()
# The map function transforms each row into the Alpaca template format
# producing a "text" field consumed by SFTTrainer

SFT Format Template (Alpaca)

The formatting function wraps each example into the Alpaca instruction-following template:

# Conceptual formatting applied via dataset.map()
def format_alpaca(example):
    text = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['output']}"
    )
    return {"text": text}

DPO Dataset Preparation

Key Code

# For DPO fine-tuning, the dataset requires prompt/chosen/rejected columns
dataset = load_dataset(f"{workspace}/llmtwin_dpo", split="train")

# DPO format expects:
# - "prompt": the instruction/question
# - "chosen": the preferred response
# - "rejected": the non-preferred response
dataset = dataset.train_test_split(test_size=0.05)

Parameters

Parameter Type Description
path str HuggingFace dataset identifier (e.g., "mlabonne/FineTome-Alpaca-100k") or "{workspace}/llmtwin".
split str Dataset split to load. Supports slicing syntax (e.g., "train[:10000]" for first 10K examples).
test_size float Fraction of data for the test split. Set to 0.05 (5%) in the repository.
batched bool When True, the map function receives batches of examples rather than individual rows.

Outputs

A DatasetDict with two splits:

  • train: 95% of the formatted data, used for training.
  • test: 5% of the formatted data, used for evaluation during training.

Each example contains a "text" field (for SFT) or "prompt"/"chosen"/"rejected" fields (for DPO).

Data Flow Summary

Step Operation SFT DPO
1 Load primary dataset load_dataset("{workspace}/llmtwin") load_dataset("{workspace}/llmtwin_dpo")
2 Load supplementary data load_dataset("mlabonne/FineTome-Alpaca-100k", split="train[:10000]") N/A
3 Concatenate concatenate_datasets([dataset, extra_dataset]) N/A
4 Format Apply Alpaca template via dataset.map() Use existing prompt/chosen/rejected columns
5 Split train_test_split(test_size=0.05) train_test_split(test_size=0.05)

External Dependencies

Package Purpose
datasets HuggingFace Datasets library for loading, transforming, and splitting data
