
Implementation:PacktPublishing LLM Engineers Handbook HuggingFace Load Dataset

From Leeroopedia


Field Value
Implementation Name HuggingFace Load Dataset
Type Wrapper Doc (HuggingFace datasets)
Source File llm_engineering/model/finetuning/finetune.py:L91-165
Workflow LLM_Finetuning
Repo PacktPublishing/LLM-Engineers-Handbook
Implements Principle:PacktPublishing_LLM_Engineers_Handbook_Finetuning_Dataset_Preparation

Function Signatures

# Load a single dataset from HuggingFace Hub
load_dataset(path: str, split: str) -> Dataset

# Concatenate multiple datasets into one
concatenate_datasets(datasets: list[Dataset]) -> Dataset

# Apply a transformation function to dataset rows
dataset.map(function: Callable, batched: bool = False) -> Dataset

# Split dataset into train and test partitions
dataset.train_test_split(test_size: float) -> DatasetDict

Import

from datasets import load_dataset, concatenate_datasets

Description

The dataset preparation pipeline in the fine-tuning workflow loads datasets from HuggingFace Hub, concatenates multiple sources, applies format-specific templates, and creates train/test splits. The pipeline handles two distinct data formats depending on the fine-tuning objective: Alpaca format for SFT and chosen/rejected format for DPO.

SFT Dataset Preparation

Key Code

# From llm_engineering/model/finetuning/finetune.py

# Load primary dataset
dataset = load_dataset(f"{workspace}/llmtwin", split="train")

# Load supplementary dataset (first 10,000 examples)
extra_dataset = load_dataset("mlabonne/FineTome-Alpaca-100k", split="train[:10000]")

# Concatenate both datasets
dataset = concatenate_datasets([dataset, extra_dataset])

# Create train/test split
dataset = dataset.train_test_split(test_size=0.05)

# Apply Alpaca format template via dataset.map()
# The map function transforms each row into the Alpaca template format
# producing a "text" field consumed by SFTTrainer

SFT Format Template (Alpaca)

The formatting function wraps each example into the Alpaca instruction-following template:

# Conceptual formatting applied via dataset.map()
def format_alpaca(example):
    text = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['output']}"
    )
    return {"text": text}

DPO Dataset Preparation

Key Code

# For DPO fine-tuning, the dataset requires prompt/chosen/rejected columns
dataset = load_dataset(f"{workspace}/llmtwin_dpo", split="train")

# DPO format expects:
# - "prompt": the instruction/question
# - "chosen": the preferred response
# - "rejected": the non-preferred response
dataset = dataset.train_test_split(test_size=0.05)

Parameters

Parameter Type Description
path str HuggingFace dataset identifier (e.g., "mlabonne/FineTome-Alpaca-100k") or "{workspace}/llmtwin".
split str Dataset split to load. Supports slicing syntax (e.g., "train[:10000]" for first 10K examples).
test_size float Fraction of data for the test split. Set to 0.05 (5%) in the repository.
batched bool When True, the map function receives batches of examples rather than individual rows.

Outputs

A DatasetDict with two splits:

  • train: 95% of the formatted data, used for training.
  • test: 5% of the formatted data, used for evaluation during training.

Each example contains a "text" field (for SFT) or "prompt"/"chosen"/"rejected" fields (for DPO).

Data Flow Summary

Step Operation SFT DPO
1 Load primary dataset load_dataset("{workspace}/llmtwin") load_dataset("{workspace}/llmtwin_dpo")
2 Load supplementary data load_dataset("mlabonne/FineTome-Alpaca-100k", split="train[:10000]") N/A
3 Concatenate concatenate_datasets([dataset, extra_dataset]) N/A
4 Format Apply Alpaca template via dataset.map() Use existing prompt/chosen/rejected columns
5 Split train_test_split(test_size=0.05) train_test_split(test_size=0.05)

External Dependencies

Package Purpose
datasets HuggingFace Datasets library for loading, transforming, and splitting data
