Implementation:PacktPublishing LLM Engineers Handbook HuggingFace Load Dataset
| Field | Value |
|---|---|
| Implementation Name | HuggingFace Load Dataset |
| Type | Wrapper Doc (HuggingFace datasets) |
| Source File | llm_engineering/model/finetuning/finetune.py:L91-165 |
| Workflow | LLM_Finetuning |
| Repo | PacktPublishing/LLM-Engineers-Handbook |
| Implements | Principle:PacktPublishing_LLM_Engineers_Handbook_Finetuning_Dataset_Preparation |
Function Signatures
```python
# Load a single dataset from HuggingFace Hub
load_dataset(path: str, split: str) -> Dataset

# Concatenate multiple datasets into one
concatenate_datasets(datasets: list[Dataset]) -> Dataset

# Apply a transformation function to dataset rows
# (note: batched defaults to False in the datasets library)
dataset.map(function: Callable, batched: bool = False) -> Dataset

# Split dataset into train and test partitions
dataset.train_test_split(test_size: float) -> DatasetDict
```
Import
```python
from datasets import load_dataset, concatenate_datasets
```
Description
The dataset preparation pipeline in the fine-tuning workflow loads datasets from HuggingFace Hub, concatenates multiple sources, applies format-specific templates, and creates train/test splits. The pipeline handles two distinct data formats depending on the fine-tuning objective: the Alpaca instruction format for supervised fine-tuning (SFT), and the chosen/rejected preference format for Direct Preference Optimization (DPO).
SFT Dataset Preparation
Key Code
```python
# From llm_engineering/model/finetuning/finetune.py

# Load primary dataset
dataset = load_dataset(f"{workspace}/llmtwin", split="train")

# Load supplementary dataset (first 10,000 examples)
extra_dataset = load_dataset("mlabonne/FineTome-Alpaca-100k", split="train[:10000]")

# Concatenate both datasets
dataset = concatenate_datasets([dataset, extra_dataset])

# Create train/test split
dataset = dataset.train_test_split(test_size=0.05)

# Apply Alpaca format template via dataset.map()
# The map function transforms each row into the Alpaca template format,
# producing a "text" field consumed by SFTTrainer
```
SFT Format Template (Alpaca)
The formatting function wraps each example into the Alpaca instruction-following template:
```python
# Conceptual formatting applied via dataset.map()
def format_alpaca(example):
    text = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['output']}"
    )
    return {"text": text}
```
DPO Dataset Preparation
Key Code
```python
# For DPO fine-tuning, the dataset requires prompt/chosen/rejected columns
dataset = load_dataset(f"{workspace}/llmtwin_dpo", split="train")

# DPO format expects:
# - "prompt": the instruction/question
# - "chosen": the preferred response
# - "rejected": the non-preferred response
dataset = dataset.train_test_split(test_size=0.05)
```
Parameters
| Parameter | Type | Description |
|---|---|---|
path |
str |
HuggingFace dataset identifier (e.g., "mlabonne/FineTome-Alpaca-100k") or "{workspace}/llmtwin".
|
split |
str |
Dataset split to load. Supports slicing syntax (e.g., "train[:10000]" for first 10K examples).
|
test_size |
float |
Fraction of data for the test split. Set to 0.05 (5%) in the repository.
|
batched |
bool |
When True, the map function receives batches of examples rather than individual rows.
|
Outputs
A DatasetDict with two splits:
- `train`: 95% of the formatted data, used for training.
- `test`: 5% of the formatted data, used for evaluation during training.
Each example contains a "text" field (for SFT) or "prompt"/"chosen"/"rejected" fields (for DPO).
Data Flow Summary
| Step | Operation | SFT | DPO |
|---|---|---|---|
| 1 | Load primary dataset | `load_dataset("{workspace}/llmtwin")` | `load_dataset("{workspace}/llmtwin_dpo")` |
| 2 | Load supplementary data | `load_dataset("mlabonne/FineTome-Alpaca-100k", split="train[:10000]")` | N/A |
| 3 | Concatenate | `concatenate_datasets([dataset, extra_dataset])` | N/A |
| 4 | Format | Apply Alpaca template via `dataset.map()` | Use existing prompt/chosen/rejected columns |
| 5 | Split | `train_test_split(test_size=0.05)` | `train_test_split(test_size=0.05)` |
External Dependencies
| Package | Purpose |
|---|---|
| `datasets` | HuggingFace Datasets library for loading, transforming, and splitting data |