Implementation:PacktPublishing LLM Engineers Handbook TrainTestSplit To Huggingface
| Aspect | Detail |
|---|---|
| API | `TrainTestSplit.to_huggingface(flatten: bool = False) -> DatasetDict`, then `DatasetDict.push_to_hub(repo_id: str)` |
| Source | `llm_engineering/domain/dataset.py:L61-72` (`to_huggingface`), `steps/generate_datasets/push_to_huggingface.py:L1-22` (ZenML step) |
| Type | Wrapper Doc |
| Implements | Principle:PacktPublishing_LLM_Engineers_Handbook_HuggingFace_Dataset_Publishing |
Summary
The to_huggingface method on TrainTestSplit converts the internal domain dataset representation into a HuggingFace DatasetDict with "train" and "test" splits. It supports both flattened mode (all categories concatenated into single datasets) and structured mode (categories preserved as separate entries). The resulting DatasetDict can then be uploaded to HuggingFace Hub via push_to_hub().
Source Code
Conversion Method
```python
def to_huggingface(self, flatten: bool = False) -> DatasetDict:
    train_datasets = {
        category.value: dataset.to_huggingface()
        for category, dataset in self.train.items()
    }
    test_datasets = {
        category.value: dataset.to_huggingface()
        for category, dataset in self.test.items()
    }
    if flatten:
        train_datasets = concatenate_datasets(list(train_datasets.values()))
        test_datasets = concatenate_datasets(list(test_datasets.values()))
    else:
        train_datasets = Dataset.from_dict(train_datasets)
        test_datasets = Dataset.from_dict(test_datasets)
    return DatasetDict({"train": train_datasets, "test": test_datasets})
```
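The branching between flattened and structured output can be sketched with plain Python lists standing in for the Arrow-backed `Dataset` objects; the category names and rows below are purely illustrative, not the book's actual data:

```python
from itertools import chain

# Toy stand-in: each category maps to a list of row dicts; the real method
# maps categories to Arrow-backed Dataset objects instead.
train = {
    "articles": [{"instruction": "q1", "answer": "a1"}],
    "posts": [{"instruction": "q2", "answer": "a2"},
              {"instruction": "q3", "answer": "a3"}],
}

# flatten=True path: merge every category into one flat dataset per split,
# analogous to concatenate_datasets(list(train.values()))
flat_train = list(chain.from_iterable(train.values()))

# flatten=False path: keep one entry per category
structured_train = {category: rows for category, rows in train.items()}

print(len(flat_train))           # 3 rows in a single flat split
print(sorted(structured_train))  # ['articles', 'posts']
```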
ZenML Pipeline Step
```python
@step
def push_to_huggingface(
    dataset: TrainTestSplit,
    repo_id: str,
) -> None:
    hf_dataset = dataset.to_huggingface(flatten=True)
    hf_dataset.push_to_hub(repo_id)
```
Import
```python
from llm_engineering.domain.dataset import TrainTestSplit
from datasets import Dataset, DatasetDict, concatenate_datasets
```
Parameters
to_huggingface
| Parameter | Type | Default | Description |
|---|---|---|---|
| `flatten` | `bool` | `False` | When `True`, concatenates all categories into single train/test datasets. When `False`, preserves category structure. |
push_to_hub
| Parameter | Type | Default | Description |
|---|---|---|---|
| `repo_id` | `str` | (required) | HuggingFace Hub repository identifier (e.g., `"username/dataset-name"`) |
Return Value
| Method | Return Type | Description |
|---|---|---|
| `to_huggingface` | `DatasetDict` | A HuggingFace `DatasetDict` with `"train"` and `"test"` keys, each containing a `Dataset` object |
| `push_to_hub` | `None` | Uploads the dataset to HuggingFace Hub (side effect) |
Behavior
Conversion (to_huggingface)
- Per-category conversion -- Iterates over the `self.train` and `self.test` dictionaries, calling each category dataset's own `to_huggingface()` method to convert samples from pydantic models to Arrow-backed `Dataset` objects.
- Flattening decision:
  - If `flatten=True`: Uses `concatenate_datasets()` to merge all category datasets into a single `Dataset` per split. This is the typical choice for fine-tuning.
  - If `flatten=False`: Preserves categories as separate entries using `Dataset.from_dict()`.
- DatasetDict construction -- Wraps the train and test datasets in a `DatasetDict` with standard `"train"` and `"test"` keys.
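The two-level conversion can be sketched with a hypothetical category dataset: each sample serializes itself to a dict, and the category level aggregates those rows into the column-oriented mapping that `Dataset.from_dict()` consumes. The class and field names here are illustrative stand-ins, not the book's actual pydantic models:

```python
from dataclasses import dataclass, asdict

# Hypothetical stand-in for a pydantic sample model
@dataclass
class InstructSample:
    instruction: str
    answer: str

# Hypothetical category-level dataset: aggregates per-sample dicts into
# the column-oriented shape that datasets.Dataset.from_dict() accepts
class CategoryDataset:
    def __init__(self, samples):
        self.samples = samples

    def to_huggingface(self) -> dict:
        rows = [asdict(s) for s in self.samples]
        return {key: [row[key] for row in rows] for key in rows[0]}

ds = CategoryDataset([InstructSample("q1", "a1"), InstructSample("q2", "a2")])
print(ds.to_huggingface())
# {'instruction': ['q1', 'q2'], 'answer': ['a1', 'a2']}
```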
Upload (push_to_hub)
The DatasetDict.push_to_hub(repo_id) method (from the HuggingFace datasets library):
- Creates the repository on HuggingFace Hub if it does not exist
- Serializes the data to Parquet format
- Uploads all files to the repository
- Generates a dataset card with metadata
ZenML Step Integration
The ZenML step push_to_huggingface:
- Receives the `TrainTestSplit` artifact from the upstream generation step
- Calls `to_huggingface(flatten=True)` to produce a flat `DatasetDict`
- Calls `push_to_hub(repo_id)` to upload to the specified repository
Usage Example
```python
from llm_engineering.domain.dataset import InstructTrainTestSplit

# Assume split was created by the generation pipeline
split: InstructTrainTestSplit = ...

# Convert to HuggingFace format (flattened for training)
hf_dataset = split.to_huggingface(flatten=True)

# Inspect the dataset
print(hf_dataset)
# DatasetDict({
#     train: Dataset({features: ['instruction', 'answer'], num_rows: 800})
#     test: Dataset({features: ['instruction', 'answer'], num_rows: 200})
# })

# Upload to HuggingFace Hub
hf_dataset.push_to_hub("my-org/instruction-dataset-v1")
```
External Dependencies
| Dependency | Purpose |
|---|---|
| `datasets` (HuggingFace) | `Dataset`, `DatasetDict`, `concatenate_datasets` for format conversion and Hub upload |
| `zenml` | Pipeline step decorator and artifact management for the upload step |
Design Notes
- Two-level conversion -- Each category dataset has its own `to_huggingface()` method that handles sample-level serialization, while the `TrainTestSplit.to_huggingface()` method handles the aggregation. This separation of concerns keeps each level simple.
- Flatten as default for training -- The ZenML step uses `flatten=True` because most fine-tuning frameworks expect a single train/test split rather than category-segmented data.
- Hub authentication -- The `push_to_hub` call requires a valid HuggingFace token, typically set via the `HF_TOKEN` environment variable or `huggingface-cli login`.
- Idempotent uploads -- Repeated calls to `push_to_hub` with the same `repo_id` update the existing repository rather than creating duplicates.
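Concretely, the authentication requirement can be satisfied in either of two ways (the token value below is a placeholder, not a real credential):

```shell
# Option 1: export a write-scoped token before running the pipeline
export HF_TOKEN=hf_xxxxxxxxxxxx

# Option 2: log in interactively; the token is cached locally and
# picked up automatically by push_to_hub
huggingface-cli login
```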
See Also
- Principle:PacktPublishing_LLM_Engineers_Handbook_HuggingFace_Dataset_Publishing -- The principle this implementation realizes
- Implementation:PacktPublishing_LLM_Engineers_Handbook_Create_Train_Test_Split -- The preceding step that produces the train/test split
- Environment:PacktPublishing_LLM_Engineers_Handbook_API_Credentials