
Implementation:PacktPublishing LLM Engineers Handbook TrainTestSplit To Huggingface

From Leeroopedia


Aspect Detail
API TrainTestSplit.to_huggingface(flatten: bool = False) -> DatasetDict, then DatasetDict.push_to_hub(repo_id: str)
Source llm_engineering/domain/dataset.py:L61-72 (to_huggingface), steps/generate_datasets/push_to_huggingface.py:L1-22 (ZenML step)
Type Wrapper Doc
Implements Principle:PacktPublishing_LLM_Engineers_Handbook_HuggingFace_Dataset_Publishing

Summary

The to_huggingface method on TrainTestSplit converts the internal domain dataset representation into a HuggingFace DatasetDict with "train" and "test" splits. It supports both flattened mode (all categories concatenated into single datasets) and structured mode (categories preserved as separate entries). The resulting DatasetDict can then be uploaded to HuggingFace Hub via push_to_hub().

Source Code

Conversion Method

def to_huggingface(self, flatten: bool = False) -> DatasetDict:
    # Convert each category's domain dataset into a HuggingFace Dataset
    train_datasets = {
        category.value: dataset.to_huggingface()
        for category, dataset in self.train.items()
    }
    test_datasets = {
        category.value: dataset.to_huggingface()
        for category, dataset in self.test.items()
    }

    if flatten:
        # Merge all category datasets into a single Dataset per split
        train_datasets = concatenate_datasets(list(train_datasets.values()))
        test_datasets = concatenate_datasets(list(test_datasets.values()))
    else:
        # Preserve categories as separate entries keyed by category name
        train_datasets = Dataset.from_dict(train_datasets)
        test_datasets = Dataset.from_dict(test_datasets)

    return DatasetDict({"train": train_datasets, "test": test_datasets})
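The per-category to_huggingface() calls above delegate sample-level serialization to each category dataset. As a rough illustration of what that lower level does, the sketch below converts row-oriented sample objects into the column-oriented mapping that datasets.Dataset.from_dict expects; InstructSample and samples_to_columns are hypothetical stand-ins, not names from the book's codebase.

```python
# Hedged sketch: turning row-oriented domain samples into the
# column-oriented dict that datasets.Dataset.from_dict consumes.
# `InstructSample` and `samples_to_columns` are illustrative names only.
from dataclasses import dataclass


@dataclass
class InstructSample:
    instruction: str
    answer: str


def samples_to_columns(samples: list[InstructSample]) -> dict[str, list[str]]:
    # Row-major list of samples -> column-major dict of lists
    return {
        "instruction": [s.instruction for s in samples],
        "answer": [s.answer for s in samples],
    }


cols = samples_to_columns([InstructSample("Q1", "A1"), InstructSample("Q2", "A2")])
print(cols)
```

In the real code, the resulting column mapping would then be handed to Dataset.from_dict to produce an Arrow-backed Dataset.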

ZenML Pipeline Step

@step
def push_to_huggingface(
    dataset: TrainTestSplit,
    repo_id: str,
) -> None:
    # Flatten all categories into single train/test splits, then upload
    hf_dataset = dataset.to_huggingface(flatten=True)
    hf_dataset.push_to_hub(repo_id)

Import

from llm_engineering.domain.dataset import TrainTestSplit
from datasets import Dataset, DatasetDict, concatenate_datasets

Parameters

to_huggingface

Parameter Type Default Description
flatten bool False When True, concatenates all categories into single train/test datasets. When False, preserves category structure.

push_to_hub

Parameter Type Default Description
repo_id str (required) HuggingFace Hub repository identifier (e.g., "username/dataset-name")

Return Value

Method Return Type Description
to_huggingface DatasetDict A HuggingFace DatasetDict with "train" and "test" keys, each containing a Dataset object
push_to_hub None Uploads the dataset to HuggingFace Hub (side effect)

Behavior

Conversion (to_huggingface)

  1. Per-category conversion -- Iterates over self.train and self.test dictionaries, calling each category dataset's own to_huggingface() method to convert samples from pydantic models to Arrow-backed Dataset objects.
  2. Flattening decision:
    • If flatten=True: Uses concatenate_datasets() to merge all category datasets into a single Dataset per split. This is the typical choice for fine-tuning.
    • If flatten=False: Preserves categories as separate entries using Dataset.from_dict().
  3. DatasetDict construction -- Wraps the train and test datasets in a DatasetDict with standard "train" and "test" keys.
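The flattening decision can be illustrated with a pure-Python sketch in which plain lists stand in for HuggingFace Dataset objects; build_split and the category names are hypothetical, and real code would use concatenate_datasets instead of list extension.

```python
# Hedged sketch of the flatten=True / flatten=False decision.
# Plain lists stand in for Dataset objects; `build_split` is illustrative.
def build_split(per_category: dict[str, list], flatten: bool = False):
    if flatten:
        # flatten=True: concatenate every category into one flat dataset
        merged = []
        for rows in per_category.values():
            merged.extend(rows)
        return merged
    # flatten=False: keep each category as its own named entry
    return dict(per_category)


train = {"articles": [{"instruction": "a"}], "posts": [{"instruction": "b"}]}
print(build_split(train, flatten=True))   # one flat list of rows
print(build_split(train, flatten=False))  # per-category mapping preserved
```

This mirrors why fine-tuning pipelines typically pass flatten=True: trainers expect one homogeneous train split rather than a category-keyed structure.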

Upload (push_to_hub)

The DatasetDict.push_to_hub(repo_id) method (from the HuggingFace datasets library):

  1. Creates the repository on HuggingFace Hub if it does not exist
  2. Serializes the data to Parquet format
  3. Uploads all files to the repository
  4. Generates a dataset card with metadata

ZenML Step Integration

The ZenML step push_to_huggingface:

  1. Receives the TrainTestSplit artifact from the upstream generation step
  2. Calls to_huggingface(flatten=True) to produce a flat DatasetDict
  3. Calls push_to_hub(repo_id) to upload to the specified repository
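The step's control flow can be exercised without ZenML or the datasets library by substituting lightweight stubs for the artifact and the DatasetDict. The FakeSplit and FakeDatasetDict classes below are test doubles invented for this sketch (the real step returns None; this version returns the dataset only so the behavior can be inspected).

```python
# Hedged sketch of the step's control flow with stub objects.
# `FakeSplit` / `FakeDatasetDict` are illustrative test doubles.
class FakeDatasetDict:
    def __init__(self):
        self.pushed_to = None

    def push_to_hub(self, repo_id: str) -> None:
        # Records the target repo instead of uploading anything
        self.pushed_to = repo_id


class FakeSplit:
    def to_huggingface(self, flatten: bool = False) -> FakeDatasetDict:
        self.flatten = flatten
        return FakeDatasetDict()


def push_to_huggingface(dataset: FakeSplit, repo_id: str) -> FakeDatasetDict:
    # Same two calls as the ZenML step: flatten, then upload
    hf_dataset = dataset.to_huggingface(flatten=True)
    hf_dataset.push_to_hub(repo_id)
    return hf_dataset  # returned here only for inspection


split = FakeSplit()
result = push_to_huggingface(split, "my-org/instruction-dataset-v1")
print(result.pushed_to)
```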

Usage Example

from llm_engineering.domain.dataset import InstructTrainTestSplit

# Assume split was created by the generation pipeline
split: InstructTrainTestSplit = ...

# Convert to HuggingFace format (flattened for training)
hf_dataset = split.to_huggingface(flatten=True)

# Inspect the dataset
print(hf_dataset)
# DatasetDict({
#     train: Dataset({features: ['instruction', 'answer'], num_rows: 800})
#     test: Dataset({features: ['instruction', 'answer'], num_rows: 200})
# })

# Upload to HuggingFace Hub
hf_dataset.push_to_hub("my-org/instruction-dataset-v1")

External Dependencies

Dependency Purpose
datasets (HuggingFace) Dataset, DatasetDict, concatenate_datasets for format conversion and hub upload
zenml Pipeline step decorator and artifact management for the upload step

Design Notes

  • Two-level conversion -- Each category dataset has its own to_huggingface() method that handles sample-level serialization, while the TrainTestSplit.to_huggingface() method handles the aggregation. This separation of concerns keeps each level simple.
  • Flatten as default for training -- The ZenML step uses flatten=True because most fine-tuning frameworks expect a single train/test split rather than category-segmented data.
  • Hub authentication -- The push_to_hub call requires a valid HuggingFace token, typically set via the HF_TOKEN environment variable or huggingface-cli login.
  • Idempotent uploads -- Repeated calls to push_to_hub with the same repo_id update the existing repository rather than creating duplicates.
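The authentication note above can be checked before running the pipeline: push_to_hub resolves a token from the environment or from cached huggingface-cli login credentials. The snippet below only inspects the HF_TOKEN environment variable (the value shown is a placeholder, not a real token), and hub_token_available is an illustrative helper, not part of the book's code.

```python
# Hedged sketch: verifying that a Hub token is visible in the
# environment before calling push_to_hub. Placeholder value only.
import os

os.environ.setdefault("HF_TOKEN", "hf_placeholder_token")


def hub_token_available() -> bool:
    # huggingface_hub would also fall back to the credentials cached by
    # `huggingface-cli login`; this check covers only the env variable.
    return bool(os.environ.get("HF_TOKEN"))


print(hub_token_available())
```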
