
Implementation:PacktPublishing LLM Engineers Handbook TrainTestSplit To Huggingface

From Leeroopedia


Aspect Detail
API TrainTestSplit.to_huggingface(flatten: bool = False) -> DatasetDict, then DatasetDict.push_to_hub(repo_id: str)
Source llm_engineering/domain/dataset.py:L61-72 (to_huggingface), steps/generate_datasets/push_to_huggingface.py:L1-22 (ZenML step)
Type Wrapper Doc
Implements Principle:PacktPublishing_LLM_Engineers_Handbook_HuggingFace_Dataset_Publishing

Summary

The to_huggingface method on TrainTestSplit converts the internal domain dataset representation into a HuggingFace DatasetDict with "train" and "test" splits. It supports both flattened mode (all categories concatenated into single datasets) and structured mode (categories preserved as separate entries). The resulting DatasetDict can then be uploaded to HuggingFace Hub via push_to_hub().

Source Code

Conversion Method

def to_huggingface(self, flatten: bool = False) -> DatasetDict:
    # Convert each category's domain dataset into a HuggingFace Dataset
    train_datasets = {
        category.value: dataset.to_huggingface()
        for category, dataset in self.train.items()
    }
    test_datasets = {
        category.value: dataset.to_huggingface()
        for category, dataset in self.test.items()
    }

    if flatten:
        # Merge all category datasets into a single Dataset per split
        train_datasets = concatenate_datasets(list(train_datasets.values()))
        test_datasets = concatenate_datasets(list(test_datasets.values()))
    else:
        # Preserve categories as separate entries keyed by category name
        train_datasets = Dataset.from_dict(train_datasets)
        test_datasets = Dataset.from_dict(test_datasets)

    return DatasetDict({"train": train_datasets, "test": test_datasets})
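The per-category to_huggingface() calls above delegate sample-level serialization to each category dataset. As a rough illustration of what that lower level does, the sketch below converts row-oriented sample objects into the column-oriented mapping that datasets.Dataset.from_dict expects; InstructSample and samples_to_columns are hypothetical stand-ins, not names from the book's codebase.

```python
# Hedged sketch: turning row-oriented domain samples into the
# column-oriented dict that datasets.Dataset.from_dict consumes.
# `InstructSample` and `samples_to_columns` are illustrative names only.
from dataclasses import dataclass


@dataclass
class InstructSample:
    instruction: str
    answer: str


def samples_to_columns(samples: list[InstructSample]) -> dict[str, list[str]]:
    # Row-major list of samples -> column-major dict of lists
    return {
        "instruction": [s.instruction for s in samples],
        "answer": [s.answer for s in samples],
    }


cols = samples_to_columns([InstructSample("Q1", "A1"), InstructSample("Q2", "A2")])
print(cols)
```

In the real code, the resulting column mapping would then be handed to Dataset.from_dict to produce an Arrow-backed Dataset.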

ZenML Pipeline Step

@step
def push_to_huggingface(
    dataset: TrainTestSplit,
    repo_id: str,
) -> None:
    # Flatten all categories into single train/test splits, then upload
    hf_dataset = dataset.to_huggingface(flatten=True)
    hf_dataset.push_to_hub(repo_id)

Import

from llm_engineering.domain.dataset import TrainTestSplit
from datasets import Dataset, DatasetDict, concatenate_datasets

Parameters

to_huggingface

Parameter Type Default Description
flatten bool False When True, concatenates all categories into single train/test datasets. When False, preserves category structure.

push_to_hub

Parameter Type Default Description
repo_id str (required) HuggingFace Hub repository identifier (e.g., "username/dataset-name")

Return Value

Method Return Type Description
to_huggingface DatasetDict A HuggingFace DatasetDict with "train" and "test" keys, each containing a Dataset object
push_to_hub None Uploads the dataset to HuggingFace Hub (side effect)

Behavior

Conversion (to_huggingface)

  1. Per-category conversion -- Iterates over self.train and self.test dictionaries, calling each category dataset's own to_huggingface() method to convert samples from pydantic models to Arrow-backed Dataset objects.
  2. Flattening decision:
    • If flatten=True: Uses concatenate_datasets() to merge all category datasets into a single Dataset per split. This is the typical choice for fine-tuning.
    • If flatten=False: Preserves categories as separate entries using Dataset.from_dict().
  3. DatasetDict construction -- Wraps the train and test datasets in a DatasetDict with standard "train" and "test" keys.
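The flattening decision can be illustrated with a pure-Python sketch in which plain lists stand in for HuggingFace Dataset objects; build_split and the category names are hypothetical, and real code would use concatenate_datasets instead of list extension.

```python
# Hedged sketch of the flatten=True / flatten=False decision.
# Plain lists stand in for Dataset objects; `build_split` is illustrative.
def build_split(per_category: dict[str, list], flatten: bool = False):
    if flatten:
        # flatten=True: concatenate every category into one flat dataset
        merged = []
        for rows in per_category.values():
            merged.extend(rows)
        return merged
    # flatten=False: keep each category as its own named entry
    return dict(per_category)


train = {"articles": [{"instruction": "a"}], "posts": [{"instruction": "b"}]}
print(build_split(train, flatten=True))   # one flat list of rows
print(build_split(train, flatten=False))  # per-category mapping preserved
```

This mirrors why fine-tuning pipelines typically pass flatten=True: trainers expect one homogeneous train split rather than a category-keyed structure.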

Upload (push_to_hub)

The DatasetDict.push_to_hub(repo_id) method (from the HuggingFace datasets library):

  1. Creates the repository on HuggingFace Hub if it does not exist
  2. Serializes the data to Parquet format
  3. Uploads all files to the repository
  4. Generates a dataset card with metadata

ZenML Step Integration

The ZenML step push_to_huggingface:

  1. Receives the TrainTestSplit artifact from the upstream generation step
  2. Calls to_huggingface(flatten=True) to produce a flat DatasetDict
  3. Calls push_to_hub(repo_id) to upload to the specified repository
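The step's control flow can be exercised without ZenML or the datasets library by substituting lightweight stubs for the artifact and the DatasetDict. The FakeSplit and FakeDatasetDict classes below are test doubles invented for this sketch (the real step returns None; this version returns the dataset only so the behavior can be inspected).

```python
# Hedged sketch of the step's control flow with stub objects.
# `FakeSplit` / `FakeDatasetDict` are illustrative test doubles.
class FakeDatasetDict:
    def __init__(self):
        self.pushed_to = None

    def push_to_hub(self, repo_id: str) -> None:
        # Records the target repo instead of uploading anything
        self.pushed_to = repo_id


class FakeSplit:
    def to_huggingface(self, flatten: bool = False) -> FakeDatasetDict:
        self.flatten = flatten
        return FakeDatasetDict()


def push_to_huggingface(dataset: FakeSplit, repo_id: str) -> FakeDatasetDict:
    # Same two calls as the ZenML step: flatten, then upload
    hf_dataset = dataset.to_huggingface(flatten=True)
    hf_dataset.push_to_hub(repo_id)
    return hf_dataset  # returned here only for inspection


split = FakeSplit()
result = push_to_huggingface(split, "my-org/instruction-dataset-v1")
print(result.pushed_to)
```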

Usage Example

from llm_engineering.domain.dataset import InstructTrainTestSplit

# Assume split was created by the generation pipeline
split: InstructTrainTestSplit = ...

# Convert to HuggingFace format (flattened for training)
hf_dataset = split.to_huggingface(flatten=True)

# Inspect the dataset
print(hf_dataset)
# DatasetDict({
#     train: Dataset({features: ['instruction', 'answer'], num_rows: 800})
#     test: Dataset({features: ['instruction', 'answer'], num_rows: 200})
# })

# Upload to HuggingFace Hub
hf_dataset.push_to_hub("my-org/instruction-dataset-v1")

External Dependencies

Dependency Purpose
datasets (HuggingFace) Dataset, DatasetDict, concatenate_datasets for format conversion and hub upload
zenml Pipeline step decorator and artifact management for the upload step

Design Notes

  • Two-level conversion -- Each category dataset has its own to_huggingface() method that handles sample-level serialization, while the TrainTestSplit.to_huggingface() method handles the aggregation. This separation of concerns keeps each level simple.
  • Flatten as default for training -- The ZenML step uses flatten=True because most fine-tuning frameworks expect a single train/test split rather than category-segmented data.
  • Hub authentication -- The push_to_hub call requires a valid HuggingFace token, typically set via the HF_TOKEN environment variable or huggingface-cli login.
  • Idempotent uploads -- Repeated calls to push_to_hub with the same repo_id update the existing repository rather than creating duplicates.
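The authentication note above can be checked before running the pipeline: push_to_hub resolves a token from the environment or from cached huggingface-cli login credentials. The snippet below only inspects the HF_TOKEN environment variable (the value shown is a placeholder, not a real token), and hub_token_available is an illustrative helper, not part of the book's code.

```python
# Hedged sketch: verifying that a Hub token is visible in the
# environment before calling push_to_hub. Placeholder value only.
import os

os.environ.setdefault("HF_TOKEN", "hf_placeholder_token")


def hub_token_available() -> bool:
    # huggingface_hub would also fall back to the credentials cached by
    # `huggingface-cli login`; this check covers only the env variable.
    return bool(os.environ.get("HF_TOKEN"))


print(hub_token_available())
```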
