Implementation:PacktPublishing LLM Engineers Handbook Create Train Test Split

From Leeroopedia
Revision as of 16:17, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/PacktPublishing_LLM_Engineers_Handbook_Create_Train_Test_Split.md)


| Aspect | Detail |
|---|---|
| API | create_instruct_train_test_split(data, test_size=0.2, random_state=42) -> InstructTrainTestSplit and create_preference_train_test_split(data, test_size=0.2, random_state=42) -> PreferenceTrainTestSplit |
| Source | llm_engineering/application/dataset/utils.py:L16-71 |
| Type | API Doc |
| Implements | Principle:PacktPublishing_LLM_Engineers_Handbook_Dataset_Splitting |

Summary

The create_instruct_train_test_split and create_preference_train_test_split functions partition generated datasets into reproducible train/test splits. For each category they serialize the samples to dictionaries, apply sklearn's train_test_split with the given test_size and random_state (defaults 0.2 and 42), and rebuild typed sample objects from the result. Each function returns a typed container (InstructTrainTestSplit or PreferenceTrainTestSplit) holding the per-category train and test datasets together with the test_split_size that was used.

Source Code

Instruction Dataset Split

def create_instruct_train_test_split(
    data: dict[DataCategory, InstructDataset],
    test_size=0.2,
    random_state=42,
) -> InstructTrainTestSplit:
    train_data = {}
    test_data = {}

    for category, dataset in data.items():
        samples = dataset.samples
        samples_dicts = [sample.model_dump() for sample in samples]

        if len(samples_dicts) > 0:
            train_samples_dicts, test_samples_dicts = train_test_split(
                samples_dicts, test_size=test_size, random_state=random_state
            )
            train_samples = [
                InstructDatasetSample(**d) for d in train_samples_dicts
            ]
            test_samples = [
                InstructDatasetSample(**d) for d in test_samples_dicts
            ]
        else:
            train_samples, test_samples = [], []

        train_data[category] = InstructDataset(
            category=category, samples=train_samples
        )
        test_data[category] = InstructDataset(
            category=category, samples=test_samples
        )

    return InstructTrainTestSplit(
        train=train_data, test=test_data, test_split_size=test_size
    )

Preference Dataset Split

The preference variant follows the same structure but operates on PreferenceDataset and PreferenceDatasetSample types:

def create_preference_train_test_split(
    data: dict[DataCategory, PreferenceDataset],
    test_size=0.2,
    random_state=42,
) -> PreferenceTrainTestSplit:
    train_data = {}
    test_data = {}

    for category, dataset in data.items():
        samples = dataset.samples
        samples_dicts = [sample.model_dump() for sample in samples]

        if len(samples_dicts) > 0:
            train_samples_dicts, test_samples_dicts = train_test_split(
                samples_dicts, test_size=test_size, random_state=random_state
            )
            train_samples = [
                PreferenceDatasetSample(**d) for d in train_samples_dicts
            ]
            test_samples = [
                PreferenceDatasetSample(**d) for d in test_samples_dicts
            ]
        else:
            train_samples, test_samples = [], []

        train_data[category] = PreferenceDataset(
            category=category, samples=train_samples
        )
        test_data[category] = PreferenceDataset(
            category=category, samples=test_samples
        )

    return PreferenceTrainTestSplit(
        train=train_data, test=test_data, test_split_size=test_size
    )

Import

from llm_engineering.application.dataset.utils import (
    create_instruct_train_test_split,
    create_preference_train_test_split,
)

Parameters

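| Parameter | Type | Default | Description |
|---|---|---|---|
| data | dict[DataCategory, InstructDataset] or dict[DataCategory, PreferenceDataset] | (required) | Generated datasets grouped by data category |
| test_size | float | 0.2 | Fraction of each category's samples to allocate to the test set; passed straight to sklearn, which expects a float strictly between 0.0 and 1.0 |
| random_state | int | 42 | Random seed passed to sklearn so that repeated runs on the same input produce the same split |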
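
Because test_size is applied to each category separately, the resulting counts depend on rounding within each category. The following is a minimal sketch of that arithmetic, not code from the repository: expected_split_sizes is a hypothetical helper, and it assumes sklearn's behavior for a fractional test_size with train_size left unset (test count rounded up, remainder assigned to train).

```python
import math


def expected_split_sizes(n_samples: int, test_size: float = 0.2) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: mirrors sklearn's rounding when train_size is not given,
    i.e. n_test = ceil(test_size * n_samples) and the rest goes to train.
    """
    if not 0.0 < test_size < 1.0:
        raise ValueError("a fractional test_size must lie strictly between 0 and 1")
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# Print the expected counts for a few category sizes.
for n in (10, 7, 3, 1):
    n_train, n_test = expected_split_sizes(n)
    print(f"{n} samples -> {n_train} train, {n_test} test")
```

In the last case the train side would be empty. As far as I know, sklearn raises a ValueError in that situation instead of returning an empty list, so categories with very few samples are worth checking before calling the split functions.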

Return Value

| Function | Return Type | Description |
|---|---|---|
| create_instruct_train_test_split | InstructTrainTestSplit | Contains train and test dictionaries mapping DataCategory to InstructDataset, plus test_split_size |
| create_preference_train_test_split | PreferenceTrainTestSplit | Contains train and test dictionaries mapping DataCategory to PreferenceDataset, plus test_split_size |

Behavior

Both functions follow the same algorithm:

  1. Iterate over each category in the input dictionary
  2. Serialize samples to dictionaries via pydantic's model_dump()
  3. Check whether the category has any samples; an empty category skips the split and gets empty train and test lists
  4. Split using sklearn.model_selection.train_test_split() with the specified test_size and random_state
  5. Deserialize the split dictionaries back into typed sample objects
  6. Wrap samples in dataset containers with the appropriate category
  7. Return a TrainTestSplit object containing all categories' train and test datasets
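
The seven steps above can be traced in a dependency-free sketch. Everything in it is a stand-in chosen for illustration: Sample and Dataset are hypothetical dataclasses replacing the pydantic domain types, asdict() plays the role of model_dump(), and seeded_split uses a seeded random shuffle in place of sklearn's train_test_split, so the exact partition will differ from what the real functions produce.

```python
import random
from dataclasses import asdict, dataclass
from math import ceil


@dataclass
class Sample:  # hypothetical stand-in for InstructDatasetSample
    instruction: str
    answer: str


@dataclass
class Dataset:  # hypothetical stand-in for InstructDataset
    category: str
    samples: list[Sample]


def seeded_split(items: list, test_size: float, seed: int) -> tuple[list, list]:
    """Stand-in for sklearn's train_test_split: seeded shuffle, then cut."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    n_test = ceil(test_size * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]


def split_by_category(data: dict[str, Dataset], test_size: float = 0.2, seed: int = 42):
    train, test = {}, {}
    for category, dataset in data.items():                # step 1
        dicts = [asdict(s) for s in dataset.samples]      # step 2
        if dicts:                                         # step 3
            train_dicts, test_dicts = seeded_split(dicts, test_size, seed)  # step 4
            train_samples = [Sample(**d) for d in train_dicts]              # step 5
            test_samples = [Sample(**d) for d in test_dicts]
        else:
            train_samples, test_samples = [], []
        train[category] = Dataset(category, train_samples)  # step 6
        test[category] = Dataset(category, test_samples)
    return train, test                                    # step 7


data = {
    "articles": Dataset("articles", [Sample(f"q{i}", f"a{i}") for i in range(10)]),
    "posts": Dataset("posts", []),  # empty category exercises the guard in step 3
}
train, test = split_by_category(data)
print({c: (len(train[c].samples), len(test[c].samples)) for c in data})
# -> {'articles': (8, 2), 'posts': (0, 0)}
```

Calling split_by_category again with the same input and seed returns the same partition, which is the property the random_state parameter provides in the real functions.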

Usage Example

from llm_engineering.application.dataset.utils import create_instruct_train_test_split
from llm_engineering.domain.dataset import InstructDataset, DataCategory

# Assume datasets were generated by DatasetGenerator.generate()
data = {
    DataCategory.ARTICLES: InstructDataset(category=DataCategory.ARTICLES, samples=[...]),
    DataCategory.POSTS: InstructDataset(category=DataCategory.POSTS, samples=[...]),
}

split = create_instruct_train_test_split(data, test_size=0.2, random_state=42)

for category in split.train:
    train_count = len(split.train[category].samples)
    test_count = len(split.test[category].samples)
    print(f"{category.value}: {train_count} train, {test_count} test")

External Dependencies

| Dependency | Purpose |
|---|---|
| sklearn.model_selection.train_test_split | Core splitting algorithm with random state support for reproducibility |

Design Notes

  • Serialization round-trip -- Samples are converted to dictionaries with model_dump() before splitting and rebuilt with the sample constructor afterwards. train_test_split can index a plain Python list of arbitrary objects, so the round-trip is not strictly required for the split itself; its effect is that sklearn only ever sees plain dictionaries and every sample is re-validated by pydantic when it is reconstructed.
  • Per-category splitting -- Each category is split independently, so every non-empty category contributes roughly the requested test fraction to the test set instead of leaving the category mix of each split to chance.
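  • Empty category handling -- The explicit if len(samples_dicts) > 0 check avoids the error sklearn raises when train_test_split receives an empty list. It does not cover very small categories: with the default test_size=0.2, a category holding a single sample would leave the train side empty, which sklearn also rejects with a ValueError.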
  • Two separate functions -- Rather than a single generic function, two type-specific functions are provided for clear typing and IDE support.
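
The serialization round-trip from the first note can be illustrated without pydantic. This is a sketch under stated assumptions: Sample is a hypothetical dataclass, asdict() stands in for model_dump(), and the __post_init__ check stands in for the validation pydantic runs when a model is constructed.

```python
from dataclasses import asdict, dataclass


@dataclass
class Sample:  # hypothetical stand-in for a pydantic sample model
    instruction: str
    answer: str

    def __post_init__(self) -> None:
        # Stand-in for pydantic validation, which runs on construction.
        if not self.instruction:
            raise ValueError("instruction must not be empty")


original = Sample(instruction="Summarize the post.", answer="A short summary.")

as_dict = asdict(original)   # plays the role of model_dump()
rebuilt = Sample(**as_dict)  # plays the role of InstructDatasetSample(**d)

print(rebuilt == original)   # True: same field values
print(rebuilt is original)   # False: a freshly constructed object

# A dictionary that no longer satisfies the checks is rejected on reconstruction.
try:
    Sample(**{**as_dict, "instruction": ""})
except ValueError as err:
    print(f"rejected on reconstruction: {err}")
```

The rebuilt object is equal to the original but is constructed from scratch, so any check attached to construction runs a second time after the split.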
