Implementation:PacktPublishing LLM Engineers Handbook Create Train Test Split
| Aspect | Detail |
|---|---|
| API | `create_instruct_train_test_split(data, test_size=0.2, random_state=42) -> InstructTrainTestSplit` and `create_preference_train_test_split(data, test_size=0.2, random_state=42) -> PreferenceTrainTestSplit` |
| Source | llm_engineering/application/dataset/utils.py:L16-71 |
| Type | API Doc |
| Implements | Principle:PacktPublishing_LLM_Engineers_Handbook_Dataset_Splitting |
Summary
The create_instruct_train_test_split and create_preference_train_test_split functions partition generated datasets into reproducible train/test splits. They iterate over categories, serialize samples to dictionaries, apply sklearn's train_test_split with a fixed random state, then reconstruct typed domain objects. Each function returns a typed TrainTestSplit container with the partitioned data.
Source Code
Instruction Dataset Split
def create_instruct_train_test_split(
    data: dict[DataCategory, InstructDataset],
    test_size=0.2,
    random_state=42,
) -> InstructTrainTestSplit:
    train_data = {}
    test_data = {}
    for category, dataset in data.items():
        samples = dataset.samples
        samples_dicts = [sample.model_dump() for sample in samples]
        if len(samples_dicts) > 0:
            train_samples_dicts, test_samples_dicts = train_test_split(
                samples_dicts, test_size=test_size, random_state=random_state
            )
            train_samples = [
                InstructDatasetSample(**d) for d in train_samples_dicts
            ]
            test_samples = [
                InstructDatasetSample(**d) for d in test_samples_dicts
            ]
        else:
            train_samples, test_samples = [], []
        train_data[category] = InstructDataset(
            category=category, samples=train_samples
        )
        test_data[category] = InstructDataset(
            category=category, samples=test_samples
        )
    return InstructTrainTestSplit(
        train=train_data, test=test_data, test_split_size=test_size
    )
Preference Dataset Split
The preference variant follows the same structure but operates on PreferenceDataset and PreferenceDatasetSample types:
def create_preference_train_test_split(
    data: dict[DataCategory, PreferenceDataset],
    test_size=0.2,
    random_state=42,
) -> PreferenceTrainTestSplit:
    train_data = {}
    test_data = {}
    for category, dataset in data.items():
        samples = dataset.samples
        samples_dicts = [sample.model_dump() for sample in samples]
        if len(samples_dicts) > 0:
            train_samples_dicts, test_samples_dicts = train_test_split(
                samples_dicts, test_size=test_size, random_state=random_state
            )
            train_samples = [
                PreferenceDatasetSample(**d) for d in train_samples_dicts
            ]
            test_samples = [
                PreferenceDatasetSample(**d) for d in test_samples_dicts
            ]
        else:
            train_samples, test_samples = [], []
        train_data[category] = PreferenceDataset(
            category=category, samples=train_samples
        )
        test_data[category] = PreferenceDataset(
            category=category, samples=test_samples
        )
    return PreferenceTrainTestSplit(
        train=train_data, test=test_data, test_split_size=test_size
    )
Import
from llm_engineering.application.dataset.utils import (
    create_instruct_train_test_split,
    create_preference_train_test_split,
)
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `data` | `dict[DataCategory, InstructDataset]` or `dict[DataCategory, PreferenceDataset]` | (required) | Generated datasets grouped by data category |
| `test_size` | `float` | `0.2` | Fraction of each category's samples to allocate to the test set (strictly between 0.0 and 1.0) |
| `random_state` | `int` | `42` | Random seed for reproducible splits |
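To make `test_size` concrete, the sketch below computes the train and test counts a single category would receive. It assumes the rounding sklearn applies when `test_size` is a float and `train_size` is left unset (test count rounded up, remainder assigned to train). `expected_split_sizes` is a hypothetical helper for illustration and is not part of the repository.

```python
import math


def expected_split_sizes(n_samples: int, test_size: float = 0.2) -> tuple[int, int]:
    """Return (n_train, n_test) for one category.

    Hypothetical helper: mirrors the rounding assumed for sklearn when
    test_size is a float and train_size is not given.
    """
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be strictly between 0.0 and 1.0")
    n_test = math.ceil(test_size * n_samples)  # test count is rounded up
    n_train = n_samples - n_test               # the remainder goes to train
    return n_train, n_test


for n in (10, 7, 1):
    n_train, n_test = expected_split_sizes(n)
    print(f"{n} samples -> {n_train} train, {n_test} test")
```

Under this rounding, a category with a single sample would end up with an empty train side, which sklearn rejects with a `ValueError`. The `len(samples_dicts) > 0` guard in the source covers only the zero-sample case, so very small categories may need separate handling.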
Return Value
| Function | Return Type | Description |
|---|---|---|
| `create_instruct_train_test_split` | `InstructTrainTestSplit` | Contains `train` and `test` dictionaries mapping `DataCategory` to `InstructDataset`, plus `test_split_size` |
| `create_preference_train_test_split` | `PreferenceTrainTestSplit` | Contains `train` and `test` dictionaries mapping `DataCategory` to `PreferenceDataset`, plus `test_split_size` |
Behavior
Both functions follow the same algorithm:
- Iterate over each category in the input dictionary
- Serialize samples to dictionaries via pydantic's `model_dump()`
- Check if the category has any samples (handle empty categories gracefully)
- Split using `sklearn.model_selection.train_test_split()` with the specified `test_size` and `random_state`
- Deserialize the split dictionaries back into typed sample objects
- Wrap samples in dataset containers with the appropriate category
- Return a `TrainTestSplit` object containing all categories' train and test datasets
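The steps above can be sketched with the standard library alone. This is a simplified illustration, not the repository's implementation: a seeded `random.Random` shuffle stands in for sklearn's `train_test_split`, `dataclasses.asdict` stands in for pydantic's `model_dump()`, and `Sample` (with illustrative field names) and `split_by_category` are hypothetical names.

```python
import math
import random
from dataclasses import asdict, dataclass


@dataclass
class Sample:  # hypothetical stand-in for InstructDatasetSample
    instruction: str
    answer: str


def split_by_category(
    data: dict[str, list[Sample]],
    test_size: float = 0.2,
    random_state: int = 42,
) -> tuple[dict[str, list[Sample]], dict[str, list[Sample]]]:
    """Split each category independently into (train, test) dictionaries."""
    train_data: dict[str, list[Sample]] = {}
    test_data: dict[str, list[Sample]] = {}

    for category, samples in data.items():
        # Serialize to plain dicts (stand-in for pydantic's model_dump()).
        sample_dicts = [asdict(sample) for sample in samples]

        if sample_dicts:
            # Seeded shuffle: a simplified substitute for train_test_split.
            rng = random.Random(random_state)
            rng.shuffle(sample_dicts)
            n_test = math.ceil(test_size * len(sample_dicts))
            test_dicts = sample_dicts[:n_test]
            train_dicts = sample_dicts[n_test:]
        else:
            # Empty categories are kept, with empty train and test lists.
            train_dicts, test_dicts = [], []

        # Rebuild typed objects from the dictionaries.
        train_data[category] = [Sample(**d) for d in train_dicts]
        test_data[category] = [Sample(**d) for d in test_dicts]

    return train_data, test_data


data = {
    "articles": [Sample(f"q{i}", f"a{i}") for i in range(10)],
    "posts": [],
}
train, test = split_by_category(data)
print({c: (len(train[c]), len(test[c])) for c in data})
# {'articles': (8, 2), 'posts': (0, 0)}
```

Because the seed is fixed, calling `split_by_category(data)` again yields the same partition, which is the property `random_state` provides in the real functions.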
Usage Example
from llm_engineering.application.dataset.utils import create_instruct_train_test_split
from llm_engineering.domain.dataset import InstructDataset, DataCategory

# Assume datasets were generated by DatasetGenerator.generate();
# each samples=[...] is a placeholder for a list of real sample objects.
data = {
    DataCategory.ARTICLES: InstructDataset(category=DataCategory.ARTICLES, samples=[...]),
    DataCategory.POSTS: InstructDataset(category=DataCategory.POSTS, samples=[...]),
}

# Split every category with the same test fraction and seed.
split = create_instruct_train_test_split(data, test_size=0.2, random_state=42)

# Report the per-category sizes of both splits.
for category in split.train:
    train_count = len(split.train[category].samples)
    test_count = len(split.test[category].samples)
    print(f"{category.value}: {train_count} train, {test_count} test")
External Dependencies
| Dependency | Purpose |
|---|---|
| `sklearn.model_selection.train_test_split` | Core splitting algorithm with random state support for reproducibility |
Design Notes
- Serialization round-trip -- Samples are converted to dictionaries before splitting and reconstructed afterwards. sklearn's `train_test_split` accepts plain Python lists, so the split runs on simple dictionaries, and rebuilding each sample via `model_dump()` / `**dict` re-runs pydantic validation on reconstruction.
- Per-category splitting -- Each category is split independently, so every category contributes roughly the same `test_size` fraction of its samples to the test split.
- Empty category handling -- The explicit `if len(samples_dicts) > 0` check avoids the error sklearn raises when `train_test_split` receives an empty list.
- Two separate functions -- Rather than a single generic function, two type-specific functions are provided for clear typing and IDE support.
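The serialization round-trip can be illustrated in isolation. The sketch below is an analogy using only the standard library: `PreferenceSample` is a hypothetical dataclass standing in for `PreferenceDatasetSample` (field names are illustrative), `asdict` plays the role of `model_dump()`, and the `__post_init__` check imitates validation that runs whenever an object is constructed.

```python
from dataclasses import asdict, dataclass


@dataclass
class PreferenceSample:  # hypothetical stand-in for PreferenceDatasetSample
    instruction: str
    rejected: str
    chosen: str

    def __post_init__(self) -> None:
        # Stand-in for pydantic validation, which runs on construction.
        for name, value in vars(self).items():
            if not isinstance(value, str):
                raise TypeError(f"{name} must be a str, got {type(value).__name__}")


original = PreferenceSample(instruction="Summarize X", rejected="bad", chosen="good")
as_dict = asdict(original)             # plain dict, analogous to model_dump()
rebuilt = PreferenceSample(**as_dict)  # validation runs again on reconstruction

print(rebuilt == original)   # same field values
print(rebuilt is original)   # but a new object
```

The rebuilt object compares equal to the original but is a distinct instance, so the train and test datasets hold freshly validated copies rather than references to the input samples.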
See Also
- Principle:PacktPublishing_LLM_Engineers_Handbook_Dataset_Splitting -- The principle this implementation realizes
- Implementation:PacktPublishing_LLM_Engineers_Handbook_DatasetGenerator_Generate -- The preceding step that produces raw datasets
- Implementation:PacktPublishing_LLM_Engineers_Handbook_TrainTestSplit_To_Huggingface -- The next step that converts splits for publishing
- Heuristic:PacktPublishing_LLM_Engineers_Handbook_Dataset_Generation_Quality_Filters