
Implementation:PacktPublishing LLM Engineers Handbook DatasetGenerator Generate

From Leeroopedia


Aspect Detail
API DatasetGenerator.generate(cls, prompts: dict[DataCategory, list[GenerateDatasetSamplesPrompt]], test_size: float = 0.2, mock: bool = False) -> TrainTestSplit
Source llm_engineering/application/dataset/generation.py:L93-167
Type API Doc
Implements Principle:PacktPublishing_LLM_Engineers_Handbook_LLM_Dataset_Generation

Summary

The generate class method on DatasetGenerator (and its subclasses InstructionDatasetGenerator and PreferenceDatasetGenerator) orchestrates the end-to-end process of calling an LLM to produce synthetic training samples from structured prompts. It handles LLM initialization, output parsing, batch processing, error recovery, dataset construction, and train/test splitting. The method returns a complete TrainTestSplit object ready for publishing.

Source Code

@classmethod
def generate(
    cls,
    prompts: dict[DataCategory, list[GenerateDatasetSamplesPrompt]],
    test_size: float = 0.2,
    mock: bool = False,
) -> TrainTestSplit:
    if mock:
        llm = FakeListLLM(
            responses=[constants.get_mocked_response(cls.dataset_type)]
        )
    else:
        llm = ChatOpenAI(
            model=settings.OPENAI_MODEL_ID,
            api_key=settings.OPENAI_API_KEY,
            max_tokens=2000 if cls.dataset_type == DatasetType.PREFERENCE else 1200,
            temperature=0.7,
        )

    parser = ListPydanticOutputParser(pydantic_object=cls._get_dataset_sample_type())
    chain = llm | parser

    datasets = {}
    for category, category_prompts in prompts.items():
        langchain_category_prompts = [
            _to_langchain(prompt) for prompt in category_prompts
        ]
        batches = utils.misc.batch(langchain_category_prompts, size=24)

        flattened = []
        for batch in batches:
            try:
                batched_samples = chain.batch(batch, stop=None)
                for sample_batch in batched_samples:
                    flattened.extend(sample_batch)
            except OutputParserException:
                logger.exception(
                    f"Failed to parse batch for category {category}"
                )

        dataset = domain.dataset.build_dataset(
            dataset_type=cls.dataset_type,
            category=category,
            samples=flattened,
        )
        datasets[category] = dataset

    processed_datasets = cls.post_process_datasets(datasets, test_size=test_size)

    return processed_datasets

Import

from llm_engineering.application.dataset.generation import (
    InstructionDatasetGenerator,
    PreferenceDatasetGenerator,
)

Parameters

Parameter Type Default Description
prompts dict[DataCategory, list[GenerateDatasetSamplesPrompt]] (required) Prompts grouped by category, as produced by get_prompts()
test_size float 0.2 Fraction of generated samples to reserve for the test set
mock bool False When True, uses a FakeListLLM instead of the real OpenAI API for testing

Return Value

Type Description
TrainTestSplit A container holding train and test datasets organized by category. Specifically either InstructTrainTestSplit or PreferenceTrainTestSplit depending on the subclass.
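The shape of the returned container can be pictured with a minimal dataclass sketch. Field names here are assumptions based on the description above, not the repository's exact Pydantic model:

```python
from dataclasses import dataclass

@dataclass
class TrainTestSplitSketch:
    # train/test each map a category name to its dataset object;
    # the real TrainTestSplit is a Pydantic model in the repository.
    train: dict
    test: dict
    test_split: float  # fraction reserved for testing, e.g. 0.2

split = TrainTestSplitSketch(
    train={"articles": ["sample_1", "sample_2"]},
    test={"articles": ["sample_3"]},
    test_split=0.2,
)
```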

Behavior

The method executes the following steps:

1. LLM Initialization

  • In mock mode: Creates a FakeListLLM with predetermined responses from constants.get_mocked_response()
  • In production mode: Creates a ChatOpenAI instance with:
  • Model: settings.OPENAI_MODEL_ID (e.g., gpt-4o-mini)
    • Temperature: 0.7 for creative but grounded generation
    • Max tokens: 2000 for preference datasets, 1200 for instruction datasets
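The branching in step 1 can be condensed into a small helper. This is a sketch: the DatasetType enum is simplified, and the returned dicts stand in for the real FakeListLLM/ChatOpenAI objects:

```python
from enum import Enum

class DatasetType(str, Enum):
    INSTRUCTION = "instruction"
    PREFERENCE = "preference"

def llm_config(dataset_type: DatasetType, mock: bool = False) -> dict:
    """Choose LLM settings the way generate() does: a fake LLM in mock
    mode; otherwise OpenAI with a larger completion budget for preference
    data (two answers per sample) than for instruction data."""
    if mock:
        return {"kind": "fake"}
    return {
        "kind": "openai",
        "max_tokens": 2000 if dataset_type is DatasetType.PREFERENCE else 1200,
        "temperature": 0.7,
    }
```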

2. Chain Construction

  • Creates a ListPydanticOutputParser configured with the appropriate sample type (InstructDatasetSample or PreferenceDatasetSample)
  • Composes the chain as llm | parser using LangChain's pipe operator
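A minimal stand-in for the parser half of the chain is sketched below. A dataclass replaces the real Pydantic sample model, and the real ListPydanticOutputParser additionally handles markdown fences and schema validation:

```python
import json
from dataclasses import dataclass

@dataclass
class InstructSample:
    instruction: str
    answer: str

def parse_sample_list(raw: str) -> list[InstructSample]:
    """Parse an LLM response expected to be a JSON array of objects
    matching the sample schema; raise ValueError on malformed output,
    analogous to OutputParserException in the real chain."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("malformed LLM output") from exc
    return [InstructSample(**item) for item in items]
```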

3. Batch Processing

For each category in the prompts dictionary:

  • Converts domain prompts to LangChain format via _to_langchain()
  • Splits into batches of 24 prompts using utils.misc.batch()
  • Calls chain.batch(batch) for each batch, which:
    • Sends all prompts in the batch to the LLM concurrently
    • Parses each response into a list of typed sample objects
  • Flattens all sample lists into a single list per category
  • Catches OutputParserException for batches where the LLM returns malformed output
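The batching-with-error-isolation loop above can be sketched independently of LangChain. Here ValueError stands in for OutputParserException, and run_batch stands in for chain.batch plus parsing:

```python
def batch(items: list, size: int) -> list[list]:
    """Split items into consecutive chunks of at most `size` elements,
    mirroring utils.misc.batch."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def collect_samples(prompts: list, run_batch) -> list:
    """Flatten samples across batches; a parse failure drops only the
    failing batch, never the whole run."""
    flattened = []
    for chunk in batch(prompts, size=24):
        try:
            flattened.extend(run_batch(chunk))
        except ValueError:
            continue  # the real code logs the exception and moves on
    return flattened
```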

4. Dataset Construction

  • Calls domain.dataset.build_dataset() to wrap samples in the appropriate dataset domain object
  • Maps each category to its constructed dataset
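Step 4's mapping can be sketched as follows, with a placeholder dataclass standing in for the dataset domain objects that domain.dataset.build_dataset() actually constructs:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetSketch:
    category: str
    samples: list = field(default_factory=list)

def build_datasets(samples_by_category: dict[str, list]) -> dict:
    """Wrap each category's flattened samples in a dataset object."""
    return {
        category: DatasetSketch(category=category, samples=samples)
        for category, samples in samples_by_category.items()
    }
```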

5. Post-Processing and Splitting

  • Calls cls.post_process_datasets(datasets, test_size=test_size) which:
    • Applies category-specific filtering (e.g., removing short answers from preference datasets)
    • Performs train/test splitting via sklearn.model_selection.train_test_split
    • Returns a TrainTestSplit object
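The split itself can be illustrated without sklearn. The real code delegates to sklearn.model_selection.train_test_split, which also shuffles; this deterministic sketch only shows the proportion logic:

```python
def split_samples(samples: list, test_size: float = 0.2) -> tuple[list, list]:
    """Return (train, test); the test set holds the test_size fraction."""
    n_test = int(len(samples) * test_size)
    return samples[n_test:], samples[:n_test]
```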

Usage Example

from llm_engineering.application.dataset.generation import InstructionDatasetGenerator

# Assume prompts were generated in the previous step
prompts = InstructionDatasetGenerator.get_prompts(documents)

# Generate instruction dataset (mock mode for testing)
train_test_split = InstructionDatasetGenerator.generate(
    prompts=prompts,
    test_size=0.2,
    mock=True,
)

# train and test are dicts keyed by DataCategory
print(f"Train categories: {len(train_test_split.train)}")
print(f"Test categories: {len(train_test_split.test)}")

External Dependencies

Dependency Purpose
langchain_openai ChatOpenAI wrapper for OpenAI API calls
langchain PromptTemplate, ListPydanticOutputParser, chain composition
loguru Structured logging for error reporting and progress tracking

Configuration

Setting Source Description
OPENAI_MODEL_ID settings The OpenAI model to use (e.g., gpt-4o-mini)
OPENAI_API_KEY settings API key for OpenAI authentication
Batch size Hardcoded 24 prompts per batch to manage rate limits
Temperature Hardcoded 0.7 for diverse but grounded output

Design Notes

  • The batch size of 24 is a pragmatic choice that balances throughput (more concurrent requests) against API rate limits and memory usage.
  • Error isolation at the batch level means a single malformed response only causes the loss of one batch, not the entire generation run.
  • The FakeListLLM mock enables deterministic testing of the entire pipeline without external API dependencies.
  • Using LangChain's pipe operator (|) creates a clean separation between LLM invocation and output parsing.
