
Implementation:PacktPublishing LLM Engineers Handbook DatasetGenerator Generate

From Leeroopedia


Aspect Detail
API DatasetGenerator.generate(cls, prompts: dict[DataCategory, list[GenerateDatasetSamplesPrompt]], test_size: float = 0.2, mock: bool = False) -> TrainTestSplit
Source llm_engineering/application/dataset/generation.py:L93-167
Type API Doc
Implements Principle:PacktPublishing_LLM_Engineers_Handbook_LLM_Dataset_Generation

Summary

The generate class method on DatasetGenerator (and its subclasses InstructionDatasetGenerator and PreferenceDatasetGenerator) orchestrates the end-to-end process of calling an LLM to produce synthetic training samples from structured prompts. It handles LLM initialization, output parsing, batch processing, error recovery, dataset construction, and train/test splitting. The method returns a complete TrainTestSplit object ready for publishing.

Source Code

@classmethod
def generate(
    cls,
    prompts: dict[DataCategory, list[GenerateDatasetSamplesPrompt]],
    test_size: float = 0.2,
    mock: bool = False,
) -> TrainTestSplit:
    if mock:
        llm = FakeListLLM(
            responses=[constants.get_mocked_response(cls.dataset_type)]
        )
    else:
        llm = ChatOpenAI(
            model=settings.OPENAI_MODEL_ID,
            api_key=settings.OPENAI_API_KEY,
            max_tokens=2000 if cls.dataset_type == DatasetType.PREFERENCE else 1200,
            temperature=0.7,
        )

    parser = ListPydanticOutputParser(pydantic_object=cls._get_dataset_sample_type())
    chain = llm | parser

    datasets = {}
    for category, category_prompts in prompts.items():
        langchain_category_prompts = [
            _to_langchain(prompt) for prompt in category_prompts
        ]
        batches = utils.misc.batch(langchain_category_prompts, size=24)

        flattened = []
        for batch in batches:
            try:
                batched_samples = chain.batch(batch, stop=None)
                for sample_batch in batched_samples:
                    flattened.extend(sample_batch)
            except OutputParserException:
                logger.exception(
                    f"Failed to parse batch for category {category}"
                )

        dataset = domain.dataset.build_dataset(
            dataset_type=cls.dataset_type,
            category=category,
            samples=flattened,
        )
        datasets[category] = dataset

    processed_datasets = cls.post_process_datasets(datasets, test_size=test_size)

    return processed_datasets

Import

from llm_engineering.application.dataset.generation import (
    InstructionDatasetGenerator,
    PreferenceDatasetGenerator,
)

Parameters

Parameter Type Default Description
prompts dict[DataCategory, list[GenerateDatasetSamplesPrompt]] (required) Prompts grouped by category, as produced by get_prompts()
test_size float 0.2 Fraction of generated samples to reserve for the test set
mock bool False When True, uses a FakeListLLM instead of the real OpenAI API for testing

Return Value

Type Description
TrainTestSplit A container holding train and test datasets organized by category. Specifically either InstructTrainTestSplit or PreferenceTrainTestSplit depending on the subclass.
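The shape of the returned container can be pictured with a minimal dataclass sketch. Field names here are assumptions based on the description above, not the repository's exact Pydantic model:

```python
from dataclasses import dataclass

@dataclass
class TrainTestSplitSketch:
    # train/test each map a category name to its dataset object;
    # the real TrainTestSplit is a Pydantic model in the repository.
    train: dict
    test: dict
    test_split: float  # fraction reserved for testing, e.g. 0.2

split = TrainTestSplitSketch(
    train={"articles": ["sample_1", "sample_2"]},
    test={"articles": ["sample_3"]},
    test_split=0.2,
)
```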

Behavior

The method executes the following steps:

1. LLM Initialization

  • In mock mode: Creates a FakeListLLM with predetermined responses from constants.get_mocked_response()
  • In production mode: Creates a ChatOpenAI instance with:
  • Model: settings.OPENAI_MODEL_ID (e.g., gpt-4o-mini)
    • Temperature: 0.7 for creative but grounded generation
    • Max tokens: 2000 for preference datasets, 1200 for instruction datasets
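The branching in step 1 can be condensed into a small helper. This is a sketch: the DatasetType enum is simplified, and the returned dicts stand in for the real FakeListLLM/ChatOpenAI objects:

```python
from enum import Enum

class DatasetType(str, Enum):
    INSTRUCTION = "instruction"
    PREFERENCE = "preference"

def llm_config(dataset_type: DatasetType, mock: bool = False) -> dict:
    """Choose LLM settings the way generate() does: a fake LLM in mock
    mode; otherwise OpenAI with a larger completion budget for preference
    data (two answers per sample) than for instruction data."""
    if mock:
        return {"kind": "fake"}
    return {
        "kind": "openai",
        "max_tokens": 2000 if dataset_type is DatasetType.PREFERENCE else 1200,
        "temperature": 0.7,
    }
```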

2. Chain Construction

  • Creates a ListPydanticOutputParser configured with the appropriate sample type (InstructDatasetSample or PreferenceDatasetSample)
  • Composes the chain as llm | parser using LangChain's pipe operator
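A minimal stand-in for the parser half of the chain is sketched below. A dataclass replaces the real Pydantic sample model, and the real ListPydanticOutputParser additionally handles markdown fences and schema validation:

```python
import json
from dataclasses import dataclass

@dataclass
class InstructSample:
    instruction: str
    answer: str

def parse_sample_list(raw: str) -> list[InstructSample]:
    """Parse an LLM response expected to be a JSON array of objects
    matching the sample schema; raise ValueError on malformed output,
    analogous to OutputParserException in the real chain."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("malformed LLM output") from exc
    return [InstructSample(**item) for item in items]
```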

3. Batch Processing

For each category in the prompts dictionary:

  • Converts domain prompts to LangChain format via _to_langchain()
  • Splits into batches of 24 prompts using utils.misc.batch()
  • Calls chain.batch(batch) for each batch, which:
    • Sends all prompts in the batch to the LLM concurrently
    • Parses each response into a list of typed sample objects
  • Flattens all sample lists into a single list per category
  • Catches OutputParserException for batches where the LLM returns malformed output
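The batching-with-error-isolation loop above can be sketched independently of LangChain. Here ValueError stands in for OutputParserException, and run_batch stands in for chain.batch plus parsing:

```python
def batch(items: list, size: int) -> list[list]:
    """Split items into consecutive chunks of at most `size` elements,
    mirroring utils.misc.batch."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def collect_samples(prompts: list, run_batch) -> list:
    """Flatten samples across batches; a parse failure drops only the
    failing batch, never the whole run."""
    flattened = []
    for chunk in batch(prompts, size=24):
        try:
            flattened.extend(run_batch(chunk))
        except ValueError:
            continue  # the real code logs the exception and moves on
    return flattened
```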

4. Dataset Construction

  • Calls domain.dataset.build_dataset() to wrap samples in the appropriate dataset domain object
  • Maps each category to its constructed dataset
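Step 4's mapping can be sketched as follows, with a placeholder dataclass standing in for the dataset domain objects that domain.dataset.build_dataset() actually constructs:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetSketch:
    category: str
    samples: list = field(default_factory=list)

def build_datasets(samples_by_category: dict[str, list]) -> dict:
    """Wrap each category's flattened samples in a dataset object."""
    return {
        category: DatasetSketch(category=category, samples=samples)
        for category, samples in samples_by_category.items()
    }
```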

5. Post-Processing and Splitting

  • Calls cls.post_process_datasets(datasets, test_size=test_size) which:
    • Applies category-specific filtering (e.g., removing short answers from preference datasets)
    • Performs train/test splitting via sklearn.model_selection.train_test_split
    • Returns a TrainTestSplit object
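The split itself can be illustrated without sklearn. The real code delegates to sklearn.model_selection.train_test_split, which also shuffles; this deterministic sketch only shows the proportion logic:

```python
def split_samples(samples: list, test_size: float = 0.2) -> tuple[list, list]:
    """Return (train, test); the test set holds the test_size fraction."""
    n_test = int(len(samples) * test_size)
    return samples[n_test:], samples[:n_test]
```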

Usage Example

from llm_engineering.application.dataset.generation import InstructionDatasetGenerator

# Assume prompts were generated in the previous step
prompts = InstructionDatasetGenerator.get_prompts(documents)

# Generate instruction dataset (mock mode for testing)
train_test_split = InstructionDatasetGenerator.generate(
    prompts=prompts,
    test_size=0.2,
    mock=True,
)

# train and test are dicts keyed by DataCategory
print(f"Train categories: {len(train_test_split.train)}")
print(f"Test categories: {len(train_test_split.test)}")

External Dependencies

Dependency Purpose
langchain_openai ChatOpenAI wrapper for OpenAI API calls
langchain PromptTemplate, ListPydanticOutputParser, chain composition
loguru Structured logging for error reporting and progress tracking

Configuration

Setting Source Description
OPENAI_MODEL_ID settings The OpenAI model to use (e.g., gpt-4o-mini)
OPENAI_API_KEY settings API key for OpenAI authentication
Batch size Hardcoded 24 prompts per batch to manage rate limits
Temperature Hardcoded 0.7 for diverse but grounded output

Design Notes

  • The batch size of 24 is a pragmatic choice that balances throughput (more concurrent requests) against API rate limits and memory usage.
  • Error isolation at the batch level means a single malformed response only causes the loss of one batch, not the entire generation run.
  • The FakeListLLM mock enables deterministic testing of the entire pipeline without external API dependencies.
  • Using LangChain's pipe operator (|) creates a clean separation between LLM invocation and output parsing.
