Summary
The generate class method on DatasetGenerator (and its subclasses InstructionDatasetGenerator and PreferenceDatasetGenerator) orchestrates the end-to-end process of calling an LLM to produce synthetic training samples from structured prompts. It handles LLM initialization, output parsing, batch processing, error recovery, dataset construction, and train/test splitting. The method returns a complete TrainTestSplit object ready for publishing.
Source Code
```python
@classmethod
def generate(cls, prompts, test_size=0.2, mock=False) -> TrainTestSplit:
    if mock:
        llm = FakeListLLM(
            responses=[constants.get_mocked_response(cls.dataset_type)]
        )
    else:
        llm = ChatOpenAI(
            model=settings.OPENAI_MODEL_ID,
            api_key=settings.OPENAI_API_KEY,
            max_tokens=2000 if cls.dataset_type == DatasetType.PREFERENCE else 1200,
            temperature=0.7,
        )
    parser = ListPydanticOutputParser(pydantic_object=cls._get_dataset_sample_type())
    chain = llm | parser

    datasets = {}
    for category, category_prompts in prompts.items():
        langchain_category_prompts = [
            _to_langchain(prompt) for prompt in category_prompts
        ]
        batches = utils.misc.batch(langchain_category_prompts, size=24)

        flattened = []
        for batch in batches:
            try:
                batched_samples = chain.batch(batch, stop=None)
                for sample_batch in batched_samples:
                    flattened.extend(sample_batch)
            except OutputParserException:
                logger.exception(f"Failed to parse batch for category {category}")

        dataset = domain.dataset.build_dataset(
            dataset_type=cls.dataset_type,
            category=category,
            samples=flattened,
        )
        datasets[category] = dataset

    processed_datasets = cls.post_process_datasets(datasets, test_size=test_size)

    return processed_datasets
```
Import
```python
from llm_engineering.application.dataset.generation import (
    InstructionDatasetGenerator,
    PreferenceDatasetGenerator,
)
```
Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| prompts | dict[DataCategory, list[GenerateDatasetSamplesPrompt]] | (required) | Prompts grouped by category, as produced by get_prompts() |
| test_size | float | 0.2 | Fraction of generated samples to reserve for the test set |
| mock | bool | False | When True, uses a FakeListLLM instead of the real OpenAI API for testing |
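The shape of the prompts argument can be sketched with plain stand-ins. The real DataCategory and GenerateDatasetSamplesPrompt types live in the project's domain layer; the FakePrompt class and the category names below are illustrative assumptions only:

```python
from dataclasses import dataclass


@dataclass
class FakePrompt:
    """Illustrative stand-in for GenerateDatasetSamplesPrompt."""

    content: str


# A mapping of category -> list of prompts, mirroring what
# get_prompts() is described as producing.
prompts = {
    "articles": [FakePrompt("Generate Q&A pairs from this article...")],
    "posts": [
        FakePrompt("Generate Q&A pairs from this post..."),
        FakePrompt("Generate Q&A pairs from this other post..."),
    ],
}

# generate() iterates the mapping category by category.
for category, category_prompts in prompts.items():
    print(category, len(category_prompts))
```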
Return Value
| Type | Description |
|------|-------------|
| TrainTestSplit | A container holding train and test datasets organized by category; concretely either InstructTrainTestSplit or PreferenceTrainTestSplit, depending on the subclass. |
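A minimal sketch of what such a container could look like, assuming it holds per-category train and test collections; the field names here are illustrative, not the real InstructTrainTestSplit/PreferenceTrainTestSplit definitions:

```python
from dataclasses import dataclass


@dataclass
class TrainTestSplitSketch:
    """Illustrative stand-in for a TrainTestSplit-style container."""

    train: dict  # category -> list of training samples
    test: dict   # category -> list of test samples
    test_split_size: float = 0.2


split = TrainTestSplitSketch(
    train={"articles": ["sample_a", "sample_b"]},
    test={"articles": ["sample_c"]},
)
print(len(split.train["articles"]), len(split.test["articles"]))
```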
Behavior
The method executes the following steps:
1. LLM Initialization
   - In mock mode: creates a FakeListLLM with predetermined responses from constants.get_mocked_response()
   - In production mode: creates a ChatOpenAI instance with:
     - Model: settings.OPENAI_MODEL_ID (typically GPT-4o-mini)
     - Temperature: 0.7 for creative but grounded generation
     - Max tokens: 2000 for preference datasets, 1200 for instruction datasets
2. Chain Construction
   - Creates a ListPydanticOutputParser configured with the appropriate sample type (InstructDatasetSample or PreferenceDatasetSample)
   - Composes the chain as llm | parser using LangChain's pipe operator
3. Batch Processing
   For each category in the prompts dictionary:
   - Converts domain prompts to LangChain format via _to_langchain()
   - Splits them into batches of 24 prompts using utils.misc.batch()
   - Calls chain.batch(batch) for each batch, which sends all prompts in the batch to the LLM concurrently and parses each response into a list of typed sample objects
   - Flattens all sample lists into a single list per category
   - Catches OutputParserException for batches where the LLM returns malformed output
4. Dataset Construction
   - Calls domain.dataset.build_dataset() to wrap the samples in the appropriate dataset domain object
   - Maps each category to its constructed dataset
5. Post-Processing and Splitting
   - Calls cls.post_process_datasets(datasets, test_size=test_size), which:
     - Applies category-specific filtering (e.g., removing short answers from preference datasets)
     - Performs train/test splitting via sklearn.model_selection.train_test_split
     - Returns a TrainTestSplit object
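The per-category batching loop from step 3 can be sketched in plain Python. The batch helper below mirrors the assumed behavior of utils.misc.batch (consecutive fixed-size chunks), and chain is a simplified stand-in callable rather than the real LangChain chain:

```python
def batch(items, size):
    # Assumed behavior of utils.misc.batch: yield consecutive chunks.
    for i in range(0, len(items), size):
        yield items[i:i + size]


class OutputParserException(Exception):
    pass


def run_batches(prompts, chain, size=24):
    flattened = []
    for chunk in batch(prompts, size):
        try:
            # The chain returns one list of samples per prompt in the chunk.
            for sample_list in chain(chunk):
                flattened.extend(sample_list)
        except OutputParserException:
            # A malformed response costs only this chunk, not the run.
            continue
    return flattened


# Example: 50 prompts in batches of 24, 24, and 2; the second batch fails.
calls = []

def fake_chain(chunk):
    calls.append(len(chunk))
    if len(calls) == 2:  # simulate one malformed batch
        raise OutputParserException
    return [[f"s{i}a", f"s{i}b"] for i in range(len(chunk))]

samples = run_batches([f"p{i}" for i in range(50)], fake_chain)
print(len(samples))  # 48 + 0 + 4 = 52 samples survive
```

This illustrates the error-isolation property called out in the Design Notes: losing one batch of 24 still leaves the samples from every other batch intact.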
Usage Example
```python
from llm_engineering.application.dataset.generation import InstructionDatasetGenerator

# Assume prompts were generated in the previous step
prompts = InstructionDatasetGenerator.get_prompts(documents)

# Generate instruction dataset (mock mode for testing)
train_test_split = InstructionDatasetGenerator.generate(
    prompts=prompts,
    test_size=0.2,
    mock=True,
)

print(f"Train samples: {len(train_test_split.train)}")
print(f"Test samples: {len(train_test_split.test)}")
```
External Dependencies
| Dependency | Purpose |
|------------|---------|
| langchain_openai | ChatOpenAI wrapper for OpenAI API calls |
| langchain | PromptTemplate, ListPydanticOutputParser, chain composition |
| loguru | Structured logging for error reporting and progress tracking |
Configuration
| Setting | Source | Description |
|---------|--------|-------------|
| OPENAI_MODEL_ID | settings | The OpenAI model to use (e.g., gpt-4o-mini) |
| OPENAI_API_KEY | settings | API key for OpenAI authentication |
| Batch size | Hardcoded | 24 prompts per batch to manage rate limits |
| Temperature | Hardcoded | 0.7 for diverse but grounded output |
Design Notes
- The batch size of 24 is a pragmatic choice that balances throughput (more concurrent requests) against API rate limits and memory usage.
- Error isolation at the batch level means a single malformed response only causes the loss of one batch, not the entire generation run.
- The FakeListLLM mock enables deterministic testing of the entire pipeline without external API dependencies.
- Using LangChain's pipe operator (|) creates a clean separation between LLM invocation and output parsing.
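The mock-based testing pattern can be illustrated with a self-contained stand-in. The class below is not LangChain's actual FakeListLLM, and parse_samples is an assumed simplification of the list parser; it only sketches why canned responses make the pipeline deterministic:

```python
import itertools
import json


class FakeLLMSketch:
    """Illustrative stand-in for a FakeListLLM-style mock: returns
    canned responses in order, cycling when the list is exhausted."""

    def __init__(self, responses):
        self._responses = itertools.cycle(responses)

    def invoke(self, _prompt):
        # The prompt is ignored; output depends only on the canned list.
        return next(self._responses)


def parse_samples(raw):
    # Stand-in for the list output parser: decode a JSON array of samples.
    return json.loads(raw)


mocked = json.dumps([{"instruction": "Say hi", "answer": "Hi!"}])
llm = FakeLLMSketch(responses=[mocked])
samples = parse_samples(llm.invoke("any prompt at all"))
print(samples[0]["answer"])
```

Because the response never depends on the prompt, every test run over the pipeline produces identical samples, with no network access required.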