Workflow:PacktPublishing LLM Engineers Handbook Dataset Generation
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, LLMs, Dataset_Generation |
| Last Updated | 2026-02-08 07:45 GMT |
Overview
End-to-end process for generating fine-tuning datasets (instruction-following and preference alignment) from cleaned documents using GPT-4o-mini, with automatic train/test splitting and optional upload to HuggingFace Hub.
Description
This workflow generates synthetic training data for fine-tuning LLMs by prompting GPT-4o-mini to produce structured samples from cleaned documents. It supports two dataset types: instruction datasets (instruction/answer pairs for SFT) and preference datasets (instruction/chosen/rejected triples for DPO). The pipeline retrieves cleaned documents from Qdrant, constructs Jinja2-templated prompts, batches them through the OpenAI API via LangChain, post-processes the results (filtering, splitting), and optionally pushes the final datasets to HuggingFace Hub.
Usage
Execute this workflow after the Feature Engineering pipeline has produced cleaned documents in Qdrant, when you need fine-tuning datasets for supervised fine-tuning (instruction format) or direct preference optimization (preference format). Run the instruction dataset generation first, then the preference dataset generation, because the training pipeline requires both.
Execution Steps
Step 1: Query Feature Store
Retrieve cleaned documents from the Qdrant vector database. These documents were produced by the Feature Engineering pipeline and contain the cleaned text that will serve as context for dataset sample generation.
Key considerations:
- Queries all CleanedDocument types from Qdrant (articles, posts, repositories)
- Documents are grouped by DataCategory for type-specific prompt generation
- The pipeline configuration controls which dataset type to generate
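The grouping described above can be sketched as follows. This is a minimal illustration, not the pipeline's own code: the `CleanedDocument` dataclass here is a hypothetical stand-in for the real domain model, the category strings are assumed, and the Qdrant query itself is left out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CleanedDocument:
    """Hypothetical stand-in for a cleaned document fetched from Qdrant."""

    id: str
    category: str  # assumed values: "articles", "posts", "repositories"
    content: str


def group_by_category(documents):
    """Bucket documents by category so each bucket can get its own prompt."""
    grouped = defaultdict(list)
    for doc in documents:
        grouped[doc.category].append(doc)
    return dict(grouped)


docs = [
    CleanedDocument("1", "articles", "Text of an article."),
    CleanedDocument("2", "posts", "Text of a post."),
    CleanedDocument("3", "articles", "Another article."),
]
grouped = group_by_category(docs)
```

Keeping the grouping as a plain dictionary keyed by category makes it easy to pick a category-specific prompt template in the next step.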
Step 2: Create Prompts
Generate structured prompts from the cleaned documents using the appropriate DatasetGenerator subclass. Each document is inserted into a Jinja2 template that instructs GPT-4o-mini to produce either instruction-answer pairs or instruction-chosen-rejected triples. Each prompt is then token-counted and truncated if it exceeds the model's context window.
Key considerations:
- InstructionDatasetGenerator produces prompts requesting 5 instruction-answer pairs per document
- PreferenceDatasetGenerator produces prompts requesting 5 instruction-chosen-rejected triples per document
- Token counting uses tiktoken to ensure prompts fit within the OpenAI model context window
- Prompts are grouped by DataCategory for batched processing
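A rough sketch of rendering a prompt and capping it at a token budget is shown below. To stay dependency-free it swaps in `string.Template` for Jinja2 and a whitespace token count for tiktoken, so the counts are only approximate. The template text, `build_prompt`, and the `max_tokens` value are all hypothetical.

```python
from string import Template

# Hypothetical template; the real pipeline uses Jinja2 templates.
PROMPT_TEMPLATE = Template(
    "Generate $n instruction-answer pairs as JSON from the text below.\n"
    "Text:\n$document"
)


def count_tokens(text):
    """Whitespace token count; a rough stand-in for a tiktoken encoding."""
    return len(text.split())


def build_prompt(document, n_samples=5, max_tokens=50):
    """Render the prompt and truncate it to at most max_tokens tokens."""
    prompt = PROMPT_TEMPLATE.substitute(n=n_samples, document=document)
    if count_tokens(prompt) > max_tokens:
        # Truncation drops the tail of the document and collapses whitespace.
        prompt = " ".join(prompt.split()[:max_tokens])
    return prompt


short_prompt = build_prompt("Qdrant stores cleaned documents.")
long_prompt = build_prompt("word " * 500, max_tokens=50)
```

Placing the document at the end of the template means truncation removes document text rather than the instructions to the model.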
Step 3: Generate Dataset Samples
Send the prompts to GPT-4o-mini via LangChain's ChatOpenAI interface and parse the JSON responses into typed dataset sample objects. Prompts are processed in batches of 24 for efficiency. A mock mode is available that returns predefined responses for development and testing without API calls.
Key considerations:
- Uses LangChain's batch processing with a custom ListPydanticOutputParser
- Failed JSON parses are logged and skipped rather than halting the pipeline
- Mock mode uses FakeListLLM with predefined responses from the constants module
- System prompt instructs the model to generate samples in the appropriate format
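The batching and skip-on-failure behaviour can be illustrated without any API calls. In this sketch `fake_llm_batch` is a hypothetical mock that plays the role of the model, and plain `json.loads` stands in for the custom output parser; neither reflects the actual LangChain classes.

```python
import json
import logging

logger = logging.getLogger(__name__)


def chunked(items, size=24):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def parse_samples(raw_responses):
    """Parse each response as a JSON list of samples, skipping bad ones."""
    samples = []
    for raw in raw_responses:
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            logger.warning("Skipping response that is not valid JSON")
            continue
        # Valid JSON that is not a list is ignored in this sketch.
        if isinstance(parsed, list):
            samples.extend(parsed)
    return samples


def fake_llm_batch(prompts):
    """Hypothetical mock: returns one canned response per prompt."""
    return ['[{"instruction": "q", "answer": "a"}]' for _ in prompts]


prompts = [f"prompt {i}" for i in range(50)]
batches = list(chunked(prompts, size=24))
responses = [r for batch in batches for r in fake_llm_batch(batch)]
responses.append("not json")  # simulate a malformed model reply
samples = parse_samples(responses)
```

With 50 prompts this yields two full batches and one partial batch, and the malformed reply is logged and dropped instead of raising.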
Step 4: Post-process and Split
Apply dataset-type-specific post-processing and create train/test splits. For instruction datasets, this involves a simple random split. For preference datasets, additional filtering removes short answers and answers with incorrect formatting before splitting.
Key considerations:
- Instruction datasets are split with a configurable test_size ratio (default 10%)
- Preference datasets undergo two extra filtering passes before splitting
- Random state is fixed (seed=42) for reproducible splits
- The result is a TrainTestSplit domain object containing both splits
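The filtering and splitting can be sketched as below. The two filter rules (a minimum length of 20 characters, and an answer that starts with a capital letter and ends with terminal punctuation) are illustrative assumptions, since the exact criteria are not stated here. The split uses the standard library's `random.Random(42)` as a stand-in for whatever splitting utility the pipeline uses.

```python
import random


def is_long_enough(sample, min_length=20):
    """Assumed rule: the chosen answer must have at least min_length chars."""
    return len(sample["chosen"]) >= min_length


def is_well_formatted(sample):
    """Assumed rule: starts with a capital letter, ends with . ! or ?"""
    answer = sample["chosen"]
    return bool(answer) and answer[0].isupper() and answer[-1] in ".!?"


def train_test_split(samples, test_size=0.1, seed=42):
    """Shuffle with a fixed seed and return (train, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


samples = [
    {"instruction": f"q{i}", "chosen": f"This is a complete answer number {i}.", "rejected": "no"}
    for i in range(18)
]
samples.append({"instruction": "q18", "chosen": "Too short.", "rejected": "no"})
samples.append(
    {"instruction": "q19", "chosen": "this answer lacks a capital letter and a full stop", "rejected": "no"}
)

filtered = [s for s in samples if is_long_enough(s) and is_well_formatted(s)]
train, test = train_test_split(filtered)
```

Because the seed is fixed, calling `train_test_split(filtered)` again returns the same two lists, which is what makes the splits reproducible across runs.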
Step 5: Push to HuggingFace Hub
Optionally upload the generated train and test dataset splits to HuggingFace Hub. The datasets are converted from domain models to HuggingFace Dataset objects and pushed to the configured repository.
Key considerations:
- Push is controlled by the push_to_huggingface pipeline parameter
- The dataset ID on HuggingFace is configurable (defaults to llmtwin for instruction datasets and llmtwin-dpo for preference datasets)
- Requires a valid HUGGINGFACE_ACCESS_TOKEN in environment configuration
- Both train and test splits are uploaded as separate dataset splits
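One possible shape for the conversion and guarded upload is sketched below. The helper names `to_columns` and `push_splits` and the repository ID are hypothetical. The `datasets` calls (`Dataset.from_dict`, `DatasetDict`, `push_to_hub`) come from the Hugging Face `datasets` library and are imported lazily, so the disabled path runs without that package or a token.

```python
import os


def to_columns(samples):
    """Turn a list of row dicts into the column dict Dataset.from_dict expects."""
    if not samples:
        return {}
    return {key: [s[key] for s in samples] for key in samples[0]}


def push_splits(train, test, dataset_id, push_to_huggingface=False):
    """Upload both splits when enabled; returns True if a push happened."""
    if not push_to_huggingface:
        return False
    # Imported lazily so the disabled path works without `datasets` installed.
    from datasets import Dataset, DatasetDict

    dataset = DatasetDict(
        {
            "train": Dataset.from_dict(to_columns(train)),
            "test": Dataset.from_dict(to_columns(test)),
        }
    )
    dataset.push_to_hub(dataset_id, token=os.environ["HUGGINGFACE_ACCESS_TOKEN"])
    return True


train = [
    {"instruction": "q1", "answer": "a1"},
    {"instruction": "q2", "answer": "a2"},
]
columns = to_columns(train)
# Hypothetical repository ID; the push is disabled here, so nothing is uploaded.
pushed = push_splits(train, [], "your-username/llmtwin")
```

Wrapping both splits in a single `DatasetDict` uploads them to one repository as separate `train` and `test` splits.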