Workflow:PacktPublishing LLM Engineers Handbook Dataset Generation

Knowledge Sources	LLM Engineers Handbook, OpenAI API Docs, HuggingFace Datasets Docs, LangChain Docs
Domains	Data_Engineering, LLMs, Dataset_Generation
Last Updated	2026-02-08 07:45 GMT

Overview

End-to-end process for generating fine-tuning datasets (instruction-following and preference alignment) from cleaned documents using GPT-4o-mini, with automatic train/test splitting and optional upload to HuggingFace Hub.

Description

This workflow generates synthetic training data for fine-tuning LLMs by prompting GPT-4o-mini to produce structured samples from cleaned documents. It supports two dataset types: instruction datasets (instruction/answer pairs for SFT) and preference datasets (instruction/chosen/rejected triples for DPO). The pipeline retrieves cleaned documents from Qdrant, constructs Jinja2-templated prompts, batches them through the OpenAI API via LangChain, post-processes the results (filtering, splitting), and optionally pushes the final datasets to HuggingFace Hub.
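
As a rough illustration of the two record shapes described above, the sketch below models them as plain dataclasses. The class names InstructionSample and PreferenceSample and the helper to_record are assumptions made for this example and may not match the identifiers used in the repository.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class InstructionSample:
    """One SFT record: a prompt and the answer to learn (hypothetical class name)."""
    instruction: str
    answer: str


@dataclass(frozen=True)
class PreferenceSample:
    """One DPO record: a prompt with a preferred and a dispreferred answer."""
    instruction: str
    chosen: str
    rejected: str


def to_record(sample) -> dict:
    """Flatten a sample into a plain dict, e.g. for JSON export."""
    return asdict(sample)


sft = InstructionSample(
    instruction="Explain RAG briefly.",
    answer="RAG retrieves context before generating.",
)
dpo = PreferenceSample(
    instruction="Explain RAG briefly.",
    chosen="RAG retrieves context before generating.",
    rejected="RAG is a thing.",
)
print(to_record(sft))
print(to_record(dpo))
```

The only structural difference between the two formats is that a preference record replaces the single answer with a chosen/rejected pair.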

Usage

Execute this workflow after the Feature Engineering pipeline has produced cleaned documents in Qdrant, when you need fine-tuning datasets for supervised fine-tuning (instruction format) or direct preference optimization (preference format). Run the instruction dataset generation first, then the preference dataset generation, because the training pipeline requires both.

Execution Steps

Step 1: Query Feature Store

Retrieve cleaned documents from the Qdrant vector database. These documents were produced by the Feature Engineering pipeline and contain the cleaned text that will serve as context for dataset sample generation.

Key considerations:

Queries all CleanedDocument types from Qdrant (articles, posts, repositories)
Documents are grouped by DataCategory for type-specific prompt generation
The pipeline configuration controls which dataset type to generate
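
The grouping mentioned in the considerations above can be sketched without a live Qdrant instance. The CleanedDocument shape below is a hypothetical, minimal stand-in (the real domain model likely carries more fields), and group_by_category is an illustrative helper rather than the pipeline's own function.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CleanedDocument:
    # Hypothetical minimal shape used only for this example.
    id: str
    category: str  # e.g. "articles", "posts", "repositories"
    content: str


def group_by_category(documents):
    """Return {category: [documents]}, preserving input order within each group."""
    grouped = defaultdict(list)
    for doc in documents:
        grouped[doc.category].append(doc)
    return dict(grouped)


docs = [
    CleanedDocument("1", "articles", "How to chunk text."),
    CleanedDocument("2", "posts", "Shipping an LLM twin."),
    CleanedDocument("3", "articles", "Evaluating RAG."),
]
grouped = group_by_category(docs)
for category, items in grouped.items():
    print(category, [d.id for d in items])
```

Grouping up front lets each category be paired with its own prompt wording in the next step.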

Step 2: Create Prompts

Generate structured prompts from the cleaned documents using the appropriate DatasetGenerator subclass. Each document is formatted into a Jinja2 template that instructs GPT-4o-mini to produce either instruction-answer pairs or instruction-chosen-rejected triples. Prompts are token-counted and truncated if they exceed the model's context window.

Key considerations:

InstructionDatasetGenerator produces prompts requesting 5 instruction-answer pairs per document
PreferenceDatasetGenerator produces prompts requesting 5 instruction-chosen-rejected triples
Token counting uses tiktoken to ensure prompts fit within the OpenAI model context window
Prompts are grouped by DataCategory for batched processing
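
A minimal sketch of prompt rendering with a token budget follows. To stay dependency-free it swaps in string.Template for Jinja2 and whitespace splitting for tiktoken; the prompt wording, the 2000-token budget, and the function names are all assumptions. It also clips only the document extract rather than the whole prompt, which is one possible reading of the truncation described above.

```python
from string import Template

# Hypothetical prompt text; string.Template stands in for a Jinja2 template.
PROMPT_TEMPLATE = Template(
    "Based on the extract below, generate $n instruction-answer pairs as a JSON list.\n"
    "Extract:\n$extract"
)


def truncate_to_budget(text, max_tokens, encode=str.split, decode=" ".join):
    """Clip text to at most max_tokens tokens.

    encode/decode default to whitespace tokenization, a crude stand-in for a
    real tokenizer; pass a tokenizer's encode/decode to count real tokens.
    """
    tokens = encode(text)
    if len(tokens) <= max_tokens:
        return text
    return decode(tokens[:max_tokens])


def build_prompt(extract, n_samples=5, max_extract_tokens=2000):
    """Render one prompt, clipping the extract so the prompt stays bounded."""
    clipped = truncate_to_budget(extract, max_extract_tokens)
    return PROMPT_TEMPLATE.substitute(n=n_samples, extract=clipped)


prompt = build_prompt("word " * 10, n_samples=5, max_extract_tokens=4)
print(prompt)
```

Keeping encode and decode as parameters means the same function can count real model tokens once a tokenizer is plugged in.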

Step 3: Generate Dataset Samples

Send the prompts to GPT-4o-mini via LangChain's ChatOpenAI interface and parse the JSON responses into typed dataset sample objects. Prompts are processed in batches of 24 for efficiency. A mock mode is available that returns predefined responses for development and testing without API calls.

Key considerations:

Uses LangChain's batch processing with a custom ListPydanticOutputParser
Failed JSON parses are logged and skipped rather than halting the pipeline
Mock mode uses FakeListLLM with predefined responses from the constants module
System prompt instructs the model to generate samples in the appropriate format
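
The batch-and-skip behaviour described above can be sketched with the standard library alone. Here call_model is a hypothetical callable (a list of prompts in, a list of raw strings out) standing in for the LangChain batch call, and fake_model plays the role of mock mode; neither name comes from the repository.

```python
import json
import logging

logger = logging.getLogger(__name__)


def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def parse_samples(raw_response):
    """Parse one model response into a list, or return [] if it is not valid JSON."""
    try:
        parsed = json.loads(raw_response)
    except json.JSONDecodeError:
        logger.warning("Skipping response that is not valid JSON")
        return []
    return parsed if isinstance(parsed, list) else [parsed]


def generate(prompts, call_model, batch_size=24):
    """Run prompts through call_model batch by batch and collect parsed samples."""
    samples = []
    for batch in batched(prompts, batch_size):
        for raw in call_model(batch):
            samples.extend(parse_samples(raw))
    return samples


def fake_model(batch):
    # Mock responses for the demo; every third response is deliberately malformed.
    good = '[{"instruction": "Q", "answer": "A"}]'
    return [good if i % 3 else "not json" for i in range(len(batch))]


results = generate([f"prompt {i}" for i in range(50)], fake_model)
print(len(results))
```

Because a malformed response yields an empty list instead of raising, one bad generation costs only its own samples and the rest of the batch still lands in the dataset.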

Step 4: Post-process and Split

Apply dataset-type-specific post-processing and create train/test splits. For instruction datasets, this involves a simple random split. For preference datasets, additional filtering removes short answers and answers with incorrect formatting before splitting.

Key considerations:

Instruction datasets are split with a configurable test_size ratio (default 10%)
Preference datasets undergo two extra filtering passes before splitting
Random state is fixed (seed=42) for reproducible splits
The result is a TrainTestSplit domain object containing both splits
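
The sketch below illustrates filtering followed by a seeded split. The specific filter rules (a minimum length of 10 characters, an uppercase first letter, closing punctuation, applied to the chosen answer) are assumptions, since the exact criteria are not spelled out here, and random.Random is used as a dependency-free stand-in for whatever splitting utility the pipeline actually calls.

```python
import random


def is_well_formed(answer, min_length=10):
    """Hypothetical quality check: long enough, starts uppercase, ends with punctuation."""
    text = answer.strip()
    return bool(text) and len(text) >= min_length and text[0].isupper() and text[-1] in ".!?"


def filter_preference_samples(samples):
    """Keep triples whose chosen answer passes the quality check."""
    return [s for s in samples if is_well_formed(s["chosen"])]


def split_train_test(samples, test_size=0.1, seed=42):
    """Shuffle a copy with a fixed seed, then hold out the last test_size fraction."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_size)) if shuffled else 0
    cut = len(shuffled) - n_test
    return {"train": shuffled[:cut], "test": shuffled[cut:]}


samples = [
    {"instruction": f"Q{i}", "chosen": f"This is a complete answer number {i}.", "rejected": "no"}
    for i in range(20)
]
samples.append({"instruction": "Q", "chosen": "too short", "rejected": "no"})

kept = filter_preference_samples(samples)
split = split_train_test(kept)
print(len(split["train"]), len(split["test"]))
```

Fixing the seed means repeated runs over the same filtered samples produce the same train and test membership, which keeps evaluation numbers comparable across runs.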

Step 5: Push to HuggingFace Hub

Optionally upload the generated train and test dataset splits to HuggingFace Hub. The datasets are converted from domain models to HuggingFace Dataset objects and pushed to the configured repository.

Key considerations:

Push is controlled by the push_to_huggingface pipeline parameter
Dataset ID on HuggingFace is configurable (defaults to llmtwin for instruction datasets or llmtwin-dpo for preference datasets)
Requires a valid HUGGINGFACE_ACCESS_TOKEN in environment configuration
Both train and test splits are uploaded as separate dataset splits
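
A hedged sketch of the optional upload follows. It converts row-style records into the column-oriented mapping that Dataset.from_dict accepts and only imports the datasets package when a push is actually requested. The function names, the repository id "your-username/llmtwin", and the plain-dict record shape are assumptions for illustration.

```python
import os


def to_columns(records):
    """Convert a list of row dicts into a column-oriented mapping."""
    if not records:
        return {}
    return {key: [row[key] for row in records] for key in records[0]}


def publish(split, dataset_id, push_to_huggingface=False):
    """Build columnar splits and, only when asked, push them to the Hub."""
    columns = {name: to_columns(rows) for name, rows in split.items()}
    if push_to_huggingface:
        # Imported lazily so the dry-run path works without the `datasets` package.
        from datasets import Dataset, DatasetDict

        token = os.environ["HUGGINGFACE_ACCESS_TOKEN"]
        dataset = DatasetDict(
            {name: Dataset.from_dict(cols) for name, cols in columns.items()}
        )
        dataset.push_to_hub(dataset_id, token=token)
    return columns


split = {
    "train": [
        {"instruction": "Q1", "answer": "A1"},
        {"instruction": "Q2", "answer": "A2"},
    ],
    "test": [{"instruction": "Q3", "answer": "A3"}],
}
# Dry run: nothing is uploaded unless push_to_huggingface=True.
columns = publish(split, "your-username/llmtwin")
print(columns["train"])
```

Wrapping both splits in one DatasetDict uploads them to the same repository as named splits, which matches the train/test layout described above.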
