Implementation:PacktPublishing LLM Engineers Handbook DatasetGenerator Get Prompts
| Aspect | Detail |
|---|---|
| API | `DatasetGenerator.get_prompts(cls, documents: list[CleanedDocument]) -> dict[DataCategory, list[GenerateDatasetSamplesPrompt]]` |
| Source | `llm_engineering/application/dataset/generation.py:L51-91` |
| Type | API Doc |
| Implements | Principle:PacktPublishing_LLM_Engineers_Handbook_Prompt_Engineering_For_Dataset_Generation |
Summary
The get_prompts class method on DatasetGenerator (and its subclasses InstructionDatasetGenerator and PreferenceDatasetGenerator) transforms a list of cleaned documents into a dictionary of prompts grouped by data category. It first chunks documents into appropriately-sized extracts, groups them by category, then generates a GenerateDatasetSamplesPrompt for each document chunk. These prompts are ready to be fed to an LLM for synthetic dataset generation.
Source Code
```python
@classmethod
def get_prompts(cls, documents: list[CleanedDocument]) -> dict[DataCategory, list[GenerateDatasetSamplesPrompt]]:
    documents = generation_utils.extract_substrings(documents)

    grouped_prompts = {}
    grouped_cleaned_documents = CleanedDocument.group_by_category(documents)
    for category, category_documents in grouped_cleaned_documents.items():
        category_prompts = [cls.get_prompt(document) for document in category_documents]
        grouped_prompts[category] = category_prompts

    return grouped_prompts
```
Import
```python
from llm_engineering.application.dataset.generation import InstructionDatasetGenerator, PreferenceDatasetGenerator
```
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `documents` | `list[CleanedDocument]` | (required) | List of cleaned documents retrieved from the Qdrant feature store. These include `CleanedArticle`, `CleanedPost`, and `CleanedRepositoryDocument` instances. |
Return Value
| Type | Description |
|---|---|
| `dict[DataCategory, list[GenerateDatasetSamplesPrompt]]` | A dictionary mapping each `DataCategory` (e.g., ARTICLES, POSTS, REPOSITORIES) to a list of prompt objects. Each prompt encapsulates the formatted text, few-shot examples, and output schema instructions. |
Behavior
The method executes the following steps:
- Document chunking -- Calls `generation_utils.extract_substrings(documents)` to split long documents into smaller extracts that fit within the LLM's token window. This uses tiktoken for accurate token counting.
- Category grouping -- Calls `CleanedDocument.group_by_category(documents)` to partition documents into groups keyed by `DataCategory` (articles, posts, repositories).
- Prompt generation -- For each category, iterates over its documents and calls `cls.get_prompt(document)` to construct a `GenerateDatasetSamplesPrompt`. The specific prompt template depends on the subclass:
  - `InstructionDatasetGenerator.get_prompt()` produces prompts targeting instruction-answer pairs
  - `PreferenceDatasetGenerator.get_prompt()` produces prompts targeting instruction-rejected-chosen triples
- Grouping -- Collects all prompts into the `grouped_prompts` dictionary keyed by category.
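The steps above can be sketched end-to-end with simplified stand-in types. The class name, chunking heuristic, and prompt template below are illustrative assumptions, not the book's actual implementation:

```python
from collections import defaultdict
from dataclasses import dataclass

# Simplified stand-in for CleanedDocument (hypothetical; the real class is a pydantic model).
@dataclass
class Doc:
    category: str
    content: str

def extract_substrings(docs: list[Doc], max_chars: int = 50) -> list[Doc]:
    # Naive character-based chunking; the real code counts tokens with tiktoken.
    chunks = []
    for doc in docs:
        for i in range(0, len(doc.content), max_chars):
            chunks.append(Doc(doc.category, doc.content[i:i + max_chars]))
    return chunks

def get_prompts(docs: list[Doc]) -> dict[str, list[str]]:
    docs = extract_substrings(docs)  # 1. chunk long documents
    grouped: dict[str, list[str]] = defaultdict(list)
    for doc in docs:                 # 2. group chunks by category
        # 3. build one prompt per chunk (placeholder template)
        grouped[doc.category].append(f"Generate samples from: {doc.content}")
    return dict(grouped)

prompts = get_prompts([Doc("articles", "a" * 120), Doc("posts", "a short post")])
# "articles" yields 3 chunks (120 chars at 50 chars each), "posts" yields 1
```

The real method differs mainly in its types and in delegating prompt construction to `cls.get_prompt`, but the chunk-group-generate flow is the same.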
Internal Methods
The `get_prompt(document)` method (called per document) is responsible for:
- Composing the system instruction (role description, task specification)
- Inserting few-shot examples demonstrating desired output format
- Specifying the JSON output schema that matches the target pydantic model
- Embedding the document extract as the source material
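A minimal sketch of how such a prompt might be assembled from those four parts. The template strings, few-shot example, and schema below are hypothetical, not the book's actual prompt text:

```python
import json

# Hypothetical few-shot example and output schema for an instruction-answer prompt.
FEW_SHOT_EXAMPLE = {
    "instruction": "What is a feature store?",
    "answer": "A feature store centralizes feature computation and serving for ML systems.",
}
OUTPUT_SCHEMA = {"instruction": "str", "answer": "str"}

def build_prompt(extract: str) -> str:
    parts = [
        # System instruction: role description and task specification
        "You are an expert assistant that generates instruction-answer pairs from documents.",
        # Few-shot example demonstrating the desired output format
        "Example output:\n" + json.dumps(FEW_SHOT_EXAMPLE, indent=2),
        # JSON output schema matching the target pydantic model
        "Respond only with JSON matching this schema:\n" + json.dumps(OUTPUT_SCHEMA),
        # The document extract as source material
        "Source material:\n" + extract,
    ]
    return "\n\n".join(parts)

prompt = build_prompt("LLM twins mimic an author's writing style...")
```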
Usage Example
```python
from llm_engineering.application.dataset.generation import InstructionDatasetGenerator

# Assume documents were retrieved from the feature store
documents = [...]  # list of CleanedDocument instances

# Generate instruction dataset prompts
instruction_prompts = InstructionDatasetGenerator.get_prompts(documents)

for category, prompts in instruction_prompts.items():
    print(f"Category {category.value}: {len(prompts)} prompts generated")
```
External Dependencies
| Dependency | Purpose |
|---|---|
| langchain | Prompt template construction and formatting |
| tiktoken | Token counting for document chunking to respect LLM context limits |
| loguru | Structured logging of prompt generation progress |
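Token-aware chunking typically reserves a safety margin below the model's context limit so that the prompt template and the model's reply also fit. A rough sketch using a word count as a proxy for tokens; the limits and margin are assumed values, and the real code uses tiktoken rather than `str.split`:

```python
def chunk_by_budget(text: str, context_limit: int = 8192, safety_margin: float = 0.15) -> list[str]:
    # Reserve part of the context window for the prompt template and the model's reply.
    budget = int(context_limit * (1 - safety_margin))
    words = text.split()  # crude proxy for tokens; tiktoken gives exact counts
    return [" ".join(words[i:i + budget]) for i in range(0, len(words), budget)]

chunks = chunk_by_budget("word " * 30, context_limit=10, safety_margin=0.2)
# budget = 8 words, so 30 words split into 4 chunks
```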
Design Notes
- The method is a classmethod so that the `InstructionDatasetGenerator` and `PreferenceDatasetGenerator` subclasses can override `get_prompt()` while sharing the same overall orchestration logic.
- Document chunking happens before category grouping so that all chunks are properly categorized.
- Grouping by category enables category-specific batch processing in the downstream generation step, where different categories may benefit from different LLM parameters.
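This classmethod-plus-override design is the classic template-method pattern. A minimal sketch; the method names mirror the source, but the class names and prompt bodies are illustrative:

```python
class DatasetGeneratorSketch:
    # Shared orchestration: the base class defines the loop, subclasses supply get_prompt.
    @classmethod
    def get_prompts(cls, documents: list[str]) -> list[str]:
        return [cls.get_prompt(doc) for doc in documents]

    @classmethod
    def get_prompt(cls, document: str) -> str:
        raise NotImplementedError

class InstructionSketch(DatasetGeneratorSketch):
    @classmethod
    def get_prompt(cls, document: str) -> str:
        return f"Create instruction-answer pairs from: {document}"

class PreferenceSketch(DatasetGeneratorSketch):
    @classmethod
    def get_prompt(cls, document: str) -> str:
        return f"Create instruction-chosen-rejected triples from: {document}"

prompts = InstructionSketch.get_prompts(["doc A", "doc B"])
```

Because `get_prompts` is a classmethod, `cls.get_prompt` dispatches to whichever subclass the caller invoked, so the orchestration loop is written once.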
See Also
- Principle:PacktPublishing_LLM_Engineers_Handbook_Prompt_Engineering_For_Dataset_Generation -- The principle this implementation realizes
- Implementation:PacktPublishing_LLM_Engineers_Handbook_VectorBaseDocument_Bulk_Find -- The preceding step that retrieves documents
- Implementation:PacktPublishing_LLM_Engineers_Handbook_DatasetGenerator_Generate -- The next step that feeds these prompts to the LLM
- Heuristic:PacktPublishing_LLM_Engineers_Handbook_Token_Window_Safety_Margin