Implementation:PacktPublishing LLM Engineers Handbook DatasetGenerator Get Prompts
| Aspect | Detail |
|---|---|
| API | `DatasetGenerator.get_prompts(cls, documents: list[CleanedDocument]) -> dict[DataCategory, list[GenerateDatasetSamplesPrompt]]` |
| Source | `llm_engineering/application/dataset/generation.py:L51-91` |
| Type | API Doc |
| Implements | Principle:PacktPublishing_LLM_Engineers_Handbook_Prompt_Engineering_For_Dataset_Generation |
Summary
The get_prompts class method on DatasetGenerator (and its subclasses InstructionDatasetGenerator and PreferenceDatasetGenerator) transforms a list of cleaned documents into a dictionary of prompts grouped by data category. It first chunks documents into appropriately-sized extracts, groups them by category, then generates a GenerateDatasetSamplesPrompt for each document chunk. These prompts are ready to be fed to an LLM for synthetic dataset generation.
Source Code
```python
@classmethod
def get_prompts(cls, documents: list[CleanedDocument]) -> dict[DataCategory, list[GenerateDatasetSamplesPrompt]]:
    documents = generation_utils.extract_substrings(documents)

    grouped_prompts = {}
    grouped_cleaned_documents = CleanedDocument.group_by_category(documents)
    for category, category_documents in grouped_cleaned_documents.items():
        category_prompts = [cls.get_prompt(document) for document in category_documents]
        grouped_prompts[category] = category_prompts

    return grouped_prompts
```
Import
```python
from llm_engineering.application.dataset.generation import InstructionDatasetGenerator, PreferenceDatasetGenerator
```
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `documents` | `list[CleanedDocument]` | (required) | List of cleaned documents retrieved from the Qdrant feature store. These include `CleanedArticle`, `CleanedPost`, and `CleanedRepositoryDocument` instances. |
Return Value
| Type | Description |
|---|---|
| `dict[DataCategory, list[GenerateDatasetSamplesPrompt]]` | A dictionary mapping each `DataCategory` (e.g., ARTICLES, POSTS, REPOSITORIES) to a list of prompt objects. Each prompt encapsulates the formatted text, few-shot examples, and output schema instructions. |
Behavior
The method executes the following steps:
- Document chunking -- Calls `generation_utils.extract_substrings(documents)` to split long documents into smaller extracts that fit within the LLM's token window. This uses tiktoken for accurate token counting.
- Category grouping -- Calls `CleanedDocument.group_by_category(documents)` to partition documents into groups keyed by `DataCategory` (articles, posts, repositories).
- Prompt generation -- For each category, iterates over its documents and calls `cls.get_prompt(document)` to construct a `GenerateDatasetSamplesPrompt`. The specific prompt template depends on the subclass:
  - `InstructionDatasetGenerator.get_prompt()` produces prompts targeting instruction-answer pairs
  - `PreferenceDatasetGenerator.get_prompt()` produces prompts targeting instruction-rejected-chosen triples
- Grouping -- Collects all prompts into the `grouped_prompts` dictionary keyed by category.
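The steps above can be sketched end-to-end with simplified stand-in types. The class name, chunking heuristic, and prompt template below are illustrative assumptions, not the book's actual implementation:

```python
from collections import defaultdict
from dataclasses import dataclass

# Simplified stand-in for CleanedDocument (hypothetical; the real class is a pydantic model).
@dataclass
class Doc:
    category: str
    content: str

def extract_substrings(docs: list[Doc], max_chars: int = 50) -> list[Doc]:
    # Naive character-based chunking; the real code counts tokens with tiktoken.
    chunks = []
    for doc in docs:
        for i in range(0, len(doc.content), max_chars):
            chunks.append(Doc(doc.category, doc.content[i:i + max_chars]))
    return chunks

def get_prompts(docs: list[Doc]) -> dict[str, list[str]]:
    docs = extract_substrings(docs)  # 1. chunk long documents
    grouped: dict[str, list[str]] = defaultdict(list)
    for doc in docs:                 # 2. group chunks by category
        # 3. build one prompt per chunk (placeholder template)
        grouped[doc.category].append(f"Generate samples from: {doc.content}")
    return dict(grouped)

prompts = get_prompts([Doc("articles", "a" * 120), Doc("posts", "a short post")])
# "articles" yields 3 chunks (120 chars at 50 chars each), "posts" yields 1
```

The real method differs mainly in its types and in delegating prompt construction to `cls.get_prompt`, but the chunk-group-generate flow is the same.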
Internal Methods
The `get_prompt(document)` method (called per document) is responsible for:
- Composing the system instruction (role description, task specification)
- Inserting few-shot examples demonstrating desired output format
- Specifying the JSON output schema that matches the target pydantic model
- Embedding the document extract as the source material
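A minimal sketch of how such a prompt might be assembled from those four parts. The template strings, few-shot example, and schema below are hypothetical, not the book's actual prompt text:

```python
import json

# Hypothetical few-shot example and output schema for an instruction-answer prompt.
FEW_SHOT_EXAMPLE = {
    "instruction": "What is a feature store?",
    "answer": "A feature store centralizes feature computation and serving for ML systems.",
}
OUTPUT_SCHEMA = {"instruction": "str", "answer": "str"}

def build_prompt(extract: str) -> str:
    parts = [
        # System instruction: role description and task specification
        "You are an expert assistant that generates instruction-answer pairs from documents.",
        # Few-shot example demonstrating the desired output format
        "Example output:\n" + json.dumps(FEW_SHOT_EXAMPLE, indent=2),
        # JSON output schema matching the target pydantic model
        "Respond only with JSON matching this schema:\n" + json.dumps(OUTPUT_SCHEMA),
        # The document extract as source material
        "Source material:\n" + extract,
    ]
    return "\n\n".join(parts)

prompt = build_prompt("LLM twins mimic an author's writing style...")
```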
Usage Example
```python
from llm_engineering.application.dataset.generation import InstructionDatasetGenerator

# Assume documents were retrieved from the feature store
documents = [...]  # list of CleanedDocument instances

# Generate instruction dataset prompts
instruction_prompts = InstructionDatasetGenerator.get_prompts(documents)

for category, prompts in instruction_prompts.items():
    print(f"Category {category.value}: {len(prompts)} prompts generated")
```
External Dependencies
| Dependency | Purpose |
|---|---|
| langchain | Prompt template construction and formatting |
| tiktoken | Token counting for document chunking to respect LLM context limits |
| loguru | Structured logging of prompt generation progress |
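Token-aware chunking typically reserves a safety margin below the model's context limit so that the prompt template and the model's reply also fit. A rough sketch using a word count as a proxy for tokens; the limits and margin are assumed values, and the real code uses tiktoken rather than `str.split`:

```python
def chunk_by_budget(text: str, context_limit: int = 8192, safety_margin: float = 0.15) -> list[str]:
    # Reserve part of the context window for the prompt template and the model's reply.
    budget = int(context_limit * (1 - safety_margin))
    words = text.split()  # crude proxy for tokens; tiktoken gives exact counts
    return [" ".join(words[i:i + budget]) for i in range(0, len(words), budget)]

chunks = chunk_by_budget("word " * 30, context_limit=10, safety_margin=0.2)
# budget = 8 words, so 30 words split into 4 chunks
```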
Design Notes
- The method is a classmethod so that the `InstructionDatasetGenerator` and `PreferenceDatasetGenerator` subclasses can override `get_prompt()` while sharing the same overall orchestration logic.
- Document chunking happens before category grouping so that all chunks are properly categorized.
- Grouping by category enables category-specific batch processing in the downstream generation step, where different categories may benefit from different LLM parameters.
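This classmethod-plus-override design is the classic template-method pattern. A minimal sketch; the method names mirror the source, but the class names and prompt bodies are illustrative:

```python
class DatasetGeneratorSketch:
    # Shared orchestration: the base class defines the loop, subclasses supply get_prompt.
    @classmethod
    def get_prompts(cls, documents: list[str]) -> list[str]:
        return [cls.get_prompt(doc) for doc in documents]

    @classmethod
    def get_prompt(cls, document: str) -> str:
        raise NotImplementedError

class InstructionSketch(DatasetGeneratorSketch):
    @classmethod
    def get_prompt(cls, document: str) -> str:
        return f"Create instruction-answer pairs from: {document}"

class PreferenceSketch(DatasetGeneratorSketch):
    @classmethod
    def get_prompt(cls, document: str) -> str:
        return f"Create instruction-chosen-rejected triples from: {document}"

prompts = InstructionSketch.get_prompts(["doc A", "doc B"])
```

Because `get_prompts` is a classmethod, `cls.get_prompt` dispatches to whichever subclass the caller invoked, so the orchestration loop is written once.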
See Also
- Principle:PacktPublishing_LLM_Engineers_Handbook_Prompt_Engineering_For_Dataset_Generation -- The principle this implementation realizes
- Implementation:PacktPublishing_LLM_Engineers_Handbook_VectorBaseDocument_Bulk_Find -- The preceding step that retrieves documents
- Implementation:PacktPublishing_LLM_Engineers_Handbook_DatasetGenerator_Generate -- The next step that feeds these prompts to the LLM
- Heuristic:PacktPublishing_LLM_Engineers_Handbook_Token_Window_Safety_Margin