

Implementation:PacktPublishing LLM Engineers Handbook DatasetGenerator Get Prompts

From Leeroopedia


Aspect Detail
API DatasetGenerator.get_prompts(cls, documents: list[CleanedDocument]) -> dict[DataCategory, list[GenerateDatasetSamplesPrompt]]
Source llm_engineering/application/dataset/generation.py:L51-91
Type API Doc
Implements Principle:PacktPublishing_LLM_Engineers_Handbook_Prompt_Engineering_For_Dataset_Generation

Summary

The get_prompts class method on DatasetGenerator (and its subclasses InstructionDatasetGenerator and PreferenceDatasetGenerator) transforms a list of cleaned documents into a dictionary of prompts grouped by data category. It first chunks the documents into appropriately sized extracts, groups the extracts by category, and then generates a GenerateDatasetSamplesPrompt for each chunk. The resulting prompts are ready to be fed to an LLM for synthetic dataset generation.

Source Code

@classmethod
def get_prompts(cls, documents: list[CleanedDocument]) -> dict[DataCategory, list[GenerateDatasetSamplesPrompt]]:
    documents = generation_utils.extract_substrings(documents)

    grouped_prompts = {}
    grouped_cleaned_documents = CleanedDocument.group_by_category(documents)

    for category, category_documents in grouped_cleaned_documents.items():
        category_prompts = [cls.get_prompt(document) for document in category_documents]
        grouped_prompts[category] = category_prompts

    return grouped_prompts
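The group_by_category helper called above can be sketched as follows. This is an illustrative stand-in, not the repository's actual domain model: the CleanedDocument and DataCategory definitions here are simplified assumptions, and only the grouping logic mirrors what the source code relies on.

```python
# Hypothetical sketch of CleanedDocument.group_by_category.
# The real domain classes live in llm_engineering.domain; these are
# simplified stand-ins to illustrate the partitioning step.
from collections import defaultdict
from enum import Enum


class DataCategory(Enum):
    ARTICLES = "articles"
    POSTS = "posts"
    REPOSITORIES = "repositories"


class CleanedDocument:
    def __init__(self, content: str, category: DataCategory) -> None:
        self.content = content
        self.category = category

    @classmethod
    def group_by_category(
        cls, documents: list["CleanedDocument"]
    ) -> dict[DataCategory, list["CleanedDocument"]]:
        # Partition documents into lists keyed by their category.
        grouped: dict[DataCategory, list[CleanedDocument]] = defaultdict(list)
        for document in documents:
            grouped[document.category].append(document)
        return dict(grouped)
```

Because the result is keyed by DataCategory, get_prompts can build one prompt list per category in a single pass.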

Import

from llm_engineering.application.dataset.generation import InstructionDatasetGenerator, PreferenceDatasetGenerator

Parameters

Parameter Type Default Description
documents list[CleanedDocument] (required) List of cleaned documents retrieved from the Qdrant feature store. These include CleanedArticle, CleanedPost, and CleanedRepositoryDocument instances.

Return Value

Type Description
dict[DataCategory, list[GenerateDatasetSamplesPrompt]] A dictionary mapping each DataCategory (e.g., ARTICLES, POSTS, REPOSITORIES) to a list of prompt objects. Each prompt encapsulates the formatted text, few-shot examples, and output schema instructions.

Behavior

The method executes the following steps:

  1. Document chunking -- Calls generation_utils.extract_substrings(documents) to split long documents into smaller extracts that fit within the LLM's token window. This uses tiktoken for accurate token counting.
  2. Category grouping -- Calls CleanedDocument.group_by_category(documents) to partition documents into groups keyed by DataCategory (articles, posts, repositories).
  3. Prompt generation -- For each category, iterates over its documents and calls cls.get_prompt(document) to construct a GenerateDatasetSamplesPrompt. The specific prompt template depends on the subclass:
    • InstructionDatasetGenerator.get_prompt() produces prompts targeting instruction-answer pairs
    • PreferenceDatasetGenerator.get_prompt() produces prompts targeting instruction-rejected-chosen triples
  4. Grouping -- Collects all prompts into the grouped_prompts dictionary keyed by category.
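The chunking in step 1 can be sketched as below. The real extract_substrings uses tiktoken for exact token counting; this dependency-free sketch substitutes a whitespace word count as a proxy, and the function signature and max_tokens default are assumptions for illustration.

```python
# Illustrative sketch of the chunking step. The actual implementation
# counts tokens with tiktoken; a whitespace word count stands in here
# so the example has no external dependencies.
def extract_substrings(texts: list[str], max_tokens: int = 512) -> list[str]:
    """Split each text into extracts of at most max_tokens (proxy) tokens."""
    extracts: list[str] = []
    for text in texts:
        words = text.split()
        # Emit consecutive windows of max_tokens words each.
        for start in range(0, len(words), max_tokens):
            extracts.append(" ".join(words[start : start + max_tokens]))
    return extracts
```

Chunking before grouping (see Design Notes) means every extract, not just every source document, ends up under the right category.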

Internal Methods

The get_prompt(document) method (called per document) is responsible for:

  • Composing the system instruction (role description, task specification)
  • Inserting few-shot examples demonstrating desired output format
  • Specifying the JSON output schema that matches the target pydantic model
  • Embedding the document extract as the source material
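A minimal sketch of how get_prompt might assemble these four pieces is shown below. The template wording, few-shot examples, and schema here are placeholders invented for illustration, not the book's actual prompt text.

```python
# Hypothetical composition of a dataset-generation prompt: system
# instruction, few-shot examples, JSON output schema, and the document
# extract. All literal strings are illustrative placeholders.
import json

PROMPT_TEMPLATE = (
    "You are an assistant that generates instruction-answer pairs.\n"
    "Follow the format of the examples below.\n\n"
    "Examples:\n{examples}\n\n"
    "Return a JSON object matching this schema:\n{schema}\n\n"
    "Source document:\n{extract}\n"
)

FEW_SHOT_EXAMPLES = [
    {"instruction": "Summarize the article.", "answer": "..."},
]

OUTPUT_SCHEMA = {"instruction": "string", "answer": "string"}


def get_prompt(extract: str) -> str:
    # Interpolate examples, schema, and the extract into the template.
    return PROMPT_TEMPLATE.format(
        examples=json.dumps(FEW_SHOT_EXAMPLES, indent=2),
        schema=json.dumps(OUTPUT_SCHEMA, indent=2),
        extract=extract,
    )
```

Embedding the schema in the prompt lets the downstream parsing step validate the LLM's output against the matching pydantic model.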

Usage Example

from llm_engineering.application.dataset.generation import InstructionDatasetGenerator
from llm_engineering.domain.cleaned_documents import CleanedArticle

# Assume documents were retrieved from the feature store
documents = [...]  # list of CleanedDocument instances

# Generate instruction dataset prompts
instruction_prompts = InstructionDatasetGenerator.get_prompts(documents)

for category, prompts in instruction_prompts.items():
    print(f"Category {category.value}: {len(prompts)} prompts generated")

External Dependencies

Dependency Purpose
langchain Prompt template construction and formatting
tiktoken Token counting for document chunking to respect LLM context limits
loguru Structured logging of prompt generation progress

Design Notes

  • The method is a classmethod to allow InstructionDatasetGenerator and PreferenceDatasetGenerator subclasses to override get_prompt() while sharing the same overall orchestration logic.
  • Document chunking happens before category grouping so that all chunks are properly categorized.
  • The grouping by category enables category-specific batch processing in the downstream generation step, where different categories may benefit from different LLM parameters.
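The first design note describes a template-method pattern: shared orchestration in the base classmethod, per-subclass prompt construction in get_prompt. A minimal sketch of that pattern, with the class names from the source but deliberately simplified method bodies, looks like this:

```python
# Simplified sketch of the shared-orchestration pattern described above.
# Class names mirror the source; the bodies are illustrative stand-ins.
class DatasetGenerator:
    @classmethod
    def get_prompt(cls, document: str) -> str:
        # Each subclass supplies its own prompt template.
        raise NotImplementedError

    @classmethod
    def get_prompts(cls, documents: list[str]) -> list[str]:
        # Shared orchestration: delegates per-document work to cls.get_prompt,
        # which resolves to the subclass override at call time.
        return [cls.get_prompt(document) for document in documents]


class InstructionDatasetGenerator(DatasetGenerator):
    @classmethod
    def get_prompt(cls, document: str) -> str:
        return f"Generate instruction-answer pairs from: {document}"


class PreferenceDatasetGenerator(DatasetGenerator):
    @classmethod
    def get_prompt(cls, document: str) -> str:
        return f"Generate instruction-rejected-chosen triples from: {document}"
```

Because get_prompts receives cls rather than a fixed class, calling it on either subclass runs the same loop with that subclass's get_prompt.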
