Principle:Run llama Llama index Training Data Collection

From Leeroopedia

Overview

Training Data Collection is the foundational step in the LLM finetuning workflow. Before a model can be finetuned on domain-specific knowledge, a representative dataset of query-response pairs must be captured from actual LLM interactions. LlamaIndex provides a callback-based mechanism that transparently intercepts all messages exchanged with the LLM during normal application usage -- including system prompts, user queries, and assistant responses -- and accumulates them into structured training examples.

The core idea is passive collection: rather than manually crafting training examples, you instrument your existing RAG pipeline with a callback handler that listens to every LLM event. As your application processes queries against your knowledge base, the handler silently records each conversation turn. This approach ensures the training data reflects real usage patterns and the actual response quality of the teacher model (e.g., GPT-4), which can then be distilled into a smaller, cheaper student model (e.g., GPT-3.5-turbo).

Callback-Based Event Capture

LlamaIndex's callback system follows the observer pattern. A callback handler is registered with a CallbackManager (set globally through Settings.callback_manager) and is then notified of every LLM event in the pipeline:

  • on_event_start: Fired when an LLM call begins, carrying the input messages (system prompt, user query, prior conversation context)
  • on_event_end: Fired when the LLM responds, carrying the assistant's response message

The handler maintains an internal dictionary keyed by event ID, accumulating input messages on start and appending the response on end. This event-based architecture means data collection is:

  • Non-intrusive: No changes to query logic or response generation
  • Comprehensive: Captures every LLM interaction including sub-queries in complex pipelines
  • Structured: Each event is a complete conversation with messages and response already separated
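The bookkeeping described above can be illustrated with a minimal, dependency-free sketch. ToyCollectionHandler and its simplified method signatures are hypothetical stand-ins written for this page; they are not LlamaIndex's actual handler class or callback signatures, and only the idea of a dictionary keyed by event ID is taken from the description above.

```python
from typing import Dict, List

Message = Dict[str, str]  # e.g. {"role": "user", "content": "..."}


class ToyCollectionHandler:
    """Hypothetical observer that pairs LLM inputs with responses by event ID."""

    def __init__(self) -> None:
        self._pending: Dict[str, List[Message]] = {}   # started, not yet answered
        self._finished: Dict[str, List[Message]] = {}  # complete conversations

    def on_event_start(self, event_id: str, messages: List[Message]) -> None:
        # Store a copy of the input messages when the LLM call begins.
        self._pending[event_id] = list(messages)

    def on_event_end(self, event_id: str, response: Message) -> None:
        # Pair the response with its inputs; ignore unknown event IDs.
        messages = self._pending.pop(event_id, None)
        if messages is None:
            return
        self._finished[event_id] = messages + [response]

    def get_examples(self) -> List[List[Message]]:
        # Only conversations that received a response become examples.
        return list(self._finished.values())


handler = ToyCollectionHandler()
handler.on_event_start(
    "evt-1",
    [
        {"role": "system", "content": "Answer from the provided context."},
        {"role": "user", "content": "What is RAG?"},
    ],
)
handler.on_event_end(
    "evt-1", {"role": "assistant", "content": "Retrieval-augmented generation."}
)
# An event that starts but never ends is not turned into an example.
handler.on_event_start("evt-2", [{"role": "user", "content": "Unanswered"}])
examples = handler.get_examples()
```

Keeping pending and finished events in separate dictionaries is a design choice of this sketch, so that a call that fails before producing a response never leaks into the training set.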

Converting to Training Format

OpenAI's finetuning API expects training data in JSONL format (JSON Lines), where each line is a self-contained JSON object with a messages array:

{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}

The conversion process involves:

  • Message serialization: Converting LlamaIndex's internal ChatMessage objects into OpenAI-compatible dictionary format using to_openai_message_dicts
  • Function call handling: If the LLM interaction involved function/tool calls, these are preserved in the training data so the finetuned model can learn tool-use patterns
  • File output: Writing all accumulated examples to a single .jsonl file with one training example per line
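As a rough illustration of the serialization and file-output steps, the sketch below turns in-memory conversations into JSONL text using only the standard library. SimpleMessage, to_message_dicts, and to_jsonl are hypothetical names invented for this example; they stand in for LlamaIndex's ChatMessage and to_openai_message_dicts rather than reproducing them, and function-call handling is left out.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SimpleMessage:
    """Hypothetical stand-in for a chat message object."""
    role: str
    content: str


def to_message_dicts(messages: List[SimpleMessage]) -> List[Dict[str, str]]:
    # Convert message objects into plain role/content dictionaries.
    return [{"role": m.role, "content": m.content} for m in messages]


def to_jsonl(examples: List[List[SimpleMessage]]) -> str:
    # One JSON object per line, each with a single "messages" array.
    lines = [json.dumps({"messages": to_message_dicts(ex)}) for ex in examples]
    return "".join(line + "\n" for line in lines)


examples = [
    [
        SimpleMessage("system", "Answer from the provided context."),
        SimpleMessage("user", "What is RAG?"),
        SimpleMessage("assistant", "Retrieval-augmented generation."),
    ],
    [
        SimpleMessage("user", "What is JSONL?"),
        SimpleMessage("assistant", "One JSON object per line."),
    ],
]
jsonl_text = to_jsonl(examples)
```

Writing jsonl_text to disk with an ordinary open(path, "w") call would give a file in the shape shown earlier, one training example per line.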

Workflow Integration

A typical finetuning data collection workflow proceeds as follows:

Step | Action                                  | Component
1    | Create the finetuning handler           | OpenAIFineTuningHandler
2    | Register handler with callback manager  | CallbackManager([handler])
3    | Assign callback manager to Settings     | Settings.callback_manager
4    | Run queries through your RAG pipeline   | Normal query engine usage
5    | Save collected events to JSONL          | handler.save_finetuning_events(path)
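The same steps in code, where questions and query_engine stand for your own list of query strings and an already-built query engine:

from llama_index.finetuning import OpenAIFineTuningHandler
from llama_index.core.callbacks import CallbackManager
from llama_index.core import Settings

# Steps 1-3: create the handler and install it globally via Settings
finetuning_handler = OpenAIFineTuningHandler()
callback_manager = CallbackManager([finetuning_handler])
Settings.callback_manager = callback_manager

# Step 4: run queries as usual; each LLM call is recorded by the handler
# (questions and query_engine are placeholders defined elsewhere in your application)
for question in questions:
    response = query_engine.query(question)

# Step 5: write the collected conversations to a JSONL file
finetuning_handler.save_finetuning_events("training_data.jsonl")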

Key Considerations

  • Teacher model quality: The collected responses are only as good as the model generating them. Use the highest-quality model available (e.g., GPT-4) as the teacher
  • Query diversity: Ensure the questions cover the full range of expected user queries to avoid overfitting on narrow patterns
  • Volume requirements: OpenAI requires at least 10 examples and reports that clear improvements typically appear with 50-100 examples
  • Function calls: If your pipeline uses tool/function calling, the handler preserves these in the training data, enabling the finetuned model to learn tool usage patterns
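Some of these considerations can be checked mechanically before uploading a file. The sketch below is one possible sanity check, not part of LlamaIndex or the OpenAI tooling: check_training_lines is a hypothetical helper, the default threshold of 10 comes from the volume guidance above, and the rule that every example should contain an assistant message is an assumption made for this example.

```python
import json
from typing import Iterable, List


def check_training_lines(lines: Iterable[str], min_examples: int = 10) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems: List[str] = []
    count = 0
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        count += 1
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {number}: not valid JSON")
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append(f"line {number}: missing or empty 'messages' array")
            continue
        roles = [m.get("role") for m in messages if isinstance(m, dict)]
        if "assistant" not in roles:
            problems.append(f"line {number}: no assistant message to learn from")
    if count < min_examples:
        problems.append(f"only {count} examples; expected at least {min_examples}")
    return problems


good_line = json.dumps(
    {"messages": [{"role": "user", "content": "q"}, {"role": "assistant", "content": "a"}]}
)
ok_report = check_training_lines([good_line] * 10)
small_report = check_training_lines([good_line] * 2)
bad_report = check_training_lines(["{not json"] + [good_line] * 10)
```

In practice the lines would come from iterating over the saved file, for example open("training_data.jsonl"). Query diversity and teacher quality still need human review; a check like this only catches structural problems.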
