Implementation:Hiyouga LLaMA Factory Data Converter
| Knowledge Sources | |
|---|---|
| Domains | Data Processing, Dataset Normalization |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
Concrete dataset format converters, provided by LLaMA Factory, that normalize Alpaca-, ShareGPT-, and OpenAI-format datasets into a unified schema.
Description
This module defines an abstract DatasetConverter base class and three concrete implementations that transform diverse dataset formats into a standardized internal representation. The converters handle:
- AlpacaDatasetConverter -- Converts Alpaca-format datasets with prompt/query/response/history fields
- SharegptDatasetConverter -- Converts ShareGPT-format multi-turn conversation datasets with role-tagged messages
- OpenAIDatasetConverter -- Converts OpenAI-format datasets with tool_calls, tool responses, and thinking mode annotations
Each converter maps input fields to a standard schema: _prompt (conversation history), _response (model output), _system (system prompt), _tools (tool definitions), _images, _videos, and _audios (multimodal inputs). The converters also support pairwise (DPO/ranking) and KTO training formats by encoding chosen/rejected responses appropriately.
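As a rough illustration of the target schema, the sketch below converts one Alpaca-style record by hand. It is not the library's implementation: the helper name `to_unified` and the input keys `instruction`, `input`, `output`, `history`, `system`, and `tools` are assumptions chosen for the example (the real converters read column names from `dataset_attr`), and the multimodal fields are simply set to None.

```python
from typing import Any


def to_unified(example: dict[str, Any]) -> dict[str, Any]:
    """Hand-rolled illustration of the unified schema (hypothetical helper)."""
    prompt: list[dict[str, str]] = []
    # Earlier turns, if any, are assumed to be [user_text, assistant_text] pairs.
    for old_user, old_assistant in example.get("history") or []:
        prompt.append({"role": "user", "content": old_user})
        prompt.append({"role": "assistant", "content": old_assistant})

    # Join the instruction and the optional input into the final user turn.
    parts = [example.get("instruction", ""), example.get("input", "")]
    query = "\n".join(part for part in parts if part)
    prompt.append({"role": "user", "content": query})

    return {
        "_prompt": prompt,
        "_response": [{"role": "assistant", "content": example["output"]}],
        "_system": example.get("system", ""),
        "_tools": example.get("tools", ""),
        "_images": None,
        "_videos": None,
        "_audios": None,
    }


record = {
    "instruction": "Translate to French.",
    "input": "Good morning",
    "output": "Bonjour",
    "history": [["Hi", "Hello! How can I help?"]],
}
unified = to_unified(record)
print(unified["_prompt"][-1]["content"])
```

In this sketch the history pair becomes two leading messages in `_prompt`, and the current instruction is always the last user message.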
A registry pattern built on DATASET_CONVERTERS, register_dataset_converter, and get_dataset_converter lets additional converters be registered and looked up by name.
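The registry idea can be sketched in a few lines of plain Python. The names `CONVERTERS`, `register`, and `lookup` below are hypothetical stand-ins, not the module's own functions, and the sketch stores plain callables rather than converter classes. Rejecting duplicate names is a design choice made for this sketch, not a documented behavior of the module.

```python
from typing import Any, Callable

Converter = Callable[[dict[str, Any]], dict[str, Any]]

# Hypothetical stand-in for a module-level registry keyed by format name.
CONVERTERS: dict[str, Converter] = {}


def register(name: str, converter: Converter) -> None:
    """Add a converter under a unique name; refuse silent overwrites."""
    if name in CONVERTERS:
        raise ValueError(f"Converter {name} already exists.")
    CONVERTERS[name] = converter


def lookup(name: str) -> Converter:
    """Fetch a converter by name, failing loudly on unknown formats."""
    if name not in CONVERTERS:
        raise ValueError(f"Converter {name} not found.")
    return CONVERTERS[name]


def passthrough(example: dict[str, Any]) -> dict[str, Any]:
    """Trivial converter that returns a shallow copy of the example."""
    return dict(example)


register("passthrough", passthrough)
converted = lookup("passthrough")({"text": "hello"})
```

Keeping lookup behind a function rather than indexing the dict directly gives one place to produce a clear error for an unknown format name.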
Usage
Converters are applied during the dataset loading phase via the align_dataset function, which calls dataset.map(converter) to transform every example in a loaded dataset. The converter to use is determined by the formatting field in the dataset configuration.
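To make the per-example mapping concrete without depending on the `datasets` package, the sketch below applies a hand-written ShareGPT-style converter to a plain list that stands in for the loaded dataset. The input keys `conversations`, `from`, and `value`, the tags `human` and `gpt`, and the function name `convert_sharegpt_like` are assumptions for illustration. The real converter takes its column and tag names from `dataset_attr`.

```python
from typing import Any

# Assumed ShareGPT-style speaker tags mapped to unified role names.
ROLE_MAP = {"human": "user", "gpt": "assistant"}


def convert_sharegpt_like(example: dict[str, Any]) -> dict[str, Any]:
    """Split role-tagged turns into history (_prompt) and final reply (_response)."""
    messages = [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in example["conversations"]
    ]
    return {
        "_prompt": messages[:-1],    # everything before the last turn
        "_response": messages[-1:],  # the last turn, kept as a one-item list
        "_system": example.get("system", ""),
        "_tools": example.get("tools", ""),
        "_images": None,
        "_videos": None,
        "_audios": None,
    }


raw_rows = [
    {"conversations": [
        {"from": "human", "value": "What is 2 + 2?"},
        {"from": "gpt", "value": "4"},
    ]},
    {"conversations": [
        {"from": "human", "value": "Name a prime."},
        {"from": "gpt", "value": "7"},
        {"from": "human", "value": "Another?"},
        {"from": "gpt", "value": "11"},
    ]},
]

# Stand-in for dataset.map(converter): apply the converter to every row.
aligned_rows = [convert_sharegpt_like(row) for row in raw_rows]
```

Because the converter is a callable that takes one example and returns one dict, the same object can be passed to a per-example map on a real dataset.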
Code Reference
Source Location
- Repository: Hiyouga_LLaMA_Factory
- File: src/llamafactory/data/converter.py
- Lines: 1-425
Signature
@dataclass
class DatasetConverter:
    dataset_attr: "DatasetAttr"
    data_args: "DataArguments"

    def _find_medias(self, medias: Union["MediaType", list["MediaType"], None]) -> list["MediaType"] | None: ...

    @abstractmethod
    def __call__(self, example: dict[str, Any]) -> dict[str, Any]: ...


@dataclass
class AlpacaDatasetConverter(DatasetConverter):
    def __call__(self, example: dict[str, Any]) -> dict[str, Any]: ...


@dataclass
class SharegptDatasetConverter(DatasetConverter):
    def __call__(self, example: dict[str, Any]) -> dict[str, Any]: ...


@dataclass
class OpenAIDatasetConverter(DatasetConverter):
    def __call__(self, example: dict[str, Any]) -> dict[str, Any]: ...


def align_dataset(
    dataset: Union["Dataset", "IterableDataset"],
    dataset_attr: "DatasetAttr",
    data_args: "DataArguments",
    training_args: "Seq2SeqTrainingArguments",
) -> Union["Dataset", "IterableDataset"]: ...
Import
from llamafactory.data.converter import (
    AlpacaDatasetConverter,
    SharegptDatasetConverter,
    OpenAIDatasetConverter,
    align_dataset,
    register_dataset_converter,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| example | dict[str, Any] | Yes | A single dataset example in its original format (Alpaca, ShareGPT, or OpenAI) |
| dataset_attr | DatasetAttr | Yes | Dataset metadata defining field mappings (column names for prompt, response, system, etc.) |
| data_args | DataArguments | Yes | Data arguments including media_dir for resolving media file paths |
Outputs
| Name | Type | Description |
|---|---|---|
| _prompt | list[dict[str, str]] | Conversation history as role/content message dicts |
| _response | list[dict[str, str]] | Model response(s); multiple entries for pairwise/KTO data |
| _system | str | System prompt text |
| _tools | str | Tool definitions JSON string |
| _images | list[str] or None | Paths or URLs of image inputs |
| _videos | list[str] or None | Paths or URLs of video inputs |
| _audios | list[str] or None | Paths or URLs of audio inputs |
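The output contract above can be spot-checked with a small validator. This is a sketch under stated assumptions: `check_unified` and `REQUIRED_KEYS` are hypothetical names, not part of the module, and the pairwise row assumes the two `_response` entries are ordered chosen first, rejected second.

```python
from typing import Any

REQUIRED_KEYS = ("_prompt", "_response", "_system", "_tools", "_images", "_videos", "_audios")


def check_unified(row: dict[str, Any]) -> list[str]:
    """Return a list of contract violations (empty when the row conforms)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in row]
    # Message lists must hold dicts carrying both "role" and "content".
    for key in ("_prompt", "_response"):
        for message in row.get(key) or []:
            if not {"role", "content"} <= set(message):
                problems.append(f"{key} message lacks role/content: {message!r}")
    # System prompt and tool definitions are plain strings.
    for key in ("_system", "_tools"):
        if key in row and not isinstance(row[key], str):
            problems.append(f"{key} must be a string")
    # Media fields are either None or a list.
    for key in ("_images", "_videos", "_audios"):
        if row.get(key) is not None and not isinstance(row[key], list):
            problems.append(f"{key} must be a list or None")
    return problems


# Pairwise row: two candidate responses (assumed order: chosen, then rejected).
pairwise_row = {
    "_prompt": [{"role": "user", "content": "Summarize the report."}],
    "_response": [
        {"role": "assistant", "content": "A concise, accurate summary."},
        {"role": "assistant", "content": "An off-topic reply."},
    ],
    "_system": "",
    "_tools": "",
    "_images": None,
    "_videos": None,
    "_audios": None,
}
violations = check_unified(pairwise_row)
```

A check like this is mainly useful when writing a custom converter, where a missing key or a bare string in `_response` is an easy mistake to make.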
Usage Examples
from llamafactory.data.converter import align_dataset

# Typically called internally by the data loader; raw_dataset, dataset_attr,
# data_args, and training_args are assumed to have been prepared already.
aligned_dataset = align_dataset(
    dataset=raw_dataset,
    dataset_attr=dataset_attr,  # contains formatting="sharegpt"
    data_args=data_args,
    training_args=training_args,
)
# Each example now has _prompt, _response, _system, _tools, _images, _videos, _audios

# Register a custom converter:
from dataclasses import dataclass

from llamafactory.data.converter import register_dataset_converter, DatasetConverter


@dataclass
class MyCustomConverter(DatasetConverter):
    def __call__(self, example):
        # Map the dataset's "input" and "output" columns onto the unified schema.
        return {
            "_prompt": [{"role": "user", "content": example["input"]}],
            "_response": [{"role": "assistant", "content": example["output"]}],
            "_system": "",
            "_tools": "",
            "_images": None,
            "_videos": None,
            "_audios": None,
        }


register_dataset_converter("custom", MyCustomConverter)
Related Pages
- Hiyouga_LLaMA_Factory_Data_Loader - Orchestrator that calls align_dataset during loading
- Hiyouga_LLaMA_Factory_Chat_Template - Template that consumes the standardized format during tokenization
- Hiyouga_LLaMA_Factory_Tool_Utils - Tool formatting utilities referenced by OpenAIDatasetConverter