Implementation:Hiyouga LLaMA Factory Data Converter

From Leeroopedia


Knowledge Sources
Domains Data Processing, Dataset Normalization
Last Updated 2026-02-06 19:00 GMT

Overview

Concrete dataset format converters, provided by LLaMA Factory, that normalize Alpaca, ShareGPT, and OpenAI datasets into a unified schema.

Description

This module defines an abstract DatasetConverter base class and three concrete implementations that transform diverse dataset formats into a standardized internal representation. The converters handle:

  • AlpacaDatasetConverter -- Converts Alpaca-format datasets with prompt/query/response/history fields
  • SharegptDatasetConverter -- Converts ShareGPT-format multi-turn conversation datasets with role-tagged messages
  • OpenAIDatasetConverter -- Converts OpenAI-format datasets with tool_calls, tool responses, and thinking mode annotations

Each converter maps input fields to a standard schema: _prompt (conversation history), _response (model output), _system (system prompt), _tools (tool definitions), _images, _videos, and _audios (multimodal inputs). The converters also support pairwise (DPO/ranking) and KTO training formats by encoding chosen/rejected responses appropriately.
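
To make the target schema concrete, here is a minimal sketch of an Alpaca-style conversion. It is not the library's implementation: the function name `convert_alpaca_example` is hypothetical, and the input column names (`instruction`, `input`, `output`, `history`, `system`, `chosen`, `rejected`) are assumed defaults, since real datasets map their columns through `dataset_attr`. The ordering of the pairwise entries (chosen first, rejected second) is also an assumption.

```python
from typing import Any


def convert_alpaca_example(example: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical stand-in for an Alpaca-style converter (not the library's code)."""
    prompt: list[dict[str, str]] = []

    # Earlier turns are assumed to arrive as [user_text, assistant_text] pairs.
    for old_query, old_response in example.get("history") or []:
        prompt.append({"role": "user", "content": old_query})
        prompt.append({"role": "assistant", "content": old_response})

    # The current turn joins the instruction and the optional input with a newline.
    parts = [example.get("instruction"), example.get("input")]
    prompt.append({"role": "user", "content": "\n".join(p for p in parts if p)})

    if "chosen" in example and "rejected" in example:
        # Pairwise data (assumed layout): preferred answer first, rejected second.
        response = [
            {"role": "assistant", "content": example["chosen"]},
            {"role": "assistant", "content": example["rejected"]},
        ]
    else:
        response = [{"role": "assistant", "content": example.get("output", "")}]

    return {
        "_prompt": prompt,
        "_response": response,
        "_system": example.get("system", ""),
        "_tools": "",
        "_images": None,
        "_videos": None,
        "_audios": None,
    }
```

Every converter, whatever its input format, is expected to return a dict with these same seven keys, which is what lets later stages ignore where the data came from.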

A registry pattern with DATASET_CONVERTERS, register_dataset_converter, and get_dataset_converter allows extensibility.
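
The registry idea can be sketched in a few lines. The three names below mirror the ones mentioned above, but the bodies, the argument lists, and the error messages are assumptions made for illustration rather than the module's actual code; `UppercaseConverter` is a toy class that exists only to exercise the registry.

```python
from typing import Any

# Hypothetical registry: maps a format name to a converter class.
DATASET_CONVERTERS: dict[str, type] = {}


def register_dataset_converter(name: str, converter_cls: type) -> None:
    """Add a converter class under a unique name (assumed behavior)."""
    if name in DATASET_CONVERTERS:
        raise ValueError(f"Dataset converter {name} already exists.")
    DATASET_CONVERTERS[name] = converter_cls


def get_dataset_converter(name: str, *args: Any, **kwargs: Any) -> Any:
    """Look up a converter class by name and instantiate it (assumed behavior)."""
    if name not in DATASET_CONVERTERS:
        raise ValueError(f"Dataset converter {name} not found.")
    return DATASET_CONVERTERS[name](*args, **kwargs)


class UppercaseConverter:
    """Toy converter used only to exercise the registry."""

    def __init__(self, prefix: str = "") -> None:
        self.prefix = prefix

    def __call__(self, example: dict[str, Any]) -> dict[str, Any]:
        content = self.prefix + example["text"].upper()
        return {"_prompt": [{"role": "user", "content": content}]}


register_dataset_converter("upper", UppercaseConverter)
converter = get_dataset_converter("upper", prefix="> ")
```

Keeping the lookup behind a name means a new format can be added without touching the loading code: it only has to register itself under the string used in the dataset configuration.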

Usage

Converters are applied during the dataset loading phase via the align_dataset function, which calls dataset.map(converter) to transform every example in a loaded dataset. The converter to use is determined by the formatting field in the dataset configuration.
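
The per-example mapping step can be illustrated without any dataset library. The sketch below uses a plain list in place of a real dataset; `map_examples`, `sharegpt_like`, and `CONVERTERS_BY_FORMATTING` are hypothetical names, and the `conversations`/`from`/`value` keys and the `human`/`gpt` tags are assumed ShareGPT conventions rather than details taken from this page.

```python
from typing import Any, Callable, Iterable

Example = dict[str, Any]


def map_examples(
    examples: Iterable[Example], converter: Callable[[Example], Example]
) -> list[Example]:
    """Plain-Python stand-in for dataset.map(converter), one example at a time."""
    return [converter(example) for example in examples]


def sharegpt_like(example: Example) -> Example:
    """Hypothetical converter: turns 'from'/'value' turns into role/content messages."""
    roles = {"human": "user", "gpt": "assistant"}
    messages = [
        {"role": roles[turn["from"]], "content": turn["value"]}
        for turn in example["conversations"]
    ]
    # The last message is treated as the response, everything before it as the prompt.
    return {"_prompt": messages[:-1], "_response": messages[-1:]}


# Hypothetical lookup keyed by the dataset's formatting value.
CONVERTERS_BY_FORMATTING = {"sharegpt": sharegpt_like}

raw = [
    {
        "conversations": [
            {"from": "human", "value": "Hi"},
            {"from": "gpt", "value": "Hello!"},
        ]
    }
]
aligned = map_examples(raw, CONVERTERS_BY_FORMATTING["sharegpt"])
```

Because the converter is just a callable from one dict to another, the same object works for in-memory and streaming datasets alike.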

Code Reference

Source Location

Signature

@dataclass
class DatasetConverter:
    dataset_attr: "DatasetAttr"
    data_args: "DataArguments"
    def _find_medias(self, medias: Union["MediaType", list["MediaType"], None]) -> list["MediaType"] | None: ...
    @abstractmethod
    def __call__(self, example: dict[str, Any]) -> dict[str, Any]: ...

@dataclass
class AlpacaDatasetConverter(DatasetConverter):
    def __call__(self, example: dict[str, Any]) -> dict[str, Any]: ...

@dataclass
class SharegptDatasetConverter(DatasetConverter):
    def __call__(self, example: dict[str, Any]) -> dict[str, Any]: ...

@dataclass
class OpenAIDatasetConverter(DatasetConverter):
    def __call__(self, example: dict[str, Any]) -> dict[str, Any]: ...

def align_dataset(
    dataset: Union["Dataset", "IterableDataset"],
    dataset_attr: "DatasetAttr",
    data_args: "DataArguments",
    training_args: "Seq2SeqTrainingArguments",
) -> Union["Dataset", "IterableDataset"]: ...

Import

from llamafactory.data.converter import (
    AlpacaDatasetConverter,
    SharegptDatasetConverter,
    OpenAIDatasetConverter,
    align_dataset,
    register_dataset_converter,
)

I/O Contract

Inputs

Name | Type | Required | Description
example | dict[str, Any] | Yes | A single dataset example in its original format (Alpaca, ShareGPT, or OpenAI)
dataset_attr | DatasetAttr | Yes | Dataset metadata defining field mappings (column names for prompt, response, system, etc.)
data_args | DataArguments | Yes | Data arguments, including media_dir for resolving media file paths

Outputs

Name | Type | Description
_prompt | list[dict[str, str]] | Conversation history as role/content message dicts
_response | list[dict[str, str]] | Model response(s); multiple entries for pairwise/KTO data
_system | str | System prompt text
_tools | str | Tool definitions as a JSON string
_images | list[str] or None | Paths or URLs of image inputs
_videos | list[str] or None | Paths or URLs of video inputs
_audios | list[str] or None | Paths or URLs of audio inputs
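
The media outputs are either a list of paths or None, and media_dir is used to resolve file paths. One way such a resolution step could work is sketched below. This is an assumption-laden illustration, not the behavior of `_find_medias`: the function name `find_medias`, its signature, and the rule "prefix media_dir only when the file exists there" are all hypothetical.

```python
import os
import tempfile
from typing import Optional, Union


def find_medias(medias: Union[str, list[str], None], media_dir: str) -> Optional[list[str]]:
    """Hypothetical helper: normalize a media field to a list of paths or None."""
    if medias is None:
        return None
    if not isinstance(medias, list):
        medias = [medias]  # a single path becomes a one-element list
    if len(medias) == 0:
        return None

    resolved = []
    for media in medias:
        candidate = os.path.join(media_dir, media)
        # Prefix media_dir only when the file exists there; URLs and
        # anything else that is not found are passed through unchanged.
        resolved.append(candidate if os.path.isfile(candidate) else media)
    return resolved


# Small demonstration with a temporary media directory.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "cat.png"), "wb").close()
    demo_local = find_medias("cat.png", demo_dir)
    demo_remote = find_medias(["https://example.com/a.png"], demo_dir)
```

Returning None rather than an empty list for text-only examples matches the "list[str] or None" types listed in the outputs.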

Usage Examples

from llamafactory.data.converter import align_dataset

# Typically called internally by the data loader:
aligned_dataset = align_dataset(
    dataset=raw_dataset,
    dataset_attr=dataset_attr,  # contains formatting="sharegpt"
    data_args=data_args,
    training_args=training_args,
)
# Each example now has _prompt, _response, _system, _tools, _images, _videos, _audios

# Register a custom converter:
from dataclasses import dataclass

from llamafactory.data.converter import register_dataset_converter, DatasetConverter

@dataclass
class MyCustomConverter(DatasetConverter):
    def __call__(self, example):
        # Map the dataset's "input"/"output" columns onto the unified schema.
        return {
            "_prompt": [{"role": "user", "content": example["input"]}],
            "_response": [{"role": "assistant", "content": example["output"]}],
            "_system": "", "_tools": "",
            "_images": None, "_videos": None, "_audios": None,
        }

register_dataset_converter("custom", MyCustomConverter)
