

Principle:Hiyouga LLaMA Factory Dataset Format Conversion

From Leeroopedia


Knowledge Sources
Domains: Data Engineering, NLP
Last Updated: 2026-02-06 19:00 GMT

Overview

Dataset format conversion in LLaMA-Factory is a standardized data pipeline that transforms heterogeneous dataset formats (Alpaca, ShareGPT, OpenAI) into a unified internal representation, enabling a single training pipeline to consume datasets from any supported source format.

Description

Training datasets for large language models come in many formats, reflecting the diverse ecosystem of data creation tools and communities. LLaMA-Factory addresses this by implementing a converter pattern where each input format has a dedicated converter that maps it to a standardized internal schema. This decouples format-specific parsing from the downstream processing pipeline (tokenization, packing, batching).

The pipeline consists of three stages:

1. Dataset parsing (DatasetAttr and get_dataset_list): The parser reads dataset configuration from a JSON registry file (dataset_info.json), resolving dataset sources (HuggingFace Hub, ModelScope, local files, cloud files, scripts) and column mappings. Each dataset entry specifies its format, column names, split, subset, and any special tags for role identification.

2. Format conversion (DatasetConverter hierarchy): Converters transform raw examples into the internal schema:

Internal Field               Description
_prompt                      List of message dicts representing the conversation history
_response                    List of assistant response message dicts
_system                      System prompt string
_tools                       Tool definitions string
_images, _videos, _audios    Multimodal media references

Three converters handle the major formats:

  • AlpacaDatasetConverter: Maps instruction/input/output/history fields to the message format.
  • SharegptDatasetConverter: Maps conversation arrays with role tags (human/gpt/system/function_call/observation) to the standardized format, with validation of alternating user/assistant turns.
  • OpenAIDatasetConverter: Handles OpenAI-style message arrays including tool call aggregation and function-calling message consolidation.

The v1 system uses a plugin architecture (DataConverterPlugin) that allows converters to be registered dynamically. Built-in plugins handle Alpaca, ShareGPT, OpenAI, and pair (DPO) formats with type-safe TypedDict definitions.

3. Dataset loading (get_dataset pipeline): The loader orchestrates loading from various sources, applying format conversion, merging multiple datasets with configurable mixing, splitting into train/eval sets, and preprocessing (tokenization) via stage-specific processors (pretrain, supervised, pairwise, feedback, unsupervised).
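The source-resolution step in stage 1 can be sketched as follows. This is a minimal illustration, not the library's parser: SimpleDatasetAttr and parse_entry are hypothetical names, and the registry key names (hf_hub_url, ms_hub_url, script_url, cloud_file_name, file_name) and the "alpaca" default are assumptions that should be checked against the dataset_info.json documentation for your LLaMA-Factory version.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SimpleDatasetAttr:
    """Hypothetical, trimmed-down stand-in for DatasetAttr."""

    load_from: str
    dataset_name: str
    formatting: str = "alpaca"  # assumed default format
    split: str = "train"
    columns: dict[str, str] = field(default_factory=dict)


# Registry keys checked in order; the key names are assumptions.
SOURCE_KEYS = [
    ("hf_hub_url", "hf_hub"),
    ("ms_hub_url", "ms_hub"),
    ("script_url", "script"),
    ("cloud_file_name", "cloud_file"),
    ("file_name", "file"),
]


def parse_entry(name: str, entry: dict[str, Any]) -> SimpleDatasetAttr:
    """Resolve one registry entry to a source type plus its column mapping."""
    for key, source in SOURCE_KEYS:
        if key in entry:
            return SimpleDatasetAttr(
                load_from=source,
                dataset_name=entry[key],
                formatting=entry.get("formatting", "alpaca"),
                split=entry.get("split", "train"),
                columns=dict(entry.get("columns", {})),
            )
    raise ValueError(f"Dataset {name!r} has no recognized source key.")
```

Resolving the source once, up front, lets the loader and converters work from a single attribute object instead of re-reading the raw registry entry.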

Usage

The dataset format conversion system is used in every training and evaluation run. Users interact with it by:

  • Defining datasets in dataset_info.json with appropriate format specifications.
  • Specifying column mappings when dataset field names differ from defaults (e.g., "columns": {"prompt": "question", "response": "answer"}).
  • Marking ranking (pairwise) data by setting "ranking": true and mapping the chosen/rejected fields.
  • Marking KTO data by mapping kto_tag to a boolean column.
  • Using the ONLINE dataset directory for direct HuggingFace Hub access without local configuration.
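The registry entries described in this list might look like the sketch below, written as a Python dict and serialized to JSON. The dataset names and file names are hypothetical, and the exact set of supported keys should be confirmed against the dataset_info.json reference for your version.

```python
import json

# Hypothetical registry entries; dataset and file names are illustrative only.
dataset_info = {
    "my_qa_alpaca": {
        "file_name": "my_qa.json",
        # Map internal column names to the field names used in the file.
        "columns": {"prompt": "question", "response": "answer"},
    },
    "my_chat_sharegpt": {
        "file_name": "my_chat.json",
        "formatting": "sharegpt",
        "columns": {"messages": "conversations"},
        "tags": {
            "role_tag": "from",
            "content_tag": "value",
            "user_tag": "human",
            "assistant_tag": "gpt",
        },
    },
    "my_prefs_pairwise": {
        "file_name": "my_prefs.json",
        "ranking": True,  # pairwise data with chosen/rejected responses
        "columns": {
            "prompt": "instruction",
            "chosen": "chosen",
            "rejected": "rejected",
        },
    },
    "my_kto": {
        "file_name": "my_kto.json",
        "formatting": "sharegpt",
        # kto_tag points at a boolean column in the data file.
        "columns": {"messages": "messages", "kto_tag": "label"},
    },
}

# Text that could be written to dataset_info.json.
registry_text = json.dumps(dataset_info, indent=2)
```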

Theoretical Basis

The design follows the adapter pattern from software engineering, where incompatible interfaces are made compatible through intermediate adapters. The type hierarchy is:

DatasetConverter (abstract base)
  +-- AlpacaDatasetConverter
  +-- SharegptDatasetConverter
  +-- OpenAIDatasetConverter

Each converter implements a __call__ method that maps a single raw example to the internal schema:

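from abc import abstractmethod
from dataclasses import dataclass
from typing import Any

@dataclass
class DatasetConverter:
    dataset_attr: "DatasetAttr"  # parsed registry entry for this dataset
    data_args: "DataArguments"   # data-related arguments for the run

    @abstractmethod
    def __call__(self, example: dict[str, Any]) -> dict[str, Any]:
        """Convert a single raw example to the internal schema."""
        ...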

The Alpaca format follows the original Stanford Alpaca schema:

# Input: {"instruction": "...", "input": "...", "output": "...", "history": [...]}
# Output: {"_prompt": [{"role": "user", "content": "instruction\ninput"}],
#          "_response": [{"role": "assistant", "content": "output"}], ...}
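The mapping in the comment above can be sketched as a standalone function. This is a simplified stand-in rather than the library's AlpacaDatasetConverter: convert_alpaca is a hypothetical name, it assumes the default column names, and it treats history as a list of (user, assistant) pairs.

```python
from typing import Any


def convert_alpaca(example: dict[str, Any]) -> dict[str, Any]:
    """Simplified Alpaca-to-internal-schema conversion (illustrative only)."""
    prompt: list[dict[str, str]] = []

    # Earlier turns, if present, are assumed to be (user, assistant) pairs.
    for old_prompt, old_response in example.get("history") or []:
        prompt.append({"role": "user", "content": old_prompt})
        prompt.append({"role": "assistant", "content": old_response})

    # Join the instruction and the optional input with a newline.
    parts = [example.get("instruction") or "", example.get("input") or ""]
    query = "\n".join(part for part in parts if part)
    prompt.append({"role": "user", "content": query})

    response = [{"role": "assistant", "content": example.get("output") or ""}]
    return {"_prompt": prompt, "_response": response}


converted = convert_alpaca(
    {
        "instruction": "Translate to French",
        "input": "hello",
        "output": "bonjour",
        "history": [["hi", "hey"]],
    }
)
```

Here converted["_prompt"] holds three messages (one history pair plus the final user turn) and converted["_response"] holds the single assistant reply.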

The ShareGPT format handles multi-turn conversations with configurable role tags:

tag_mapping = {
    dataset_attr.user_tag: Role.USER.value,        # "human" -> "user"
    dataset_attr.assistant_tag: Role.ASSISTANT.value, # "gpt" -> "assistant"
    dataset_attr.observation_tag: Role.OBSERVATION.value,
    dataset_attr.function_tag: Role.FUNCTION.value,
    dataset_attr.system_tag: Role.SYSTEM.value,
}

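Validation enforces an alternating turn structure. Counting from one, odd-positioned messages (the 1st, 3rd, and so on; zero-based indices 0, 2, ...) must carry user or observation tags, and even-positioned messages must carry assistant or function tags. Examples that violate this order are treated as broken, skipped, and reported with a warning.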
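A minimal sketch of this check, assuming the default tag names (human, gpt, observation, function_call), the "from"/"value" message keys, and the supervised case where a conversation must end on an assistant-side turn. The function name and the empty-result convention for broken data are illustrative, not the library's API.

```python
import logging
from typing import Any

logger = logging.getLogger(__name__)

# Default tag names as described on this page; datasets may override them.
TAG_TO_ROLE = {
    "human": "user",
    "gpt": "assistant",
    "observation": "observation",
    "function_call": "function",
}
USER_SIDE = ("human", "observation")
ASSISTANT_SIDE = ("gpt", "function_call")


def convert_sharegpt(messages: list[dict[str, str]]) -> dict[str, Any]:
    """Validate turn order, then split the conversation into prompt/response."""
    # Zero-based index 0, 2, ... must be user-side; 1, 3, ... assistant-side.
    accept = (USER_SIDE, ASSISTANT_SIDE)
    # Assumption: an empty or odd-length conversation is invalid for SFT.
    broken = len(messages) == 0 or len(messages) % 2 != 0
    aligned: list[dict[str, str]] = []
    for idx, msg in enumerate(messages):
        if msg["from"] not in accept[idx % 2]:
            broken = True
            break
        aligned.append({"role": TAG_TO_ROLE[msg["from"]], "content": msg["value"]})

    if broken:
        logger.warning("Skipping example with invalid turn order.")
        return {"_prompt": [], "_response": []}

    # Everything but the last message is context; the last one is the target.
    return {"_prompt": aligned[:-1], "_response": aligned[-1:]}
```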

The OpenAI format adds tool-calling support, consolidating multiple tool role messages into single observation entries:

if "tool_calls" in message and len(message["tool_calls"]) > 0:
    tool_calls_list = [tool["function"] for tool in message["tool_calls"]]
    content = json.dumps(tool_calls_list, ensure_ascii=False)
    role = self.dataset_attr.function_tag
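The consolidation step can be illustrated end to end with a self-contained sketch. The serialization of tool calls follows the excerpt above; how consecutive tool results are combined (here, as a JSON list in one observation entry) is an assumption, and normalize_openai_messages is a hypothetical name.

```python
import json
from typing import Any


def normalize_openai_messages(messages: list[dict[str, Any]]) -> list[dict[str, str]]:
    """Serialize tool calls and merge runs of consecutive tool results."""
    out: list[dict[str, str]] = []
    pending_tool_results: list[str] = []

    def flush() -> None:
        # Emit one observation entry for a run of consecutive tool messages.
        if pending_tool_results:
            content = json.dumps(pending_tool_results, ensure_ascii=False)
            out.append({"role": "observation", "content": content})
            pending_tool_results.clear()

    for message in messages:
        if message["role"] == "tool":
            pending_tool_results.append(message["content"])
            continue
        flush()
        if message.get("tool_calls"):
            # Keep only the function payload of each tool call.
            calls = [call["function"] for call in message["tool_calls"]]
            content = json.dumps(calls, ensure_ascii=False)
            out.append({"role": "function", "content": content})
        else:
            out.append({"role": message["role"], "content": message.get("content") or ""})

    flush()  # handle a conversation that ends on tool results
    return out
```

Buffering tool results until the next non-tool message keeps the user-side/assistant-side alternation intact even when one assistant turn triggers several tool calls.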

The v1 plugin system extends this with a registration pattern:

@DataConverterPlugin("alpaca").register()
def alpaca_converter(raw_sample: AlpacaSample) -> SFTSample:
    messages = []
    # ... conversion logic ...
    return {"messages": messages}
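One way such a registration pattern could work is sketched below. ConverterRegistry is a hypothetical stand-in, not the library's DataConverterPlugin; it only mirrors the decorator shape shown above. The loss_weight of 0.0 on the user message is likewise an assumption. The decorator expression requires Python 3.9 or later.

```python
from typing import Any, Callable

Converter = Callable[[dict[str, Any]], dict[str, Any]]


class ConverterRegistry:
    """Hypothetical name-keyed registry of converter functions."""

    _converters: dict[str, Converter] = {}

    def __init__(self, name: str) -> None:
        self.name = name

    def register(self) -> Callable[[Converter], Converter]:
        """Return a decorator that stores the function under self.name."""

        def decorator(func: Converter) -> Converter:
            ConverterRegistry._converters[self.name] = func
            return func

        return decorator

    def __call__(self, raw_sample: dict[str, Any]) -> dict[str, Any]:
        """Look up the converter by name and apply it to one sample."""
        if self.name not in self._converters:
            raise KeyError(f"No converter registered for {self.name!r}")
        return self._converters[self.name](raw_sample)


@ConverterRegistry("alpaca").register()
def alpaca_converter(raw_sample: dict[str, Any]) -> dict[str, Any]:
    user_text = raw_sample["instruction"]
    messages = [
        # Assumption: prompts are excluded from the loss (weight 0.0).
        {"role": "user", "content": [{"type": "text", "value": user_text}], "loss_weight": 0.0},
        {"role": "assistant", "content": [{"type": "text", "value": raw_sample["output"]}], "loss_weight": 1.0},
    ]
    return {"messages": messages}


sample = ConverterRegistry("alpaca")({"instruction": "Say hi", "output": "hi"})
```

Keying converters by name lets a dataset's declared format select its converter at load time without the loader importing each converter explicitly.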

The v1 internal format uses a richer content-typed message schema where each content element has a type (text, tool_call) and value, along with a loss_weight field that controls per-message loss masking:

{"role": "assistant",
 "content": [{"type": "text", "value": "response text"}],
 "loss_weight": 1.0}  # compute loss on this message

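To make the effect of loss_weight concrete, the sketch below turns a list of v1-style messages into token IDs and training labels, masking every message whose weight is zero. The character-level tokenizer is a toy placeholder, build_labels is a hypothetical name, and treating any positive weight as "train on this message" is a simplifying assumption. The value -100 is the default ignore index of PyTorch's cross-entropy loss.

```python
from typing import Any, Callable

IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def toy_tokenize(text: str) -> list[int]:
    """Toy tokenizer for illustration: one token per character."""
    return [ord(ch) for ch in text]


def build_labels(
    messages: list[dict[str, Any]],
    tokenize: Callable[[str], list[int]] = toy_tokenize,
) -> tuple[list[int], list[int]]:
    """Return (input_ids, labels) with zero-weight messages masked out."""
    input_ids: list[int] = []
    labels: list[int] = []
    for message in messages:
        # Concatenate only the text elements of the typed content list.
        text = "".join(
            part["value"] for part in message["content"] if part["type"] == "text"
        )
        ids = tokenize(text)
        input_ids.extend(ids)
        if message["loss_weight"] > 0:
            labels.extend(ids)  # contributes to the loss
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # masked
    return input_ids, labels


example_messages = [
    {"role": "user", "content": [{"type": "text", "value": "ab"}], "loss_weight": 0.0},
    {"role": "assistant", "content": [{"type": "text", "value": "cd"}], "loss_weight": 1.0},
]
ids, labels = build_labels(example_messages)
```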
The dataset loading pipeline handles multi-source merging with configurable dataset mixing:

# Load and align each dataset
for dataset_name, dataset_attr in zip(dataset_names, get_dataset_list(...)):
    datasets[dataset_name] = _load_single_dataset(dataset_attr, ...)

# Merge with optional interleaving
return merge_dataset(list(datasets.values()), data_args, seed=training_args.seed)
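The difference between concatenating and interleaving sources can be shown with plain Python lists. This sketch does not reproduce merge_dataset: merge_examples, the strategy names, and the policy of stopping once a sampled source is exhausted are all illustrative assumptions.

```python
import random
from typing import Any, Optional, Sequence


def merge_examples(
    datasets: Sequence[list[Any]],
    strategy: str = "concat",
    probabilities: Optional[Sequence[float]] = None,
    seed: int = 42,
) -> list[Any]:
    """Merge several example lists by concatenation or seeded interleaving."""
    if strategy == "concat":
        # Sources are appended one after another in the given order.
        return [example for dataset in datasets for example in dataset]

    if strategy == "interleave":
        rng = random.Random(seed)  # seeded for reproducible mixing
        weights = list(probabilities) if probabilities is not None else [1.0] * len(datasets)
        cursors = [0] * len(datasets)
        merged: list[Any] = []
        while True:
            # Pick the next source according to the mixing weights.
            idx = rng.choices(range(len(datasets)), weights=weights, k=1)[0]
            if cursors[idx] >= len(datasets[idx]):
                return merged  # stop when the sampled source has run out
            merged.append(datasets[idx][cursors[idx]])
            cursors[idx] += 1

    raise ValueError(f"Unknown strategy: {strategy}")
```

Passing the training seed into the merge step, as the excerpt above does, makes the mixed order reproducible across runs.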
