Principle:Hiyouga LLaMA Factory Dataset Format Conversion
| Knowledge Sources | |
|---|---|
| Domains | Data Engineering, NLP |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
Dataset format conversion in LLaMA-Factory is a standardized data pipeline that transforms heterogeneous dataset formats (Alpaca, ShareGPT, OpenAI) into a unified internal representation, enabling a single training pipeline to consume datasets from any supported source format.
Description
Training datasets for large language models come in many formats, reflecting the diverse ecosystem of data creation tools and communities. LLaMA-Factory addresses this by implementing a converter pattern where each input format has a dedicated converter that maps it to a standardized internal schema. This decouples format-specific parsing from the downstream processing pipeline (tokenization, packing, batching).
The pipeline consists of three stages:
1. Dataset parsing (DatasetAttr and get_dataset_list): The parser reads dataset configuration from a JSON registry file (dataset_info.json), resolving dataset sources (HuggingFace Hub, ModelScope, local files, cloud files, scripts) and column mappings. Each dataset entry specifies its format, column names, split, subset, and any special tags for role identification.
2. Format conversion (DatasetConverter hierarchy): Converters transform raw examples into the internal schema:
| Internal Field | Description |
|---|---|
| _prompt | List of message dicts representing the conversation history |
| _response | List of assistant response message dicts |
| _system | System prompt string |
| _tools | Tool definitions string |
| _images, _videos, _audios | Multimodal media references |
Three converters handle the major formats:
- AlpacaDatasetConverter: Maps instruction/input/output/history fields to the message format.
- SharegptDatasetConverter: Maps conversation arrays with role tags (human/gpt/system/function_call/observation) to the standardized format, with validation of alternating user/assistant turns.
- OpenAIDatasetConverter: Handles OpenAI-style message arrays including tool call aggregation and function-calling message consolidation.
The v1 system uses a plugin architecture (DataConverterPlugin) that allows converters to be registered dynamically. Built-in plugins handle Alpaca, ShareGPT, OpenAI, and pair (DPO) formats with type-safe TypedDict definitions.
3. Dataset loading (get_dataset pipeline): The loader orchestrates loading from various sources, applying format conversion, merging multiple datasets with configurable mixing, splitting into train/eval sets, and preprocessing (tokenization) via stage-specific processors (pretrain, supervised, pairwise, feedback, unsupervised).
Usage
The dataset format conversion system is used in every training and evaluation run. Users interact with it by:
- Defining datasets in dataset_info.json with appropriate format specifications.
- Specifying column mappings when dataset field names differ from defaults (e.g., "columns": {"prompt": "question", "response": "answer"}).
- Supporting ranking (pairwise) data by setting "ranking": true with chosen/rejected fields.
- Supporting KTO data by setting a kto_tag field pointing to a boolean column.
- Using the ONLINE dataset directory for direct HuggingFace Hub access without local configuration.
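As an illustration of the options listed above, the following sketch builds two registry entries in Python and serializes them the way they might appear in dataset_info.json. The dataset names, file names, and column names are hypothetical, and the `file_name` key is an assumption about the registry schema that this page does not confirm, so check it against the project's data documentation.

```python
import json

# Hypothetical registry entries; dataset names, file names, and column
# names are made up for illustration.
registry = {
    "my_qa_data": {
        "file_name": "my_qa_data.json",  # assumed key for a local file
        "columns": {"prompt": "question", "response": "answer"},
    },
    "my_preference_data": {
        "file_name": "my_preference_data.json",  # assumed key
        "ranking": True,  # marks the dataset as pairwise (chosen/rejected)
        "columns": {"prompt": "question", "chosen": "better", "rejected": "worse"},
    },
}


def to_registry_json(entries: dict) -> str:
    """Serialize registry entries as pretty-printed JSON."""
    return json.dumps(entries, indent=2, ensure_ascii=False)


registry_json = to_registry_json(registry)
print(registry_json)
```

Building the entries as a Python dict and dumping them avoids hand-written JSON mistakes such as writing `True` instead of `true`.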
Theoretical Basis
The design follows the adapter pattern from software engineering, where incompatible interfaces are made compatible through intermediate adapters. The type hierarchy is:
DatasetConverter (abstract base)
+-- AlpacaDatasetConverter
+-- SharegptDatasetConverter
+-- OpenAIDatasetConverter
Each converter implements a __call__ method that maps a single raw example to the internal schema:
@dataclass
class DatasetConverter:
    dataset_attr: DatasetAttr
    data_args: DataArguments

    @abstractmethod
    def __call__(self, example: dict[str, Any]) -> dict[str, Any]:
        ...
The Alpaca format follows the original Stanford Alpaca schema:
# Input: {"instruction": "...", "input": "...", "output": "...", "history": [...]}
# Output: {"_prompt": [{"role": "user", "content": "instruction\ninput"}],
# "_response": [{"role": "assistant", "content": "output"}], ...}
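The mapping described in the comment above can be sketched as a standalone function. This is a simplified illustration, not the library's AlpacaDatasetConverter: the function name `convert_alpaca` is hypothetical, the history field is assumed to be a list of [user_text, assistant_text] pairs, and multimodal fields are left out.

```python
from typing import Any


def convert_alpaca(example: dict[str, Any]) -> dict[str, Any]:
    """Map one Alpaca-style record to the internal schema (illustrative only)."""
    prompt: list[dict[str, str]] = []

    # Assumption: history is a list of [user_text, assistant_text] pairs.
    for user_text, assistant_text in example.get("history") or []:
        prompt.append({"role": "user", "content": user_text})
        prompt.append({"role": "assistant", "content": assistant_text})

    # Join instruction and input with a newline, skipping empty parts.
    parts = [example.get("instruction") or "", example.get("input") or ""]
    query = "\n".join(part for part in parts if part)
    prompt.append({"role": "user", "content": query})

    response = [{"role": "assistant", "content": example.get("output") or ""}]
    return {"_prompt": prompt, "_response": response, "_system": "", "_tools": ""}


record = {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"}
converted = convert_alpaca(record)
print(converted["_prompt"])
```

Earlier turns from the history go into `_prompt` ahead of the current query, so `_response` only ever holds the final assistant turn.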
The ShareGPT format handles multi-turn conversations with configurable role tags:
tag_mapping = {
    dataset_attr.user_tag: Role.USER.value,  # "human" -> "user"
    dataset_attr.assistant_tag: Role.ASSISTANT.value,  # "gpt" -> "assistant"
    dataset_attr.observation_tag: Role.OBSERVATION.value,
    dataset_attr.function_tag: Role.FUNCTION.value,
    dataset_attr.system_tag: Role.SYSTEM.value,
}
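Validation enforces an alternating turn structure: counting from one, messages in odd positions (1st, 3rd, ...) must carry the user or observation tag, and messages in even positions (2nd, 4th, ...) must carry the assistant or function tag. Records that break this pattern are flagged as broken data and skipped with a warning.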
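A minimal sketch of the role mapping and turn-order check, written as a standalone function rather than the library's SharegptDatasetConverter. The `from` and `value` keys follow the common ShareGPT layout and are an assumption here, as are the internal role strings for observation and function messages. System messages are omitted for brevity, and the function name `convert_turns` is hypothetical.

```python
from typing import Optional

# Default ShareGPT role tags mapped to internal roles. The tags are
# configurable in the real system; these literals are the defaults named
# in the text, and the "observation"/"function" targets are assumed.
TAG_MAPPING = {
    "human": "user",
    "gpt": "assistant",
    "observation": "observation",
    "function_call": "function",
}
FIRST_SPEAKER_TAGS = ("human", "observation")   # 1st, 3rd, 5th, ... message
SECOND_SPEAKER_TAGS = ("gpt", "function_call")  # 2nd, 4th, 6th, ... message


def convert_turns(messages: list[dict[str, str]]) -> Optional[list[dict[str, str]]]:
    """Return role-normalized messages, or None if the turn order is broken."""
    accepted = (FIRST_SPEAKER_TAGS, SECOND_SPEAKER_TAGS)
    converted = []
    for index, message in enumerate(messages):
        # Zero-based index 0, 2, 4, ... must come from the first speaker group.
        if message["from"] not in accepted[index % 2]:
            return None
        converted.append({"role": TAG_MAPPING[message["from"]], "content": message["value"]})
    return converted


good = [{"from": "human", "value": "hi"}, {"from": "gpt", "value": "hello"}]
bad = [{"from": "gpt", "value": "hello"}, {"from": "human", "value": "hi"}]
print(convert_turns(good))
print(convert_turns(bad))
```

Returning `None` for a broken record lets the caller decide whether to drop it, log it, or raise.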
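The OpenAI format adds tool-calling support: consecutive tool-role messages are consolidated into a single observation entry, and the tool_calls attached to an assistant message are serialized into one function-call message: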
if "tool_calls" in message and len(message["tool_calls"]) > 0:
    tool_calls_list = [tool["function"] for tool in message["tool_calls"]]
    content = json.dumps(tool_calls_list, ensure_ascii=False)
    role = self.dataset_attr.function_tag
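The excerpt above can be exercised in isolation with a small standalone function. The name `render_assistant_message` and the default tag value `"function_call"` are assumptions made for this sketch; the sample message follows the OpenAI chat format, with a made-up tool name.

```python
import json
from typing import Any


def render_assistant_message(
    message: dict[str, Any], function_tag: str = "function_call"
) -> tuple[str, str]:
    """Return (role_tag, content) for one OpenAI-style assistant message."""
    tool_calls = message.get("tool_calls") or []
    if len(tool_calls) > 0:
        # Keep only the "function" payload of each call and serialize the list.
        tool_calls_list = [tool["function"] for tool in tool_calls]
        return function_tag, json.dumps(tool_calls_list, ensure_ascii=False)
    # Plain assistant message: pass role and text through unchanged.
    return message["role"], message.get("content") or ""


message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {
            "id": "call_1",
            "type": "function",
            "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"},
        },
    ],
}
role, content = render_assistant_message(message)
print(role, content)
```

Serializing with `ensure_ascii=False` keeps non-ASCII tool arguments readable in the stored text instead of turning them into escape sequences.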
The v1 plugin system extends this with a registration pattern:
@DataConverterPlugin("alpaca").register()
def alpaca_converter(raw_sample: AlpacaSample) -> SFTSample:
    messages = []
    # ... conversion logic ...
    return {"messages": messages}
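To show how such a registration decorator can work, here is a self-contained sketch of the general pattern. `ConverterRegistry` is a hypothetical stand-in and does not reproduce the internals of DataConverterPlugin; the converter body is also simplified to a single user/assistant exchange.

```python
from typing import Any, Callable

Converter = Callable[[dict[str, Any]], dict[str, Any]]


class ConverterRegistry:
    """Hypothetical stand-in for a plugin registry keyed by format name."""

    _converters: dict[str, Converter] = {}

    def __init__(self, name: str) -> None:
        self.name = name

    def register(self) -> Callable[[Converter], Converter]:
        """Return a decorator that stores the function under this name."""
        def decorator(func: Converter) -> Converter:
            ConverterRegistry._converters[self.name] = func
            return func
        return decorator

    def __call__(self, raw_sample: dict[str, Any]) -> dict[str, Any]:
        """Look up the converter registered under this name and apply it."""
        return ConverterRegistry._converters[self.name](raw_sample)


@ConverterRegistry("alpaca").register()
def alpaca_converter(raw_sample: dict[str, Any]) -> dict[str, Any]:
    messages = [
        {"role": "user", "content": raw_sample["instruction"]},
        {"role": "assistant", "content": raw_sample["output"]},
    ]
    return {"messages": messages}


result = ConverterRegistry("alpaca")({"instruction": "Say hi.", "output": "Hi."})
print(result)
```

Because lookup happens by name at call time, new formats can be added without touching the code that invokes the converter. The decorator expression used here requires Python 3.9 or later.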
The v1 internal format uses a richer content-typed message schema where each content element has a type (text, tool_call) and value, along with a loss_weight field that controls per-message loss masking:
{"role": "assistant",
 "content": [{"type": "text", "value": "response text"}],
 "loss_weight": 1.0}  # compute loss on this message
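One way to picture this schema is with TypedDict definitions and a short conversation. The type names `Content` and `Message` and the helper `text_message` are hypothetical, and giving the user turn a weight of 0.0 is an assumption about how prompts would typically be masked, not something this page states.

```python
from typing import TypedDict


class Content(TypedDict):
    type: str   # e.g. "text" or "tool_call"
    value: str


class Message(TypedDict):
    role: str
    content: list[Content]
    loss_weight: float


def text_message(role: str, text: str, loss_weight: float) -> Message:
    """Build a single-text-element message with the given loss weight."""
    return {
        "role": role,
        "content": [{"type": "text", "value": text}],
        "loss_weight": loss_weight,
    }


conversation = [
    text_message("user", "What is 2 + 2?", 0.0),  # assumed: prompt is masked
    text_message("assistant", "4", 1.0),          # loss is computed here
]

# Messages with a positive weight are the ones that contribute to the loss.
trained_roles = [m["role"] for m in conversation if m["loss_weight"] > 0]
print(trained_roles)
```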
The dataset loading pipeline handles multi-source merging with configurable dataset mixing:
# Load and align each dataset
for dataset_name, dataset_attr in zip(dataset_names, get_dataset_list(...)):
    datasets[dataset_name] = _load_single_dataset(dataset_attr, ...)

# Merge with optional interleaving
return merge_dataset(list(datasets.values()), data_args, seed=training_args.seed)
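The difference between plain concatenation and interleaving can be sketched with ordinary Python lists. This is a toy illustration only: the real merge_dataset operates on dataset objects and may use a different interleaving strategy (for example, sampling with probabilities and a seed), and the function names below are hypothetical.

```python
from itertools import chain
from typing import Any


def concat(datasets: list[list[Any]]) -> list[Any]:
    """Append the datasets one after another."""
    return list(chain.from_iterable(datasets))


def interleave(datasets: list[list[Any]]) -> list[Any]:
    """Take one example from each dataset in turn; stop at the shortest."""
    return [example for group in zip(*datasets) for example in group]


first = ["a1", "a2", "a3"]
second = ["b1", "b2"]

print(concat([first, second]))      # ['a1', 'a2', 'a3', 'b1', 'b2']
print(interleave([first, second]))  # ['a1', 'b1', 'a2', 'b2']
```

Concatenation keeps every example but leaves ordering to a later shuffle, while this round-robin interleave balances sources at the cost of dropping the tail of longer datasets.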