Implementation:Hiyouga LLaMA Factory V1 Data Converter

Knowledge Sources	Hiyouga_LLaMA_Factory
Domains	Machine Learning, Data Processing, Plugin Architecture
Last Updated	2026-02-06 19:00 GMT

Overview

DataConverterPlugin provides pluggable dataset format converters that transform Alpaca, ShareGPT, and pair-format raw samples into the standardized internal Message representation used by the training pipeline.

Description

The converter module defines a plugin system through DataConverterPlugin, a subclass of BasePlugin, and registers three converters with it. alpaca_converter maps the instruction, input, and output fields (plus an optional system field) to messages with the system, user, and assistant roles. sharegpt_converter handles ShareGPT-style conversations: it maps the source tags (human, gpt, system, function_call, observation) to message roles and parses the JSON payload of function_call messages into tool calls. pair_converter turns chosen/rejected message lists into a DPO-compatible sample with chosen_messages and rejected_messages fields. Every converter sets a loss_weight on each message: 1.0 for assistant (trainable) content and 0.0 for user and system content. Tool call parsing logs an error when the JSON is malformed.
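The Alpaca mapping described above can be sketched as a standalone function. This is an illustration, not the module's code: the helper names text_message and alpaca_to_messages are hypothetical, the content layout (a list of {"type": "text", "value": ...} entries) is an assumption about the Message format, and joining instruction and input with a newline is also an assumption.

```python
from typing import Any


def text_message(role: str, text: str, loss_weight: float) -> dict[str, Any]:
    """Build one message. The content layout is an assumption, not the confirmed schema."""
    return {
        "role": role,
        "content": [{"type": "text", "value": text}],
        "loss_weight": loss_weight,
    }


def alpaca_to_messages(raw_sample: dict[str, str]) -> dict[str, Any]:
    """Hypothetical re-implementation of the Alpaca-to-Message mapping."""
    messages = []

    # Optional system prompt: not trained on, so loss_weight is 0.0.
    if raw_sample.get("system"):
        messages.append(text_message("system", raw_sample["system"], 0.0))

    # The separator between instruction and input is an assumption (newline here).
    prompt = raw_sample["instruction"]
    if raw_sample.get("input"):
        prompt = prompt + "\n" + raw_sample["input"]
    messages.append(text_message("user", prompt, 0.0))

    # The assistant response is the trainable part, so loss_weight is 1.0.
    messages.append(text_message("assistant", raw_sample["output"], 1.0))
    return {"messages": messages}


sample = alpaca_to_messages({"instruction": "Explain AI", "output": "AI is..."})
roles = [message["role"] for message in sample["messages"]]
```

Only the assistant message carries a non-zero loss_weight, which matches the weighting rule stated in the description.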

Usage

Converters are invoked automatically by DataEngine when a dataset's YAML configuration specifies a converter field (e.g., converter: alpaca, converter: sharegpt, or converter: pair). To add a new dataset format, decorate a new converter function with @DataConverterPlugin("format_name").register().
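To make the registration pattern concrete, here is a minimal sketch of a name-keyed registry with the same call shape. ToyConverterPlugin and qa_converter are hypothetical stand-ins written for this page; the real BasePlugin internals are not shown in the source and may work differently.

```python
from typing import Any, Callable

Converter = Callable[[dict[str, Any]], dict[str, Any]]


class ToyConverterPlugin:
    """Hypothetical stand-in for a name-keyed plugin registry (not the real BasePlugin)."""

    _registry: dict[str, Converter] = {}

    def __init__(self, name: str) -> None:
        self.name = name

    def register(self) -> Callable[[Converter], Converter]:
        """Return a decorator that stores the function under this plugin's name."""

        def decorator(func: Converter) -> Converter:
            type(self)._registry[self.name] = func
            return func

        return decorator

    def __call__(self, raw_sample: dict[str, Any]) -> dict[str, Any]:
        """Look up the converter registered under this name and apply it."""
        if self.name not in type(self)._registry:
            raise ValueError(f"No converter registered under '{self.name}'.")
        return type(self)._registry[self.name](raw_sample)


@ToyConverterPlugin("qa").register()
def qa_converter(raw_sample: dict[str, Any]) -> dict[str, Any]:
    """Convert a made-up question/answer format into a messages list."""
    return {
        "messages": [
            {"role": "user", "content": raw_sample["question"], "loss_weight": 0.0},
            {"role": "assistant", "content": raw_sample["answer"], "loss_weight": 1.0},
        ]
    }


# Lookup by name, mirroring how a dataset config's converter field would be resolved.
converted = ToyConverterPlugin("qa")({"question": "2+2?", "answer": "4"})
```

Keying converters by name lets a dataset configuration select a format with a plain string, without the engine importing each converter function directly.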

Code Reference

Source Location

Repository: Hiyouga_LLaMA_Factory
File: src/llamafactory/v1/plugins/data_plugins/converter.py
Lines: 1-223

Signature

class AlpacaSample(TypedDict, total=False):
    system: NotRequired[str]
    instruction: str
    input: NotRequired[str]
    output: str

class SharegptSample(TypedDict, total=False):
    conversations: list[SharegptMessage]
    tools: NotRequired[str]

class OpenaiMessage(TypedDict, total=False):
    role: Literal["user", "assistant", "tool"]
    content: str

class OpenaiSample(TypedDict, total=False):
    messages: list[OpenaiMessage]

class PairSample(TypedDict, total=False):
    chosen: list[OpenaiMessage]
    rejected: list[OpenaiMessage]

class DataConverterPlugin(BasePlugin):
    def __call__(self, raw_sample: dict[str, Any]) -> Sample: ...

@DataConverterPlugin("alpaca").register()
def alpaca_converter(raw_sample: AlpacaSample) -> SFTSample: ...

@DataConverterPlugin("sharegpt").register()
def sharegpt_converter(raw_sample: SharegptSample) -> SFTSample: ...

@DataConverterPlugin("pair").register()
def pair_converter(raw_sample: PairSample) -> DPOSample: ...

Import

from llamafactory.v1.plugins.data_plugins.converter import DataConverterPlugin, alpaca_converter, sharegpt_converter, pair_converter

I/O Contract

Inputs (alpaca_converter)

Name	Type	Required	Description
system	str	No	Optional system prompt.
instruction	str	Yes	The instruction text for the user message.
input	str	No	Optional additional input text appended to instruction.
output	str	Yes	The expected assistant response.

Inputs (sharegpt_converter)

Name	Type	Required	Description
conversations	list[SharegptMessage]	Yes	List of messages with "from" (human/gpt/system/function_call/observation) and "value" fields.
tools	str	No	Optional JSON string of tool definitions.
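The tag mapping and tool call parsing for this input can be sketched as follows. This is a hedged illustration: the TAG_TO_ROLE mapping, the "tool_call" content type, the fallback to plain text on malformed JSON, and the function name sharegpt_to_messages are all assumptions; the source only states which tags are accepted and that malformed JSON is logged.

```python
import json
import logging
from typing import Any

logger = logging.getLogger(__name__)

# Assumed tag-to-role mapping; the source lists the tags but not their target roles.
TAG_TO_ROLE = {
    "human": "user",
    "gpt": "assistant",
    "system": "system",
    "function_call": "assistant",
    "observation": "tool",
}


def sharegpt_to_messages(raw_sample: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical re-implementation of the ShareGPT-to-Message mapping."""
    messages = []
    for turn in raw_sample["conversations"]:
        tag = turn["from"]
        if tag not in TAG_TO_ROLE:
            logger.warning("Skipping message with unknown tag: %s", tag)
            continue

        role = TAG_TO_ROLE[tag]
        content = [{"type": "text", "value": turn["value"]}]

        if tag == "function_call":
            try:
                parsed = json.loads(turn["value"])
                calls = parsed if isinstance(parsed, list) else [parsed]
                content = [{"type": "tool_call", "value": json.dumps(call)} for call in calls]
            except json.JSONDecodeError:
                # Assumed fallback: keep the raw text and log the problem.
                logger.error("Malformed tool call JSON: %s", turn["value"])

        loss_weight = 1.0 if role == "assistant" else 0.0
        messages.append({"role": role, "content": content, "loss_weight": loss_weight})

    sample: dict[str, Any] = {"messages": messages}
    if raw_sample.get("tools"):
        sample["tools"] = raw_sample["tools"]  # passed through as the original JSON string
    return sample
```

Falling back to a text entry keeps a sample usable when a single tool call is malformed; whether the real converter keeps or drops such messages is not stated in the source.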

Inputs (pair_converter)

Name	Type	Required	Description
chosen	list[OpenaiMessage]	Yes	List of messages for the preferred response.
rejected	list[OpenaiMessage]	Yes	List of messages for the rejected response.
tools	str	No	Optional JSON string of tool definitions.

Outputs

Name	Type	Description
alpaca_converter return	SFTSample	Dict with "messages" key containing standardized Message list with role, content, and loss_weight.
sharegpt_converter return	SFTSample	Dict with "messages" key and optional "tools" key. Tool calls are parsed into structured content entries.
pair_converter return	DPOSample	Dict with "chosen_messages" and "rejected_messages" keys for DPO training.
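The DPOSample shape in the last row can be illustrated with a small sketch. The helper names openai_to_messages and pair_to_dpo are hypothetical, and the per-message content layout is an assumption; only the chosen_messages and rejected_messages keys and the loss_weight rule come from the page itself.

```python
from typing import Any


def openai_to_messages(turns: list[dict[str, str]]) -> list[dict[str, Any]]:
    """Convert OpenAI-style turns into weighted messages (content layout is assumed)."""
    return [
        {
            "role": turn["role"],
            "content": [{"type": "text", "value": turn["content"]}],
            # Only assistant turns are trainable.
            "loss_weight": 1.0 if turn["role"] == "assistant" else 0.0,
        }
        for turn in turns
    ]


def pair_to_dpo(raw_sample: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical re-implementation of the pair-to-DPO mapping."""
    return {
        "chosen_messages": openai_to_messages(raw_sample["chosen"]),
        "rejected_messages": openai_to_messages(raw_sample["rejected"]),
    }


dpo_sample = pair_to_dpo(
    {
        "chosen": [
            {"role": "user", "content": "Hello"},
            {"role": "assistant", "content": "Good response"},
        ],
        "rejected": [
            {"role": "user", "content": "Hello"},
            {"role": "assistant", "content": "Bad response"},
        ],
    }
)
```

Both branches go through the same per-message conversion, so the chosen and rejected lists differ only in the content the raw sample supplied.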

Usage Examples

from llamafactory.v1.plugins.data_plugins.converter import DataConverterPlugin

# Using via plugin system (as DataEngine does)
converter = DataConverterPlugin("alpaca")
sample = converter({"instruction": "Explain AI", "output": "AI is..."})
# Returns: {"messages": [{"role": "user", ...}, {"role": "assistant", ...}]}

# ShareGPT format
converter = DataConverterPlugin("sharegpt")
sample = converter({
    "conversations": [
        {"from": "human", "value": "Hello"},
        {"from": "gpt", "value": "Hi there!"},
    ]
})

# DPO pair format
converter = DataConverterPlugin("pair")
sample = converter({
    "chosen": [
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Good response"},
    ],
    "rejected": [
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Bad response"},
    ],
})

Related Pages

Hiyouga_LLaMA_Factory_V1_Data_Engine - Invokes converters during sample retrieval based on dataset config.
Hiyouga_LLaMA_Factory_V1_Rendering - Consumes the standardized Message format produced by converters.
Hiyouga_LLaMA_Factory_V1_Data_Loader - Sibling plugin for data loading.
