Implementation:Hiyouga LLaMA Factory V1 Data Converter

From Leeroopedia


Knowledge Sources
Domains: Machine Learning, Data Processing, Plugin Architecture
Last Updated: 2026-02-06 19:00 GMT

Overview

DataConverterPlugin provides pluggable dataset format converters that transform Alpaca, ShareGPT, and pair-format raw samples into the standardized internal Message representation used by the training pipeline.

Description

The converter module defines a plugin system via DataConverterPlugin (extending BasePlugin) with three registered converters. The alpaca_converter transforms instruction/input/output fields into the Message format with system, user, and assistant roles. The sharegpt_converter handles ShareGPT-style conversations with role tag mapping (human/gpt/system/function_call/observation) and parses tool call JSON from function_call messages. The pair_converter processes chosen/rejected message pairs into DPO-compatible format with chosen_messages and rejected_messages fields. Each converter assigns appropriate loss_weight values (1.0 for assistant/trainable content, 0.0 for user/system content) and handles tool call parsing with error logging for malformed JSON.

Usage

Converters are invoked automatically by DataEngine when a dataset's YAML configuration specifies a converter field (e.g., converter: alpaca, converter: sharegpt, or converter: pair). To add a new dataset format, register a new converter function with @DataConverterPlugin("format_name").register().

Code Reference

Source Location

Signature

class AlpacaSample(TypedDict, total=False):
    system: NotRequired[str]
    instruction: str
    input: NotRequired[str]
    output: str

class SharegptSample(TypedDict, total=False):
    conversations: list[SharegptMessage]
    tools: NotRequired[str]

class OpenaiMessage(TypedDict, total=False):
    role: Literal["user", "assistant", "tool"]
    content: str

class OpenaiSample(TypedDict, total=False):
    messages: list[OpenaiMessage]

class PairSample(TypedDict, total=False):
    chosen: list[OpenaiMessage]
    rejected: list[OpenaiMessage]

class DataConverterPlugin(BasePlugin):
    def __call__(self, raw_sample: dict[str, Any]) -> Sample: ...

@DataConverterPlugin("alpaca").register()
def alpaca_converter(raw_sample: AlpacaSample) -> SFTSample: ...

@DataConverterPlugin("sharegpt").register()
def sharegpt_converter(raw_sample: SharegptSample) -> SFTSample: ...

@DataConverterPlugin("pair").register()
def pair_converter(raw_sample: PairSample) -> DPOSample: ...

Import

from llamafactory.v1.plugins.data_plugins.converter import DataConverterPlugin, alpaca_converter, sharegpt_converter, pair_converter

I/O Contract

Inputs (alpaca_converter)

Name | Type | Required | Description
system | str | No | Optional system prompt.
instruction | str | Yes | The instruction text for the user message.
input | str | No | Optional additional input text appended to instruction.
output | str | Yes | The expected assistant response.
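As a rough illustration of how these fields could map onto role-tagged messages, the sketch below builds a system, user, and assistant message with the loss weights described earlier. It is not the library's code: `sketch_alpaca_to_messages` is a hypothetical name, the `{"type", "value"}` content entries are an assumed layout, and the `sep` parameter used to join `instruction` and `input` is an assumption, since the page does not say how the two are joined.

```python
from typing import Any


def sketch_alpaca_to_messages(raw_sample: dict[str, Any], sep: str = "\n") -> dict[str, Any]:
    """Illustrative only: map Alpaca fields to role-tagged messages."""

    def text_message(role: str, text: str, loss_weight: float) -> dict[str, Any]:
        # Assumed layout: content is a list of {"type", "value"} entries.
        return {
            "role": role,
            "content": [{"type": "text", "value": text}],
            "loss_weight": loss_weight,
        }

    messages = []
    if raw_sample.get("system"):
        messages.append(text_message("system", raw_sample["system"], 0.0))

    # Append the optional input to the instruction; `sep` is a hypothetical knob.
    prompt = raw_sample["instruction"]
    if raw_sample.get("input"):
        prompt = prompt + sep + raw_sample["input"]
    messages.append(text_message("user", prompt, 0.0))

    # Only the assistant turn carries a non-zero loss weight.
    messages.append(text_message("assistant", raw_sample["output"], 1.0))
    return {"messages": messages}


result = sketch_alpaca_to_messages(
    {"instruction": "Translate to English:", "input": "bonjour", "output": "hello"}
)
```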

Inputs (sharegpt_converter)

Name | Type | Required | Description
conversations | list[SharegptMessage] | Yes | List of messages with "from" (human/gpt/system/function_call/observation) and "value" fields.
tools | str | No | Optional JSON string of tool definitions.
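The role tag mapping and tool call handling can be sketched as follows. Several details here are assumptions rather than facts from the source: the tag-to-role table (for example `function_call` mapping to `assistant` and `observation` to `tool`), the `{"type", "value"}` content entries, and the choice to keep the raw text when the JSON is malformed. `sketch_sharegpt_to_messages` and `TAG_TO_ROLE` are hypothetical names.

```python
import json
import logging
from typing import Any

logger = logging.getLogger(__name__)

# Assumed mapping from ShareGPT "from" tags to message roles.
TAG_TO_ROLE = {
    "human": "user",
    "gpt": "assistant",
    "system": "system",
    "function_call": "assistant",
    "observation": "tool",
}


def sketch_sharegpt_to_messages(raw_sample: dict[str, Any]) -> dict[str, Any]:
    """Illustrative only: convert ShareGPT turns into role-tagged messages."""
    messages = []
    for turn in raw_sample.get("conversations", []):
        tag = turn["from"]
        if tag not in TAG_TO_ROLE:
            logger.warning("Skipping turn with unknown tag: %s", tag)
            continue

        role = TAG_TO_ROLE[tag]
        content = [{"type": "text", "value": turn["value"]}]
        if tag == "function_call":
            try:
                # Parse the tool call JSON into a structured content entry.
                content = [{"type": "tool_call", "value": json.loads(turn["value"])}]
            except json.JSONDecodeError:
                # Assumed fallback: log the error and keep the raw text.
                logger.error("Malformed tool call JSON: %r", turn["value"])

        messages.append(
            {
                "role": role,
                "content": content,
                "loss_weight": 1.0 if role == "assistant" else 0.0,
            }
        )

    sample: dict[str, Any] = {"messages": messages}
    if "tools" in raw_sample:
        sample["tools"] = raw_sample["tools"]  # passed through unchanged
    return sample
```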

Inputs (pair_converter)

Name | Type | Required | Description
chosen | list[OpenaiMessage] | Yes | List of messages for the preferred response.
rejected | list[OpenaiMessage] | Yes | List of messages for the rejected response.
tools | str | No | Optional JSON string of tool definitions.
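A minimal sketch of the pair conversion, assuming each OpenAI-style message is mapped one-to-one and that only assistant turns receive a loss weight of 1.0. The function name `sketch_pair_to_dpo` and the `{"type", "value"}` content entries are hypothetical, and handling of the optional `tools` field is left out because the page does not describe it.

```python
from typing import Any


def sketch_pair_to_dpo(raw_sample: dict[str, Any]) -> dict[str, Any]:
    """Illustrative only: convert chosen/rejected lists into DPO-style fields."""

    def convert(turns: list[dict[str, str]]) -> list[dict[str, Any]]:
        return [
            {
                "role": turn["role"],
                # Assumed layout: content is a list of {"type", "value"} entries.
                "content": [{"type": "text", "value": turn["content"]}],
                "loss_weight": 1.0 if turn["role"] == "assistant" else 0.0,
            }
            for turn in turns
        ]

    return {
        "chosen_messages": convert(raw_sample["chosen"]),
        "rejected_messages": convert(raw_sample["rejected"]),
    }


pair = sketch_pair_to_dpo(
    {
        "chosen": [
            {"role": "user", "content": "Hello"},
            {"role": "assistant", "content": "Good response"},
        ],
        "rejected": [
            {"role": "user", "content": "Hello"},
            {"role": "assistant", "content": "Bad response"},
        ],
    }
)
```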

Outputs

Name | Type | Description
alpaca_converter return | SFTSample | Dict with "messages" key containing standardized Message list with role, content, and loss_weight.
sharegpt_converter return | SFTSample | Dict with "messages" key and optional "tools" key. Tool calls are parsed into structured content entries.
pair_converter return | DPOSample | Dict with "chosen_messages" and "rejected_messages" keys for DPO training.

Usage Examples

from llamafactory.v1.plugins.data_plugins.converter import DataConverterPlugin

# Using via plugin system (as DataEngine does)
converter = DataConverterPlugin("alpaca")
sample = converter({"instruction": "Explain AI", "output": "AI is..."})
# Returns: {"messages": [{"role": "user", ...}, {"role": "assistant", ...}]}

# ShareGPT format
converter = DataConverterPlugin("sharegpt")
sample = converter({
    "conversations": [
        {"from": "human", "value": "Hello"},
        {"from": "gpt", "value": "Hi there!"},
    ]
})

# DPO pair format
converter = DataConverterPlugin("pair")
sample = converter({
    "chosen": [
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Good response"},
    ],
    "rejected": [
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Bad response"},
    ],
})
