Overview
DataConverterPlugin provides pluggable dataset format converters that transform Alpaca, ShareGPT, and pair-format raw samples into the standardized internal Message representation used by the training pipeline.
Description
The converter module defines a plugin system via DataConverterPlugin (extending BasePlugin) with three registered converters. The alpaca_converter transforms instruction/input/output fields into the Message format with system, user, and assistant roles. The sharegpt_converter handles ShareGPT-style conversations with role tag mapping (human/gpt/system/function_call/observation) and parses tool call JSON from function_call messages. The pair_converter processes chosen/rejected message pairs into DPO-compatible format with chosen_messages and rejected_messages fields. Each converter assigns appropriate loss_weight values (1.0 for assistant/trainable content, 0.0 for user/system content) and handles tool call parsing with error logging for malformed JSON.
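The Alpaca mapping described above can be sketched as a standalone function. This is a minimal illustration, not the library's implementation: the helper names `text_message` and `alpaca_to_messages` are hypothetical, the content shape (a list of `{"type": "text", "value": ...}` entries) is an assumption, and so is joining `instruction` and `input` with a newline.

```python
from typing import Any


def text_message(role: str, text: str, loss_weight: float) -> dict[str, Any]:
    """Build one message. The content layout here is an assumed shape."""
    return {
        "role": role,
        "content": [{"type": "text", "value": text}],
        "loss_weight": loss_weight,
    }


def alpaca_to_messages(raw_sample: dict[str, str]) -> dict[str, Any]:
    """Hypothetical re-implementation of the Alpaca mapping for illustration."""
    messages = []

    # System and user turns are context only, so they carry loss_weight 0.0.
    if raw_sample.get("system"):
        messages.append(text_message("system", raw_sample["system"], 0.0))

    user_text = raw_sample["instruction"]
    if raw_sample.get("input"):
        # Assumption: the optional input is appended after a newline.
        user_text = f"{user_text}\n{raw_sample['input']}"
    messages.append(text_message("user", user_text, 0.0))

    # The assistant turn is the trainable content, so it carries loss_weight 1.0.
    messages.append(text_message("assistant", raw_sample["output"], 1.0))
    return {"messages": messages}


example = alpaca_to_messages(
    {"instruction": "Explain AI", "input": "In one sentence.", "output": "AI is..."}
)
```

Only the assistant turn contributes to the loss in this sketch, which matches the 1.0/0.0 split stated in the description.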
Usage
Converters are invoked automatically by DataEngine when a dataset's YAML configuration specifies a converter field (e.g., converter: alpaca, converter: sharegpt, or converter: pair). To add a new dataset format, register a new converter function with @DataConverterPlugin("format_name").register().
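To make the registration pattern concrete, the sketch below imitates a name-keyed registry with the same `Plugin("name").register()` decorator shape. `ConverterRegistry` and `qa_converter` are hypothetical stand-ins; how `BasePlugin` actually stores and dispatches converters is not shown in this page, and the message content shape is an assumption.

```python
from typing import Any, Callable

Converter = Callable[[dict[str, Any]], dict[str, Any]]


class ConverterRegistry:
    """Hypothetical stand-in for a plugin class keyed by format name."""

    _converters: dict[str, Converter] = {}

    def __init__(self, name: str) -> None:
        self.name = name

    def register(self) -> Callable[[Converter], Converter]:
        """Return a decorator that stores the function under this name."""

        def decorator(func: Converter) -> Converter:
            ConverterRegistry._converters[self.name] = func
            return func

        return decorator

    def __call__(self, raw_sample: dict[str, Any]) -> dict[str, Any]:
        """Look up the converter registered under this name and apply it."""
        try:
            converter = ConverterRegistry._converters[self.name]
        except KeyError:
            raise ValueError(f"No converter registered for {self.name!r}") from None
        return converter(raw_sample)


@ConverterRegistry("qa").register()
def qa_converter(raw_sample: dict[str, Any]) -> dict[str, Any]:
    """Example converter for a made-up question/answer format."""
    return {
        "messages": [
            {
                "role": "user",
                "content": [{"type": "text", "value": raw_sample["question"]}],
                "loss_weight": 0.0,
            },
            {
                "role": "assistant",
                "content": [{"type": "text", "value": raw_sample["answer"]}],
                "loss_weight": 1.0,
            },
        ]
    }


sample = ConverterRegistry("qa")({"question": "2+2?", "answer": "4"})
```

With this shape, a dataset configuration only needs to carry the format name as a string; the lookup happens when the plugin instance is called.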
Code Reference
Source Location
Signature
class AlpacaSample(TypedDict, total=False):
    system: NotRequired[str]
    instruction: str
    input: NotRequired[str]
    output: str

class SharegptSample(TypedDict, total=False):
    conversations: list[SharegptMessage]
    tools: NotRequired[str]

class OpenaiMessage(TypedDict, total=False):
    role: Literal["user", "assistant", "tool"]
    content: str

class OpenaiSample(TypedDict, total=False):
    messages: list[OpenaiMessage]

class PairSample(TypedDict, total=False):
    chosen: list[OpenaiMessage]
    rejected: list[OpenaiMessage]

class DataConverterPlugin(BasePlugin):
    def __call__(self, raw_sample: dict[str, Any]) -> Sample: ...

@DataConverterPlugin("alpaca").register()
def alpaca_converter(raw_sample: AlpacaSample) -> SFTSample: ...

@DataConverterPlugin("sharegpt").register()
def sharegpt_converter(raw_sample: SharegptSample) -> SFTSample: ...

@DataConverterPlugin("pair").register()
def pair_converter(raw_sample: PairSample) -> DPOSample: ...
Import
from llamafactory.v1.plugins.data_plugins.converter import DataConverterPlugin, alpaca_converter, sharegpt_converter, pair_converter
I/O Contract
Inputs (alpaca_converter)
| Name | Type | Required | Description |
|---|---|---|---|
| system | str | No | Optional system prompt. |
| instruction | str | Yes | The instruction text for the user message. |
| input | str | No | Optional additional input text appended to instruction. |
| output | str | Yes | The expected assistant response. |
Inputs (sharegpt_converter)
| Name | Type | Required | Description |
|---|---|---|---|
| conversations | list[SharegptMessage] | Yes | List of messages with "from" (human/gpt/system/function_call/observation) and "value" fields. |
| tools | str | No | Optional JSON string of tool definitions. |
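The role tag mapping and tool call handling for ShareGPT turns might look like the sketch below. Everything here is illustrative: `ROLE_MAP` and `convert_turn` are hypothetical names, the target role for each tag is an assumption (the page lists the tags but not their targets), and the `tool_call` content entry and the fall-back to plain text on malformed JSON are assumed behaviors.

```python
import json
import logging
from typing import Any

logger = logging.getLogger(__name__)

# Assumed mapping from ShareGPT "from" tags to internal roles.
ROLE_MAP = {
    "human": "user",
    "gpt": "assistant",
    "system": "system",
    "function_call": "assistant",
    "observation": "tool",
}


def convert_turn(turn: dict[str, str]) -> dict[str, Any]:
    """Convert one ShareGPT turn; unknown tags raise KeyError in this sketch."""
    tag = turn["from"]
    role = ROLE_MAP[tag]
    content: list[dict[str, Any]] = [{"type": "text", "value": turn["value"]}]

    if tag == "function_call":
        try:
            # Assumption: the value of a function_call turn is a JSON object.
            call = json.loads(turn["value"])
            content = [{"type": "tool_call", "value": call}]
        except json.JSONDecodeError:
            # Log and keep the raw text rather than dropping the turn.
            logger.warning("Malformed tool call JSON, keeping raw text: %r", turn["value"])

    # Only assistant-side turns are trainable.
    loss_weight = 1.0 if role == "assistant" else 0.0
    return {"role": role, "content": content, "loss_weight": loss_weight}


parsed = convert_turn(
    {"from": "function_call", "value": '{"name": "get_weather", "arguments": {"city": "Paris"}}'}
)
fallback = convert_turn({"from": "function_call", "value": "{not json"})
```

Keeping the malformed turn as text is one possible design choice; a stricter converter could skip the sample instead.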
Inputs (pair_converter)
| Name | Type | Required | Description |
|---|---|---|---|
| chosen | list[OpenaiMessage] | Yes | List of messages for the preferred response. |
| rejected | list[OpenaiMessage] | Yes | List of messages for the rejected response. |
| tools | str | No | Optional JSON string of tool definitions. |
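As a rough illustration of the pair format, the sketch below converts each side of a preference pair with the same per-message rule and returns the two lists under `chosen_messages` and `rejected_messages`. The names `to_messages` and `pair_to_dpo` are hypothetical, the content shape is an assumption, and handling of the optional `tools` field is left out because its output form is not described here.

```python
from typing import Any


def to_messages(openai_messages: list[dict[str, str]]) -> list[dict[str, Any]]:
    """Convert OpenAI-style messages; the content layout is an assumed shape."""
    return [
        {
            "role": message["role"],
            "content": [{"type": "text", "value": message["content"]}],
            # Assistant turns are trainable, everything else is context.
            "loss_weight": 1.0 if message["role"] == "assistant" else 0.0,
        }
        for message in openai_messages
    ]


def pair_to_dpo(raw_sample: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical sketch of the chosen/rejected mapping."""
    return {
        "chosen_messages": to_messages(raw_sample["chosen"]),
        "rejected_messages": to_messages(raw_sample["rejected"]),
    }


dpo_sample = pair_to_dpo(
    {
        "chosen": [
            {"role": "user", "content": "Hello"},
            {"role": "assistant", "content": "Good response"},
        ],
        "rejected": [
            {"role": "user", "content": "Hello"},
            {"role": "assistant", "content": "Bad response"},
        ],
    }
)
```

Both sides share the same prompt turns in this example, so only the final assistant message differs between the two lists.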
Outputs
| Name | Type | Description |
|---|---|---|
| alpaca_converter return | SFTSample | Dict with "messages" key containing standardized Message list with role, content, and loss_weight. |
| sharegpt_converter return | SFTSample | Dict with "messages" key and optional "tools" key. Tool calls are parsed into structured content entries. |
| pair_converter return | DPOSample | Dict with "chosen_messages" and "rejected_messages" keys for DPO training. |
Usage Examples
from llamafactory.v1.plugins.data_plugins.converter import DataConverterPlugin

# Using via plugin system (as DataEngine does)
converter = DataConverterPlugin("alpaca")
sample = converter({"instruction": "Explain AI", "output": "AI is..."})
# Returns: {"messages": [{"role": "user", ...}, {"role": "assistant", ...}]}

# ShareGPT format
converter = DataConverterPlugin("sharegpt")
sample = converter({
    "conversations": [
        {"from": "human", "value": "Hello"},
        {"from": "gpt", "value": "Hi there!"},
    ]
})

# DPO pair format
converter = DataConverterPlugin("pair")
sample = converter({
    "chosen": [
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Good response"},
    ],
    "rejected": [
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Bad response"},
    ],
})
Related Pages