Implementation:Hiyouga LLaMA Factory Data Loader
| Knowledge Sources | |
|---|---|
| Domains | Data Processing, Training Pipeline |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
Concrete end-to-end dataset loading, preprocessing, and tokenization orchestrator provided by LLaMA Factory.
Description
This module is the central data pipeline orchestrator for LLaMA Factory. The primary entry point get_dataset coordinates the entire flow from raw data sources to tokenized, training-ready datasets. The pipeline consists of:
- Loading -- _load_single_dataset loads datasets from HuggingFace Hub, ModelScope, OpenMind, local files, cloud storage, or scripts, with support for streaming, sampling, and truncation
- Format Alignment -- Each loaded dataset is passed through align_dataset to normalize it to the standard schema
- Merging -- _get_merged_dataset loads and merges multiple datasets with interleaving support
- Processor Selection -- _get_dataset_processor selects the appropriate tokenization processor based on training stage (pt, sft, rm, ppo, kto) and packing mode
- Preprocessing -- _get_preprocessed_dataset applies the selected processor via dataset.map()
- Splitting -- Train/eval splits are created and optionally saved as pre-tokenized datasets
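The stages listed above can be sketched with plain Python lists standing in for datasets. This is a toy illustration only, not the LLaMA Factory implementation: every name here (RAW_SOURCES, load_single, align, merge, preprocess, split, toy_get_dataset) is hypothetical, whitespace splitting stands in for real tokenization, and the round-robin interleave shown is just one possible strategy that may differ from the one the library uses.

```python
from itertools import chain, zip_longest

# Hypothetical in-memory "sources"; the real loader reads from hubs, files, etc.
RAW_SOURCES = {
    "alpaca_like": [
        {"instruction": "Add 1+1", "output": "2"},
        {"instruction": "Say hi", "output": "hi"},
    ],
    "qa_like": [{"question": "Capital of France?", "answer": "Paris"}],
}


def load_single(name):
    """Loading: fetch one raw dataset by name."""
    return list(RAW_SOURCES[name])


def align(rows):
    """Format alignment: map source-specific columns onto one shared schema."""
    return [
        {
            "prompt": row.get("instruction", row.get("question")),
            "response": row.get("output", row.get("answer")),
        }
        for row in rows
    ]


def merge(datasets, interleave=False):
    """Merging: concatenate, or take rows round-robin until all are exhausted."""
    if not interleave:
        return list(chain.from_iterable(datasets))
    missing = object()  # marks slots where a shorter dataset has run out
    return [
        row
        for group in zip_longest(*datasets, fillvalue=missing)
        for row in group
        if row is not missing
    ]


def preprocess(rows):
    """Preprocessing: whitespace splitting as a stand-in for tokenization."""
    return [{"input_ids": f"{r['prompt']} {r['response']}".split()} for r in rows]


def split(rows, eval_size):
    """Splitting: hold out the last eval_size rows as an evaluation set."""
    if eval_size <= 0:
        return {"train_dataset": rows}
    return {"train_dataset": rows[:-eval_size], "eval_dataset": rows[-eval_size:]}


def toy_get_dataset(names, interleave=False, eval_size=0):
    """Run the stages in the order described above."""
    merged = merge([align(load_single(n)) for n in names], interleave=interleave)
    return split(preprocess(merged), eval_size)


module = toy_get_dataset(["alpaca_like", "qa_like"], interleave=True, eval_size=1)
```

Keeping each stage as a separate function mirrors the way the description separates loading, alignment, merging, preprocessing, and splitting, so each step can be reasoned about in isolation.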
The module supports six dataset processor types: PretrainDatasetProcessor, SupervisedDatasetProcessor, PackedSupervisedDatasetProcessor, PairwiseDatasetProcessor, FeedbackDatasetProcessor, and UnsupervisedDatasetProcessor.
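As a rough illustration of processor selection, the sketch below maps a training stage and packing flag to one of the six processor names. The mapping is an assumption inferred from the processor names and the stage list, not a copy of _get_dataset_processor; the helper select_processor_name is hypothetical, returns class names as strings rather than real classes, and ignores the do_generate flag that appears in the actual signature.

```python
from typing import Literal

Stage = Literal["pt", "sft", "rm", "ppo", "kto"]


def select_processor_name(stage: Stage, packing: bool = False) -> str:
    """Return the name of the processor assumed to match a stage (illustrative only)."""
    if stage == "pt":
        return "PretrainDatasetProcessor"
    if stage == "sft":
        # Packing mode is assumed to switch SFT to the packed variant.
        if packing:
            return "PackedSupervisedDatasetProcessor"
        return "SupervisedDatasetProcessor"
    if stage == "rm":
        return "PairwiseDatasetProcessor"
    if stage == "kto":
        return "FeedbackDatasetProcessor"
    if stage == "ppo":
        # Assumed: PPO needs only prompts, so it uses the unsupervised processor.
        return "UnsupervisedDatasetProcessor"
    raise ValueError(f"Unknown stage: {stage!r}")
```

For example, select_processor_name("sft", packing=True) yields the packed supervised processor name, while an unrecognized stage raises ValueError instead of silently falling back.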
Usage
get_dataset is called by every training workflow (SFT, DPO, KTO, PPO, PT, RM) at the start of training to prepare the data. It returns a DatasetModule dictionary containing the training dataset and, when an evaluation split is available, the evaluation dataset.
Code Reference
Source Location
- Repository: Hiyouga_LLaMA_Factory
- File: src/llamafactory/data/loader.py
- Lines: 1-336
Signature
def get_dataset(
    template: "Template",
    model_args: "ModelArguments",
    data_args: "DataArguments",
    training_args: "Seq2SeqTrainingArguments",
    stage: Literal["pt", "sft", "rm", "ppo", "kto"],
    tokenizer: "PreTrainedTokenizer",
    processor: Optional["ProcessorMixin"] = None,
) -> "DatasetModule": ...

def _load_single_dataset(
    dataset_attr: "DatasetAttr",
    model_args: "ModelArguments",
    data_args: "DataArguments",
    training_args: "Seq2SeqTrainingArguments",
) -> Union["Dataset", "IterableDataset"]: ...

def _get_merged_dataset(
    dataset_names: list[str] | None,
    model_args: "ModelArguments",
    data_args: "DataArguments",
    training_args: "Seq2SeqTrainingArguments",
    stage: Literal["pt", "sft", "rm", "ppo", "kto"],
    return_dict: bool = False,
) -> Union["Dataset", "IterableDataset", dict[str, "Dataset"]] | None: ...

def _get_dataset_processor(
    data_args: "DataArguments",
    stage: Literal["pt", "sft", "rm", "ppo", "kto"],
    template: "Template",
    tokenizer: "PreTrainedTokenizer",
    processor: Optional["ProcessorMixin"],
    do_generate: bool = False,
) -> "DatasetProcessor": ...
Import
from llamafactory.data.loader import get_dataset
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| template | Template | Yes | Chat template for tokenization |
| model_args | ModelArguments | Yes | Model path and hub tokens for dataset downloading |
| data_args | DataArguments | Yes | Dataset names, paths, streaming, preprocessing config |
| training_args | Seq2SeqTrainingArguments | Yes | Training configuration including num_proc, seed, logging |
| stage | str | Yes | Training stage: "pt", "sft", "rm", "ppo", or "kto" |
| tokenizer | PreTrainedTokenizer | Yes | Tokenizer for encoding text to token IDs |
| processor | ProcessorMixin | No | Multimodal processor for VLM models |
Outputs
| Name | Type | Description |
|---|---|---|
| DatasetModule | dict | Dictionary with "train_dataset" and optional "eval_dataset" keys containing tokenized HuggingFace datasets |
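The output contract can be pictured as a dictionary type with one key that may be absent. The sketch below is an assumption based on the table above, not the library's own definition: DatasetModuleSketch and describe_module are hypothetical names, and Any stands in for the HuggingFace dataset types.

```python
from typing import Any, TypedDict


class DatasetModuleSketch(TypedDict, total=False):
    """Illustrative shape of the returned dictionary (both keys may be missing here)."""

    train_dataset: Any
    eval_dataset: Any


def describe_module(module: DatasetModuleSketch) -> dict:
    """Report which splits are present without assuming eval_dataset exists."""
    return {
        "has_train": module.get("train_dataset") is not None,
        "has_eval": module.get("eval_dataset") is not None,
    }
```

Reading eval_dataset with .get() rather than indexing avoids a KeyError when no evaluation split was produced.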
Usage Examples
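from llamafactory.data.loader import get_dataset

# Called in training workflows. template, model_args, data_args, training_args,
# tokenizer, and processor are assumed to have been constructed earlier.
dataset_module = get_dataset(
    template=template,
    model_args=model_args,
    data_args=data_args,
    training_args=training_args,
    stage="sft",
    tokenizer=tokenizer,
    processor=processor,
)

train_dataset = dataset_module["train_dataset"]
# eval_dataset is optional, so use .get() to avoid a KeyError when it is absent.
eval_dataset = dataset_module.get("eval_dataset")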
Related Pages
- Hiyouga_LLaMA_Factory_Data_Converter - Format alignment converters called during loading
- Hiyouga_LLaMA_Factory_Chat_Template - Template used during tokenization
- Hiyouga_LLaMA_Factory_Data_Collator - Collators that batch the preprocessed dataset for training
- Hiyouga_LLaMA_Factory_Constants - FILEEXT2TYPE mapping used for local file loading