
Implementation:Axolotl ai cloud Axolotl Load Datasets

From Leeroopedia


Knowledge Sources
Domains Data_Preparation, NLP
Last Updated 2026-02-06 23:00 GMT

Overview

A concrete tool, provided by the Axolotl framework, for loading, tokenizing, and preparing training datasets.

Description

The load_datasets function is the primary entry point for dataset preparation in Axolotl. It orchestrates the full pipeline: loading the tokenizer, optionally loading a processor (for multimodal models), delegating to prepare_datasets for SFT data or prepare_preference_datasets for RL data, and returning a TrainDatasetMeta namedtuple containing the prepared datasets and the total training step count.
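The orchestration described above can be sketched in simplified form. Everything below is illustrative: only the TrainDatasetMeta field names mirror the documented return type, while prepare_sft, prepare_preference, and load_datasets_sketch are stand-ins, not Axolotl internals.

```python
from collections import namedtuple

# Stand-in with the documented field names.
TrainDatasetMeta = namedtuple(
    "TrainDatasetMeta",
    ["train_dataset", "eval_dataset", "total_num_steps", "prompters"],
)

def prepare_sft(cfg, tokenizer):
    # Placeholder "tokenized" rows; the real code builds HF Dataset objects.
    data = [{"input_ids": [i]} for i in range(100)]
    n_eval = int(len(data) * cfg.get("val_set_size", 0))
    eval_split = data[:n_eval] or None
    return data[n_eval:], eval_split, ["alpaca_prompter"]

def prepare_preference(cfg, tokenizer):
    # RL/preference data path; no prompters in this sketch.
    return [{"chosen": [i], "rejected": [i]} for i in range(50)], None, []

def load_datasets_sketch(cfg):
    tokenizer = object()  # stand-in for the loaded tokenizer
    # Dispatch: RL configs take the preference path, otherwise SFT.
    prepare = prepare_preference if cfg.get("rl") else prepare_sft
    train_ds, eval_ds, prompters = prepare(cfg, tokenizer)
    # Simplified step count: samples per epoch // batch size * epochs.
    steps = len(train_ds) // cfg["micro_batch_size"] * cfg["num_epochs"]
    return TrainDatasetMeta(train_ds, eval_ds, steps, prompters)

meta = load_datasets_sketch(
    {"micro_batch_size": 4, "num_epochs": 1, "val_set_size": 0.1}
)
print(len(meta.train_dataset), meta.total_num_steps)  # → 90 22
```

The real function also computes total_num_steps from the full trainer configuration (gradient accumulation, world size, etc.); the arithmetic here is deliberately reduced.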

Internally, prepare_datasets (in utils/data/sft.py) handles individual dataset loading, prompt strategy application, tokenization, merging, deduplication, and train/eval splitting.
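The merging, deduplication, and train/eval splitting stages might look like the following sketch, which uses plain lists of token-id dicts rather than the Hugging Face Dataset objects the real prepare_datasets operates on; merge_dedupe_split is a hypothetical helper, not Axolotl code.

```python
def merge_dedupe_split(datasets, val_set_size):
    # Merge all tokenized datasets into one sequence.
    merged = [row for ds in datasets for row in ds]
    # Deduplicate on token ids while preserving order.
    seen, unique = set(), []
    for row in merged:
        key = tuple(row["input_ids"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    # Split off an eval slice of the requested fraction.
    n_eval = int(len(unique) * val_set_size)
    return unique[n_eval:], unique[:n_eval] or None

ds_a = [{"input_ids": [1, 2]}, {"input_ids": [3, 4]}]
ds_b = [{"input_ids": [3, 4]}, {"input_ids": [5, 6]}]  # one duplicate row
train, eval_ds = merge_dedupe_split([ds_a, ds_b], 0.0)
print(len(train), eval_ds)  # → 3 None
```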

Usage

Call this function after configuration validation and before model loading. It is invoked as part of the train orchestration function to prepare all training data.

Code Reference

Source Location

  • Repository: axolotl
  • File: src/axolotl/common/datasets.py
  • Lines: L39-98

Signature

def load_datasets(
    *,
    cfg: DictDefault,
    cli_args: PreprocessCliArgs | TrainerCliArgs | None = None,
    debug: bool = False,
) -> TrainDatasetMeta:
    """Load and prepare training datasets based on configuration.

    Args:
        cfg: Full training configuration dictionary.
        cli_args: Optional CLI arguments for preprocessing or training.
        debug: Enable debug mode for dataset preparation.

    Returns:
        TrainDatasetMeta: Named tuple containing (train_dataset, eval_dataset,
        total_num_steps, prompters).
    """

Import

from axolotl.common.datasets import load_datasets

I/O Contract

Inputs

  • cfg (DictDefault, required) — Full config with the datasets list, tokenizer settings, sequence_len, val_set_size, etc.
  • cli_args (PreprocessCliArgs, TrainerCliArgs, or None; optional) — CLI arguments controlling preprocessing behavior.
  • debug (bool, default: False) — Enable debug mode for verbose dataset preparation logging.

Outputs

  • return (TrainDatasetMeta) — Named tuple with fields: train_dataset (Dataset), eval_dataset (Dataset or None), total_num_steps (int), prompters (list).
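Because the return value is a namedtuple, its fields can be read by name or unpacked positionally. A minimal illustration using a stand-in class with the documented field names (the real class lives in axolotl.common.datasets):

```python
from collections import namedtuple

# Stand-in mirroring the documented field order.
TrainDatasetMeta = namedtuple(
    "TrainDatasetMeta",
    ["train_dataset", "eval_dataset", "total_num_steps", "prompters"],
)

meta = TrainDatasetMeta(["sample"], None, 10, [])
train_ds, eval_ds, steps, prompters = meta  # positional unpacking also works
print(meta.total_num_steps == steps)  # → True
```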

Usage Examples

Basic Dataset Loading

from axolotl.cli.config import load_cfg
from axolotl.utils.config import validate_config
from axolotl.common.datasets import load_datasets

# Load and validate config
cfg = load_cfg("examples/llama-3/qlora-1b.yml")
cfg = validate_config(cfg)

# Load datasets
dataset_meta = load_datasets(cfg=cfg)

print(f"Training samples: {len(dataset_meta.train_dataset)}")
print(f"Eval samples: {len(dataset_meta.eval_dataset) if dataset_meta.eval_dataset else 0}")
print(f"Total training steps: {dataset_meta.total_num_steps}")

Dataset Loading for Preprocessing

from axolotl.common.datasets import load_datasets
from axolotl.cli.args import PreprocessCliArgs

cli_args = PreprocessCliArgs(download=True)
dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args, debug=True)

Related Pages

Implements Principle

Requires Environment
