
Implementation:Axolotl ai cloud Axolotl Load Datasets

From Leeroopedia


Knowledge Sources
Domains Data_Preparation, NLP
Last Updated 2026-02-06 23:00 GMT

Overview

A concrete tool, provided by the Axolotl framework, for loading, tokenizing, and preparing training datasets.

Description

The load_datasets function is the primary entry point for dataset preparation in Axolotl. It orchestrates the full pipeline: loading the tokenizer, optionally loading a processor (for multimodal models), delegating to prepare_datasets for SFT data or prepare_preference_datasets for RL data, and returning a TrainDatasetMeta namedtuple containing the prepared datasets and the total training step count.
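The orchestration described above can be sketched in simplified form. Everything below is illustrative: only the TrainDatasetMeta field names mirror the documented return type, while prepare_sft, prepare_preference, and load_datasets_sketch are stand-ins, not Axolotl internals.

```python
from collections import namedtuple

# Stand-in with the documented field names.
TrainDatasetMeta = namedtuple(
    "TrainDatasetMeta",
    ["train_dataset", "eval_dataset", "total_num_steps", "prompters"],
)

def prepare_sft(cfg, tokenizer):
    # Placeholder "tokenized" rows; the real code builds HF Dataset objects.
    data = [{"input_ids": [i]} for i in range(100)]
    n_eval = int(len(data) * cfg.get("val_set_size", 0))
    eval_split = data[:n_eval] or None
    return data[n_eval:], eval_split, ["alpaca_prompter"]

def prepare_preference(cfg, tokenizer):
    # RL/preference data path; no prompters in this sketch.
    return [{"chosen": [i], "rejected": [i]} for i in range(50)], None, []

def load_datasets_sketch(cfg):
    tokenizer = object()  # stand-in for the loaded tokenizer
    # Dispatch: RL configs take the preference path, otherwise SFT.
    prepare = prepare_preference if cfg.get("rl") else prepare_sft
    train_ds, eval_ds, prompters = prepare(cfg, tokenizer)
    # Simplified step count: samples per epoch // batch size * epochs.
    steps = len(train_ds) // cfg["micro_batch_size"] * cfg["num_epochs"]
    return TrainDatasetMeta(train_ds, eval_ds, steps, prompters)

meta = load_datasets_sketch(
    {"micro_batch_size": 4, "num_epochs": 1, "val_set_size": 0.1}
)
print(len(meta.train_dataset), meta.total_num_steps)  # → 90 22
```

The real function also computes total_num_steps from the full trainer configuration (gradient accumulation, world size, etc.); the arithmetic here is deliberately reduced.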

Internally, prepare_datasets (in utils/data/sft.py) handles individual dataset loading, prompt strategy application, tokenization, merging, deduplication, and train/eval splitting.
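The merging, deduplication, and train/eval splitting stages might look like the following sketch, which uses plain lists of token-id dicts rather than the Hugging Face Dataset objects the real prepare_datasets operates on; merge_dedupe_split is a hypothetical helper, not Axolotl code.

```python
def merge_dedupe_split(datasets, val_set_size):
    # Merge all tokenized datasets into one sequence.
    merged = [row for ds in datasets for row in ds]
    # Deduplicate on token ids while preserving order.
    seen, unique = set(), []
    for row in merged:
        key = tuple(row["input_ids"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    # Split off an eval slice of the requested fraction.
    n_eval = int(len(unique) * val_set_size)
    return unique[n_eval:], unique[:n_eval] or None

ds_a = [{"input_ids": [1, 2]}, {"input_ids": [3, 4]}]
ds_b = [{"input_ids": [3, 4]}, {"input_ids": [5, 6]}]  # one duplicate row
train, eval_ds = merge_dedupe_split([ds_a, ds_b], 0.0)
print(len(train), eval_ds)  # → 3 None
```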

Usage

Call this function after configuration validation and before model loading. It is invoked as part of the train orchestration function to prepare all training data.

Code Reference

Source Location

  • Repository: axolotl
  • File: src/axolotl/common/datasets.py
  • Lines: L39-98

Signature

def load_datasets(
    *,
    cfg: DictDefault,
    cli_args: PreprocessCliArgs | TrainerCliArgs | None = None,
    debug: bool = False,
) -> TrainDatasetMeta:
    """Load and prepare training datasets based on configuration.

    Args:
        cfg: Full training configuration dictionary.
        cli_args: Optional CLI arguments for preprocessing or training.
        debug: Enable debug mode for dataset preparation.

    Returns:
        TrainDatasetMeta: Named tuple containing (train_dataset, eval_dataset,
        total_num_steps, prompters).
    """

Import

from axolotl.common.datasets import load_datasets

I/O Contract

Inputs

  • cfg (DictDefault, required) — Full config with the datasets list, tokenizer settings, sequence_len, val_set_size, etc.
  • cli_args (PreprocessCliArgs, TrainerCliArgs, or None; optional) — CLI arguments controlling preprocessing behavior.
  • debug (bool, default: False) — Enable debug mode for verbose dataset preparation logging.

Outputs

  • return (TrainDatasetMeta) — Named tuple with fields: train_dataset (Dataset), eval_dataset (Dataset or None), total_num_steps (int), prompters (list).
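Because the return value is a namedtuple, its fields can be read by name or unpacked positionally. A minimal illustration using a stand-in class with the documented field names (the real class lives in axolotl.common.datasets):

```python
from collections import namedtuple

# Stand-in mirroring the documented field order.
TrainDatasetMeta = namedtuple(
    "TrainDatasetMeta",
    ["train_dataset", "eval_dataset", "total_num_steps", "prompters"],
)

meta = TrainDatasetMeta(["sample"], None, 10, [])
train_ds, eval_ds, steps, prompters = meta  # positional unpacking also works
print(meta.total_num_steps == steps)  # → True
```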

Usage Examples

Basic Dataset Loading

from axolotl.cli.config import load_cfg
from axolotl.utils.config import validate_config
from axolotl.common.datasets import load_datasets

# Load and validate config
cfg = load_cfg("examples/llama-3/qlora-1b.yml")
cfg = validate_config(cfg)

# Load datasets
dataset_meta = load_datasets(cfg=cfg)

print(f"Training samples: {len(dataset_meta.train_dataset)}")
print(f"Eval samples: {len(dataset_meta.eval_dataset) if dataset_meta.eval_dataset else 0}")
print(f"Total training steps: {dataset_meta.total_num_steps}")

Dataset Loading for Preprocessing

from axolotl.common.datasets import load_datasets
from axolotl.cli.args import PreprocessCliArgs

cli_args = PreprocessCliArgs(download=True)
dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args, debug=True)

Related Pages

Implements Principle

Requires Environment
