Implementation:Axolotl ai cloud Axolotl Load Datasets
| Knowledge Sources | |
|---|---|
| Domains | Data_Preparation, NLP |
| Last Updated | 2026-02-06 23:00 GMT |
Overview
Concrete tool, provided by the Axolotl framework, for loading, tokenizing, and preparing training datasets.
Description
The load_datasets function is the primary entry point for dataset preparation in Axolotl. It orchestrates the full pipeline: it loads the tokenizer, optionally loads a processor (for multimodal models), delegates to prepare_datasets for SFT data or to prepare_preference_datasets for RL data, and returns a TrainDatasetMeta namedtuple containing the prepared datasets and the total training step count.
Internally, prepare_datasets (in utils/data/sft.py) handles individual dataset loading, prompt strategy application, tokenization, merging, deduplication, and train/eval splitting.
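The stages listed above can be illustrated with a toy, dependency-free sketch. Nothing here is Axolotl code: toy_tokenize, prepare_toy_datasets, and the whitespace "tokenizer" are hypothetical stand-ins, and the real prepare_datasets works with prompt strategies and a real tokenizer. The sketch only shows one plausible order of operations: tokenize, merge, deduplicate, then split.

```python
import random


def toy_tokenize(text: str) -> list[str]:
    """Hypothetical stand-in for a real tokenizer: lowercase whitespace split."""
    return text.lower().split()


def prepare_toy_datasets(sources, val_set_size=0.25, seed=42):
    """Tokenize, merge, deduplicate, and split several toy text sources.

    `sources` is a list of lists of strings; `val_set_size` is treated here as
    the fraction of deduplicated rows held out for evaluation (an assumption
    made for this sketch).
    """
    # Tokenize every row of every source and merge into one list.
    merged = [toy_tokenize(text) for source in sources for text in source]

    # Deduplicate on the token sequence while preserving first-seen order.
    seen = set()
    unique = []
    for tokens in merged:
        key = tuple(tokens)
        if key not in seen:
            seen.add(key)
            unique.append(tokens)

    # Shuffle deterministically, then carve off the evaluation split.
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_eval = int(len(unique) * val_set_size)
    eval_rows = unique[:n_eval]
    train_rows = unique[n_eval:]
    return train_rows, eval_rows


sources = [
    ["Hello world", "How are you"],
    ["hello world", "Fine thanks", "See you soon"],
]
train_rows, eval_rows = prepare_toy_datasets(sources)
print(len(train_rows), len(eval_rows))  # 4 unique rows -> prints "3 1"
```

Deduplicating before splitting, as done here, keeps identical rows from landing in both the training and evaluation sets.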
Usage
Call this function after configuration validation and before model loading. The train orchestration function calls it to prepare all training data.
Code Reference
Source Location
- Repository: axolotl
- File: src/axolotl/common/datasets.py
- Lines: L39-98
Signature
def load_datasets(
    *,
    cfg: DictDefault,
    cli_args: PreprocessCliArgs | TrainerCliArgs | None = None,
    debug: bool = False,
) -> TrainDatasetMeta:
    """Load and prepare training datasets based on configuration.

    Args:
        cfg: Full training configuration dictionary.
        cli_args: Optional CLI arguments for preprocessing or training.
        debug: Enable debug mode for dataset preparation.

    Returns:
        TrainDatasetMeta: Named tuple containing (train_dataset, eval_dataset,
            total_num_steps, prompters).
    """
Import
from axolotl.common.datasets import load_datasets
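The bare `*` in the signature makes every parameter keyword-only, so a positional call such as load_datasets(cfg) is rejected while load_datasets(cfg=cfg) is accepted. The sketch below demonstrates this with a hypothetical stand-in, load_datasets_stub, that mirrors only the parameter list; it does not import or call Axolotl.

```python
def load_datasets_stub(*, cfg, cli_args=None, debug=False):
    """Hypothetical stand-in mirroring the keyword-only parameter list."""
    return {"cfg": cfg, "cli_args": cli_args, "debug": debug}


# Keyword call: accepted, and the optional parameters take their defaults.
result = load_datasets_stub(cfg={"sequence_len": 2048})

# Positional call: Python raises TypeError before the function body runs.
try:
    load_datasets_stub({"sequence_len": 2048})
    positional_ok = True
except TypeError:
    positional_ok = False

print(result["debug"], positional_ok)
```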
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| cfg | DictDefault | Yes | Full config with datasets list, tokenizer settings, sequence_len, val_set_size, etc. |
| cli_args | PreprocessCliArgs or TrainerCliArgs or None | No | CLI arguments controlling preprocessing behavior |
| debug | bool | No (default: False) | Enable debug mode for verbose dataset preparation logging |
Outputs
| Name | Type | Description |
|---|---|---|
| return | TrainDatasetMeta | Named tuple with fields: train_dataset (Dataset), eval_dataset (Dataset or None), total_num_steps (int), prompters (list) |
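Because eval_dataset may be None, callers generally need to guard before using it. The following sketch shows that access pattern with a hypothetical ToyDatasetMeta that reuses the field names from the table above, with plain lists standing in for Dataset objects. It is an illustration only, not the real TrainDatasetMeta.

```python
from typing import NamedTuple, Optional


class ToyDatasetMeta(NamedTuple):
    """Hypothetical stand-in using the field names documented above."""

    train_dataset: list
    eval_dataset: Optional[list]
    total_num_steps: int
    prompters: list


def summarize(meta: ToyDatasetMeta) -> dict:
    """Return sample counts, treating a missing eval split as zero rows."""
    eval_count = len(meta.eval_dataset) if meta.eval_dataset is not None else 0
    return {
        "train": len(meta.train_dataset),
        "eval": eval_count,
        "steps": meta.total_num_steps,
    }


meta = ToyDatasetMeta(
    train_dataset=[{"input_ids": [1, 2, 3]}, {"input_ids": [4, 5]}],
    eval_dataset=None,  # no evaluation split was produced
    total_num_steps=10,
    prompters=[],
)
summary = summarize(meta)
print(summary)
```

Checking `is not None` explicitly, rather than relying on truthiness, distinguishes a missing evaluation split from an empty one.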
Usage Examples
Basic Dataset Loading
from axolotl.cli.config import load_cfg
from axolotl.utils.config import validate_config
from axolotl.common.datasets import load_datasets
# Load and validate config
cfg = load_cfg("examples/llama-3/qlora-1b.yml")
cfg = validate_config(cfg)
# Load datasets
dataset_meta = load_datasets(cfg=cfg)
print(f"Training samples: {len(dataset_meta.train_dataset)}")
print(f"Eval samples: {len(dataset_meta.eval_dataset) if dataset_meta.eval_dataset else 0}")
print(f"Total training steps: {dataset_meta.total_num_steps}")
Dataset Loading for Preprocessing
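from axolotl.cli.args import PreprocessCliArgs
from axolotl.cli.config import load_cfg
from axolotl.common.datasets import load_datasets
# Load the config first; the path matches the basic example
cfg = load_cfg("examples/llama-3/qlora-1b.yml")
# Pass preprocessing CLI arguments and enable debug mode
cli_args = PreprocessCliArgs(download=True)
dataset_meta = load_datasets(cfg=cfg, cli_args=cli_args, debug=True)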