Implementation:Run llama Llama index Download Dataset

From Leeroopedia
Knowledge Sources
Domains Download, Dataset
Last Updated 2026-02-11 19:00 GMT

Overview

Provides functions for downloading LlamaDatasets (labelled evaluation datasets) and their associated source files from the LlamaHub repository.

Description

The download/dataset.py module handles the retrieval of pre-built evaluation datasets from the llama-datasets GitHub repository. It manages a DATASET_CLASS_FILENAME_REGISTRY that maps dataset class names (such as LabelledRagDataset, LabelledPairwiseEvaluatorDataset, and LabelledEvaluatorDataset) to their corresponding JSON filenames.
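A registry of this kind can be sketched as a plain dictionary lookup. The snippet below is a minimal illustration, not the module's actual code: the name EXAMPLE_REGISTRY, the helper dataset_filename, and the three filenames are assumptions chosen for the example.

```python
# Hypothetical registry mapping dataset class names to JSON filenames.
# The filenames are assumed for illustration and may differ from the real module.
EXAMPLE_REGISTRY = {
    "LabelledRagDataset": "rag_dataset.json",
    "LabelledPairwiseEvaluatorDataset": "pairwise_evaluator_dataset.json",
    "LabelledEvaluatorDataset": "evaluator_dataset.json",
}


def dataset_filename(class_name: str) -> str:
    """Return the JSON filename registered for a dataset class name."""
    try:
        return EXAMPLE_REGISTRY[class_name]
    except KeyError:
        # Fail loudly on unknown classes instead of guessing a filename.
        raise ValueError(f"Unknown dataset class: {class_name!r}") from None
```

Keeping the mapping in one place means a new dataset class only needs one new registry entry.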

The module implements a three-step workflow:

  1. get_dataset_info resolves dataset metadata by looking up the dataset class in a library.json file, first checking a local cache and then fetching from the remote repository if necessary. It returns the dataset ID, the canonical class name, and a list of associated source files.
  2. download_dataset_and_source_files downloads the dataset JSON file and any source files (handling both text and binary/PDF files) from GitHub LFS URLs.
  3. download_llama_dataset is the primary entry point that orchestrates the full download flow: initializing a local directory, fetching dataset info, downloading all files, and returning a tuple of paths to the dataset JSON and the source files directory.
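
The cache-first lookup in step 1 can be sketched roughly as follows. This is a simplified stand-in under stated assumptions, not the module's implementation: resolve_dataset_info and the injected fetch_remote callable are hypothetical, and the "id", "className", and "source_files" keys are an assumed layout for library.json entries.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List, Tuple


def resolve_dataset_info(
    dataset_key: str,
    cache_file: Path,
    fetch_remote: Callable[[], str],
    refresh_cache: bool = False,
) -> Tuple[str, str, List[str]]:
    """Look up a dataset in a cached library.json, falling back to a remote fetch."""
    library: Dict[str, dict] = {}

    # Step 1: use the local cache unless the caller asked for a refresh.
    if cache_file.exists() and not refresh_cache:
        library = json.loads(cache_file.read_text())

    # Step 2: on a cache miss (or a forced refresh), fetch and re-cache.
    if dataset_key not in library:
        library = json.loads(fetch_remote())
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        cache_file.write_text(json.dumps(library))

    if dataset_key not in library:
        raise ValueError(f"Dataset {dataset_key!r} not found in library")

    entry = library[dataset_key]
    # Assumed entry layout: dataset ID, canonical class name, optional source files.
    return entry["id"], entry["className"], entry.get("source_files", [])
```

Passing fetch_remote as a callable keeps the sketch free of network code, so the cache behaviour can be exercised offline.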

The module supports features such as refresh_cache to force re-download, custom_dir/custom_path for controlling download location, override_path to write directly to the specified directory, and show_progress for a tqdm progress bar during source file downloads.
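
How the location options might combine can be sketched as below. The helper resolve_target_dir and the default folder name "downloads" are hypothetical, and the precedence shown (custom_path over custom_dir) is an assumption based on the parameter descriptions rather than confirmed behaviour.

```python
from pathlib import Path
from typing import Optional


def resolve_target_dir(
    parent_dir: str,
    dataset_id: str,
    custom_dir: Optional[str] = None,
    custom_path: Optional[str] = None,
    override_path: bool = False,
) -> Path:
    """Pick the directory that downloaded files would be written to."""
    if custom_path is not None:
        # An explicit path takes precedence over any directory name (assumed).
        base = Path(custom_path)
    else:
        # "downloads" is a placeholder default, not the module's real folder name.
        base = Path(parent_dir) / (custom_dir or "downloads")

    # override_path writes straight into the base directory;
    # otherwise files go into a per-dataset subdirectory.
    return base if override_path else base / dataset_id
```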

Usage

Use this module when you need to download pre-built labelled datasets from LlamaHub for evaluation purposes. It is typically called through the CLI command llamaindex-cli download-llamadataset or programmatically when setting up evaluation benchmarks with LabelledRagDataset and similar dataset classes.

Code Reference

Source Location

Signature

def download_llama_dataset(
    dataset_class: str,
    llama_datasets_url: str = LLAMA_DATASETS_URL,
    llama_datasets_lfs_url: str = LLAMA_DATASETS_LFS_URL,
    llama_datasets_source_files_tree_url: str = LLAMA_DATASETS_SOURCE_FILES_GITHUB_TREE_URL,
    refresh_cache: bool = False,
    custom_dir: Optional[str] = None,
    custom_path: Optional[str] = None,
    source_files_dirpath: str = LLAMA_SOURCE_FILES_PATH,
    library_path: str = "llama_datasets/library.json",
    disable_library_cache: bool = False,
    override_path: bool = False,
    show_progress: bool = False,
) -> Any

Import

from llama_index.core.download.dataset import download_llama_dataset

I/O Contract

Inputs

Name Type Required Description
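dataset_class str Yes The name of the LlamaDataset to download, used as the lookup key in library.json (e.g., "PaulGrahamEssayDataset"). The entry found there supplies the canonical class name, such as LabelledRagDataset.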
llama_datasets_url str No Base URL for the raw dataset content on GitHub. Defaults to the llama_index main branch.
llama_datasets_lfs_url str No Base URL for GitHub LFS (large file storage) content. Used for downloading actual dataset files.
llama_datasets_source_files_tree_url str No GitHub tree URL used to enumerate source files for a dataset.
refresh_cache bool No If True, skips local cache and re-downloads from remote. Defaults to False.
custom_dir Optional[str] No Custom directory name under the parent folder for downloads.
custom_path Optional[str] No Custom absolute directory path for downloads.
source_files_dirpath str No Subdirectory name for source files. Defaults to "source_files".
library_path str No Relative path to the library.json metadata file. Defaults to "llama_datasets/library.json".
disable_library_cache bool No If True, does not write library.json to local cache. Defaults to False.
override_path bool No If True, writes files directly to the base directory instead of a dataset_id subdirectory. Defaults to False.
show_progress bool No If True, shows a tqdm progress bar during source file download. Defaults to False.

Outputs

Name Type Description
result Tuple[str, str] A tuple of (path_to_dataset_json, path_to_source_files_directory).
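
A caller may want to check the returned pair before loading anything. The helper below is a small sketch for that purpose; summarize_download is a hypothetical name and is not part of the module.

```python
from pathlib import Path
from typing import List, Tuple


def summarize_download(result: Tuple[str, str]) -> Tuple[Path, List[Path]]:
    """Validate a (dataset_json, source_files_dir) pair and list the source files."""
    dataset_json, source_dir = (Path(p) for p in result)

    if not dataset_json.is_file():
        raise FileNotFoundError(f"Dataset JSON not found: {dataset_json}")

    # A dataset may ship without source files, so a missing directory is tolerated.
    sources = sorted(p for p in source_dir.iterdir() if p.is_file()) if source_dir.is_dir() else []
    return dataset_json, sources
```

The tuple returned by download_llama_dataset could be passed straight in as result.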

Usage Examples

from llama_index.core.download.dataset import download_llama_dataset
from llama_index.core.llama_dataset import LabelledRagDataset

# Download a labelled RAG dataset by its library.json key.
# "PaulGrahamEssayDataset" is an example key; substitute the dataset you need.
dataset_path, source_files_path = download_llama_dataset(
    dataset_class="PaulGrahamEssayDataset",
    show_progress=True,
)

# Load the dataset from the downloaded JSON
rag_dataset = LabelledRagDataset.from_json(dataset_path)
