Implementation:Run llama Llama index Download Dataset

Knowledge Sources	Run_llama_Llama_index
Domains	Download, Dataset
Last Updated	2026-02-11 19:00 GMT

Overview

Provides functions for downloading LlamaDatasets (labelled evaluation datasets) and their associated source files from the LlamaHub repository.

Description

The download/dataset.py module handles the retrieval of pre-built evaluation datasets from the llama-datasets GitHub repository. It manages a DATASET_CLASS_FILENAME_REGISTRY that maps dataset class names (such as LabelledRagDataset, LabelledPairwiseEvaluatorDataset, and LabelledEvaluatorDataset) to their corresponding JSON filenames.
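
The registry can be pictured as a plain dictionary keyed by class name. The following is a minimal sketch, not the module's actual contents: the filename values are assumptions chosen for illustration, and resolve_dataset_filename is a hypothetical helper that does not exist in the module.

```python
# Illustrative stand-in for DATASET_CLASS_FILENAME_REGISTRY.
# The filename values are assumptions, not copied from the module.
DATASET_CLASS_FILENAME_REGISTRY = {
    "LabelledRagDataset": "rag_dataset.json",
    "LabelledPairwiseEvaluatorDataset": "pairwise_evaluator_dataset.json",
    "LabelledEvaluatorDataset": "evaluator_dataset.json",
}


def resolve_dataset_filename(class_name: str) -> str:
    """Hypothetical helper: map a dataset class name to its JSON filename."""
    try:
        return DATASET_CLASS_FILENAME_REGISTRY[class_name]
    except KeyError:
        known = ", ".join(sorted(DATASET_CLASS_FILENAME_REGISTRY))
        raise ValueError(
            f"Unknown dataset class {class_name!r}; known classes: {known}"
        ) from None
```

Failing with a list of the known class names makes a mistyped class easier to spot than a bare KeyError would.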

The module implements a three-step workflow:

1. get_dataset_info resolves dataset metadata by looking up the dataset class in a library.json file, first checking a local cache and then fetching from the remote repository if necessary. It returns the dataset ID, the canonical class name, and a list of associated source files.
2. download_dataset_and_source_files downloads the dataset JSON file and any source files (handling both text and binary/PDF files) from GitHub LFS URLs.
3. download_llama_dataset is the primary entry point that orchestrates the full download flow: initializing a local directory, fetching dataset info, downloading all files, and returning a tuple of paths to the dataset JSON and the source files directory.
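
The three steps above can be sketched offline as follows. This is a simplified illustration under stated assumptions, not the module's code: the network is replaced by a hypothetical fetch callable, the library entry keys ("id", "className", "source_files") are assumed, and the function names get_info, download_files, and download are invented for the example.

```python
from pathlib import Path
from typing import Callable, Dict, Tuple

# Hypothetical fetcher: takes a repository-relative path, returns raw bytes.
Fetcher = Callable[[str], bytes]
SOURCE_DIR = "source_files"


def get_info(library: Dict[str, dict], name: str) -> dict:
    """Step 1: resolve the dataset ID, class name, and source files."""
    if name not in library:
        raise ValueError(f"Dataset {name!r} not found in library")
    entry = library[name]  # entry keys below are assumed for illustration
    return {
        "dataset_id": entry["id"],
        "dataset_class_name": entry["className"],
        "source_files": entry.get("source_files", []),
    }


def download_files(fetch: Fetcher, info: dict, filename: str, target: Path) -> None:
    """Step 2: write the dataset JSON, then each source file."""
    source_dir = target / SOURCE_DIR
    source_dir.mkdir(parents=True, exist_ok=True)
    (target / filename).write_bytes(fetch(f"{info['dataset_id']}/{filename}"))
    for name in info["source_files"]:
        payload = fetch(f"{info['dataset_id']}/{SOURCE_DIR}/{name}")
        if name.lower().endswith(".pdf"):
            # Binary files are written byte-for-byte.
            (source_dir / name).write_bytes(payload)
        else:
            # Everything else is treated as UTF-8 text in this sketch.
            (source_dir / name).write_text(payload.decode("utf-8"), encoding="utf-8")


def download(
    fetch: Fetcher,
    library: Dict[str, dict],
    name: str,
    filename: str,
    base_dir: Path,
) -> Tuple[str, str]:
    """Step 3: orchestrate and return (dataset JSON path, source files dir)."""
    info = get_info(library, name)
    target = base_dir / info["dataset_id"]
    download_files(fetch, info, filename, target)
    return str(target / filename), str(target / SOURCE_DIR)
```

Passing the fetcher in as an argument keeps the flow testable without network access; a real implementation would issue HTTP requests at that point.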

Optional flags control the download: refresh_cache forces a re-download, custom_dir and custom_path set the download location, override_path writes files directly into that directory rather than into a dataset_id subdirectory, and show_progress displays a tqdm progress bar while source files download.
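
How the location options might combine can be sketched as below. This is an illustration only: resolve_target_dir is a hypothetical helper, the default folder name "llamadatasets" and the rule that custom_dir and custom_path are mutually exclusive are assumptions, and the parent argument stands in for whatever parent folder the module uses.

```python
from pathlib import Path
from typing import Optional


def resolve_target_dir(
    dataset_id: str,
    parent: Path,
    custom_dir: Optional[str] = None,
    custom_path: Optional[str] = None,
    override_path: bool = False,
) -> Path:
    """Hypothetical helper: pick the directory that files are written to."""
    if custom_dir is not None and custom_path is not None:
        # Assumed rule: the two location options cannot be combined.
        raise ValueError("Specify only one of custom_dir and custom_path")
    if custom_path is not None:
        base = Path(custom_path)
    else:
        # "llamadatasets" is an assumed default folder name.
        base = parent / (custom_dir or "llamadatasets")
    # override_path writes into the base directory itself.
    return base if override_path else base / dataset_id
```

For example, resolve_target_dir("ds", Path("/p"), custom_dir="x") yields /p/x/ds, while adding override_path=True yields /p/x.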

Usage

Use this module when you need to download pre-built labelled datasets from LlamaHub for evaluation purposes. It is typically called through the CLI command llamaindex-cli download-llamadataset or programmatically when setting up evaluation benchmarks with LabelledRagDataset and similar dataset classes.

Code Reference

Source Location

Repository: Run_llama_Llama_index
File: llama-index-core/llama_index/core/download/dataset.py

Signature

def download_llama_dataset(
    dataset_class: str,
    llama_datasets_url: str = LLAMA_DATASETS_URL,
    llama_datasets_lfs_url: str = LLAMA_DATASETS_LFS_URL,
    llama_datasets_source_files_tree_url: str = LLAMA_DATASETS_SOURCE_FILES_GITHUB_TREE_URL,
    refresh_cache: bool = False,
    custom_dir: Optional[str] = None,
    custom_path: Optional[str] = None,
    source_files_dirpath: str = LLAMA_SOURCE_FILES_PATH,
    library_path: str = "llama_datasets/library.json",
    disable_library_cache: bool = False,
    override_path: bool = False,
    show_progress: bool = False,
) -> Any

Import

from llama_index.core.download.dataset import download_llama_dataset

I/O Contract

Inputs

Name	Type	Required	Description
dataset_class	str	Yes	The name of the LlamaDataset to download, as listed in library.json (e.g., "PaulGrahamEssayDataset"). It is resolved to a dataset class such as LabelledRagDataset.
llama_datasets_url	str	No	Base URL for the raw dataset content on GitHub. Defaults to the llama_index main branch.
llama_datasets_lfs_url	str	No	Base URL for GitHub LFS (large file storage) content. Used for downloading actual dataset files.
llama_datasets_source_files_tree_url	str	No	GitHub tree URL used to enumerate source files for a dataset.
refresh_cache	bool	No	If True, skips the local cache and re-downloads from the remote repository. Defaults to False.
custom_dir	Optional[str]	No	Custom directory name under the parent folder for downloads.
custom_path	Optional[str]	No	Custom absolute directory path for downloads.
source_files_dirpath	str	No	Subdirectory name for source files. Defaults to "source_files".
library_path	str	No	Relative path to the library.json metadata file. Defaults to "llama_datasets/library.json".
disable_library_cache	bool	No	If True, does not write library.json to local cache. Defaults to False.
override_path	bool	No	If True, writes files directly to the base directory instead of a dataset_id subdirectory. Defaults to False.
show_progress	bool	No	If True, shows a tqdm progress bar during source file download. Defaults to False.
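
The interplay of refresh_cache and disable_library_cache can be sketched as a cache-first lookup. This is a simplified illustration, not the module's logic: load_library and fetch_remote are hypothetical names, and the real lookup may also fall back to the remote when the requested dataset is missing from the cached file.

```python
import json
from pathlib import Path
from typing import Callable, Dict


def load_library(
    cache_file: Path,
    fetch_remote: Callable[[], str],
    refresh_cache: bool = False,
    disable_library_cache: bool = False,
) -> Dict[str, dict]:
    """Return the parsed library, preferring the local cache unless refreshing."""
    if not refresh_cache and cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    # fetch_remote is a hypothetical callable standing in for the HTTP request.
    raw = fetch_remote()
    if not disable_library_cache:
        # Persist the fetched library so later calls can skip the network.
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        cache_file.write_text(raw, encoding="utf-8")
    return json.loads(raw)
```

In this sketch refresh_cache decides whether the cache is read, while disable_library_cache decides whether it is written; the two flags are independent.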

Outputs

Name	Type	Description
result	Tuple[str, str]	A tuple of (path_to_dataset_json, path_to_source_files_directory). The signature annotates the return type as Any, so type checkers will not enforce the tuple shape.
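
Because the return annotation is Any, callers may want to check the returned paths before using them. The sketch below shows one way to do that with the standard library only; summarize_download is a hypothetical helper, not part of the module.

```python
import json
from pathlib import Path
from typing import Tuple


def summarize_download(result: Tuple[str, str]) -> dict:
    """Hypothetical helper: validate and summarize the returned path tuple."""
    dataset_json, source_dir = result
    json_path, src_path = Path(dataset_json), Path(source_dir)
    if not json_path.is_file():
        raise FileNotFoundError(f"Dataset JSON not found: {json_path}")
    with json_path.open(encoding="utf-8") as fh:
        payload = json.load(fh)
    # A dataset without source files may have no source directory at all.
    source_files = sorted(p.name for p in src_path.iterdir()) if src_path.is_dir() else []
    return {
        "dataset_json": str(json_path),
        "top_level_keys": sorted(payload) if isinstance(payload, dict) else [],
        "source_files": source_files,
    }
```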

Usage Examples

from llama_index.core.download.dataset import download_llama_dataset
from llama_index.core.llama_dataset import LabelledRagDataset

# Download a labelled RAG dataset.
# dataset_class is resolved through library.json; "PaulGrahamEssayDataset"
# is assumed here to be one of its entries.
dataset_path, source_files_path = download_llama_dataset(
    dataset_class="PaulGrahamEssayDataset",
    show_progress=True,
)

# Load the dataset from the downloaded JSON
rag_dataset = LabelledRagDataset.from_json(dataset_path)

Related Pages

Environment:Run_llama_Llama_index_Python_LlamaIndex_Core
