Implementation:Run llama Llama index Download Dataset
| Knowledge Sources | |
|---|---|
| Domains | Download, Dataset |
| Last Updated | 2026-02-11 19:00 GMT |
Overview
Provides functions for downloading LlamaDatasets (labelled evaluation datasets) and their associated source files from the LlamaHub repository.
Description
The download/dataset.py module handles the retrieval of pre-built evaluation datasets from the llama-datasets GitHub repository. It manages a DATASET_CLASS_FILENAME_REGISTRY that maps dataset class names (such as LabelledRagDataset, LabelledPairwiseEvaluatorDataset, and LabelledEvaluatorDataset) to their corresponding JSON filenames.
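A registry of this kind can be sketched as a plain dictionary. The class names come from the description above; the JSON filenames and the `dataset_filename` helper are assumptions made for illustration and may not match the module's actual values.

```python
from typing import Dict

# Illustrative registry. Keys are the class names named in the text above;
# the filenames are assumed, not confirmed values from the module.
DATASET_CLASS_FILENAME_REGISTRY: Dict[str, str] = {
    "LabelledRagDataset": "rag_dataset.json",
    "LabelledPairwiseEvaluatorDataset": "pairwise_evaluator_dataset.json",
    "LabelledEvaluatorDataset": "evaluator_dataset.json",
}


def dataset_filename(class_name: str) -> str:
    """Hypothetical helper: return the JSON filename for a dataset class."""
    try:
        return DATASET_CLASS_FILENAME_REGISTRY[class_name]
    except KeyError:
        raise ValueError(f"Unknown dataset class: {class_name!r}") from None
```

Keeping the mapping in one place means the download code only needs the resolved class name to know which JSON file to fetch.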
The module implements a three-step workflow:
- get_dataset_info resolves dataset metadata by looking up the dataset class in a library.json file, first checking a local cache and then fetching from the remote repository if necessary. It returns the dataset ID, the canonical class name, and a list of associated source files.
- download_dataset_and_source_files downloads the dataset JSON file and any source files (handling both text and binary/PDF files) from GitHub LFS URLs.
- download_llama_dataset is the primary entry point that orchestrates the full download flow: initializing a local directory, fetching dataset info, downloading all files, and returning a tuple of paths to the dataset JSON and the source files directory.
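The cache-then-remote lookup and the text/binary file handling described above can be sketched as follows. This is a minimal illustration, not the module's code: `lookup_dataset_entry`, `write_source_file`, and the `fetch_remote_library` callable are hypothetical names, and the remote fetch is injected so the sketch runs without network access.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Optional, Union


def lookup_dataset_entry(
    dataset_class: str,
    local_library_path: Path,
    fetch_remote_library: Callable[[], str],
    refresh_cache: bool = False,
    disable_library_cache: bool = False,
) -> Dict:
    """Return the library.json entry for dataset_class, trying the cache first."""
    entry: Optional[Dict] = None

    # Step 1: consult the cached library.json unless a refresh was requested.
    if not refresh_cache and local_library_path.exists():
        library = json.loads(local_library_path.read_text(encoding="utf-8"))
        entry = library.get(dataset_class)

    # Step 2: fall back to the remote copy when the cache has no entry.
    if entry is None:
        raw = fetch_remote_library()
        library = json.loads(raw)
        if dataset_class not in library:
            raise ValueError(f"{dataset_class!r} not found in library.json")
        entry = library[dataset_class]
        # Optionally store the fetched library for later lookups.
        if not disable_library_cache:
            local_library_path.parent.mkdir(parents=True, exist_ok=True)
            local_library_path.write_text(raw, encoding="utf-8")

    return entry


def write_source_file(dest: Path, content: Union[str, bytes]) -> None:
    """Write a downloaded file, using binary mode for bytes (for example PDFs)."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    if isinstance(content, bytes):
        dest.write_bytes(content)
    else:
        dest.write_text(content, encoding="utf-8")
```

Passing the fetcher as a parameter is a design choice of this sketch: it keeps the caching logic testable in isolation from GitHub.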
Optional parameters adjust this flow: refresh_cache forces a re-download, custom_dir and custom_path control the download location, override_path writes directly into the specified directory, and show_progress displays a tqdm progress bar while source files download.
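How the location parameters might combine can be sketched as below. The helper `resolve_download_dir`, the `DEFAULT_DIR_NAME` value, and the precedence of custom_path over custom_dir are assumptions for illustration; only the parameter meanings come from this page.

```python
from pathlib import Path
from typing import Optional

# Assumed default folder name; the real module may use a different one.
DEFAULT_DIR_NAME = "downloads"


def resolve_download_dir(
    package_root: Path,
    dataset_id: str,
    custom_dir: Optional[str] = None,
    custom_path: Optional[str] = None,
    override_path: bool = False,
) -> Path:
    """Hypothetical helper: compute where a dataset's files would be written."""
    if custom_path is not None:
        # An explicit path replaces the package-relative base directory.
        base = Path(custom_path)
    else:
        # Otherwise use a named folder under the parent directory.
        base = package_root / (custom_dir or DEFAULT_DIR_NAME)
    # override_path writes into the base directory itself, not a
    # dataset_id subdirectory.
    return base if override_path else base / dataset_id
```

For example, under these assumptions `resolve_download_dir(Path("/pkg"), "demo", custom_path="/tmp/data", override_path=True)` yields `/tmp/data` with no per-dataset subfolder.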
Usage
Use this module when you need to download pre-built labelled datasets from LlamaHub for evaluation purposes. It is typically called through the CLI command llamaindex-cli download-llamadataset or programmatically when setting up evaluation benchmarks with LabelledRagDataset and similar dataset classes.
Code Reference
Source Location
- Repository: Run_llama_Llama_index
- File: llama-index-core/llama_index/core/download/dataset.py
Signature
def download_llama_dataset(
    dataset_class: str,
    llama_datasets_url: str = LLAMA_DATASETS_URL,
    llama_datasets_lfs_url: str = LLAMA_DATASETS_LFS_URL,
    llama_datasets_source_files_tree_url: str = LLAMA_DATASETS_SOURCE_FILES_GITHUB_TREE_URL,
    refresh_cache: bool = False,
    custom_dir: Optional[str] = None,
    custom_path: Optional[str] = None,
    source_files_dirpath: str = LLAMA_SOURCE_FILES_PATH,
    library_path: str = "llama_datasets/library.json",
    disable_library_cache: bool = False,
    override_path: bool = False,
    show_progress: bool = False,
) -> Any
Import
from llama_index.core.download.dataset import download_llama_dataset
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataset_class | str | Yes | The dataset's key in library.json (e.g., "PaulGrahamEssayDataset"). The matching entry resolves to a LlamaDataset class such as LabelledRagDataset. |
| llama_datasets_url | str | No | Base URL for the raw dataset content on GitHub. Defaults to the main branch of the llama-datasets repository. |
| llama_datasets_lfs_url | str | No | Base URL for GitHub LFS (large file storage) content. Used for downloading actual dataset files. |
| llama_datasets_source_files_tree_url | str | No | GitHub tree URL used to enumerate source files for a dataset. |
| refresh_cache | bool | No | If True, skips local cache and re-downloads from remote. Defaults to False. |
| custom_dir | Optional[str] | No | Custom directory name under the parent folder for downloads. |
| custom_path | Optional[str] | No | Custom absolute directory path for downloads. |
| source_files_dirpath | str | No | Subdirectory name for source files. Defaults to "source_files". |
| library_path | str | No | Relative path to the library.json metadata file. Defaults to "llama_datasets/library.json". |
| disable_library_cache | bool | No | If True, does not write library.json to local cache. Defaults to False. |
| override_path | bool | No | If True, writes files directly to the base directory instead of a dataset_id subdirectory. Defaults to False. |
| show_progress | bool | No | If True, shows a tqdm progress bar during source file download. Defaults to False. |
Outputs
| Name | Type | Description |
|---|---|---|
| result | Tuple[str, str] | A tuple of (path_to_dataset_json, path_to_source_files_directory). |
Usage Examples
from llama_index.core.download.dataset import download_llama_dataset
from llama_index.core.llama_dataset import LabelledRagDataset

# Download a labelled RAG dataset, identified by its key in library.json
dataset_path, source_files_path = download_llama_dataset(
    dataset_class="PaulGrahamEssayDataset",
    show_progress=True,
)

# Load the dataset from the downloaded JSON
rag_dataset = LabelledRagDataset.from_json(dataset_path)