Implementation:Huggingface Datasets Get Dataset Config Info

From Leeroopedia
Knowledge Sources
Domains: Data_Engineering, NLP
Last Updated: 2026-02-14 18:00 GMT

Overview

Concrete tool for retrieving detailed metadata (features, splits, size) for a specific dataset configuration, provided by the HuggingFace Datasets library.

Description

get_dataset_config_info returns a DatasetInfo object containing the full metadata for a specific configuration of a dataset. It instantiates a DatasetBuilder via load_dataset_builder and reads its .info property. If the splits information is missing from that metadata, it calls the builder's _split_generators with a StreamingDownloadManager to discover the split names without downloading the full dataset. If split discovery fails, it raises a SplitsNotFoundError.
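
The control flow described above can be sketched with stand-in objects. This is a minimal illustration, not the library's code: FakeBuilder, config_info_sketch, the public split_generators method, and the locally defined SplitsNotFoundError are all hypothetical names, and the value stored per split is a simplification.

```python
from types import SimpleNamespace


class SplitsNotFoundError(ValueError):
    """Local stand-in for the library's exception of the same name."""


def config_info_sketch(builder):
    """Read builder.info and discover split names only if they are missing."""
    info = builder.info
    if info.splits is None:
        try:
            # In the library, the equivalent call receives a
            # StreamingDownloadManager so the full data is not downloaded.
            generators = builder.split_generators()
            info.splits = {g.name: {"name": g.name} for g in generators}
        except Exception as err:
            # Any failure during discovery surfaces as one error type.
            raise SplitsNotFoundError("could not determine split names") from err
    return info


class FakeBuilder:
    """Hypothetical builder whose metadata has no recorded splits."""

    def __init__(self, split_names, fail=False):
        self.info = SimpleNamespace(splits=None)
        self._split_names = split_names
        self._fail = fail

    def split_generators(self):
        if self._fail:
            raise RuntimeError("simulated discovery failure")
        return [SimpleNamespace(name=n) for n in self._split_names]


info = config_info_sketch(FakeBuilder(["train", "test"]))
print(list(info.splits))
```

When the builder's metadata already lists its splits, the discovery branch is skipped and the info object is returned unchanged.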

Usage

Use get_dataset_config_info when you need the complete metadata for a dataset configuration, including features, splits, description, and size information, without downloading the actual data files.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/inspect.py
  • Lines: 237-292

Signature

def get_dataset_config_info(
    path: str,
    config_name: Optional[str] = None,
    data_files: Optional[Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]] = None,
    download_config: Optional[DownloadConfig] = None,
    download_mode: Optional[Union[DownloadMode, str]] = None,
    revision: Optional[Union[str, Version]] = None,
    token: Optional[Union[bool, str]] = None,
    **config_kwargs,
) -> DatasetInfo:

Import

from datasets import get_dataset_config_info

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| path | str | Yes | Path to the dataset repository. Can be a local path or a Hub dataset identifier (e.g. 'rajpurkar/squad'). |
| config_name | Optional[str] | No | Name of the dataset configuration. If None, uses the default configuration. |
| data_files | Optional[Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]] | No | Path(s) to source data file(s). |
| download_config | Optional[DownloadConfig] | No | Specific download configuration parameters. |
| download_mode | Optional[Union[DownloadMode, str]] | No | Download/generate mode. Defaults to REUSE_DATASET_IF_EXISTS. |
| revision | Optional[Union[str, Version]] | No | Version of the dataset to load (commit SHA, git tag, or branch). |
| token | Optional[Union[bool, str]] | No | Bearer token for remote files on the Datasets Hub. |
| **config_kwargs | keyword arguments | No | Additional attributes for the builder class that override defaults. |

Outputs

| Name | Type | Description |
|---|---|---|
| info | DatasetInfo | A DatasetInfo object containing the dataset's features, splits, description, citation, license, dataset size, and other metadata. |
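
As a rough illustration of consuming the returned object, the sketch below collects a few fields into a plain dict. The helper name summarize_info is hypothetical, and the values in fake_info are made up. It assumes only the attribute names features, splits, and dataset_size, and reads num_examples defensively because split entries found through discovery may not carry example counts.

```python
from types import SimpleNamespace


def summarize_info(info):
    """Collect a few commonly inspected fields into a plain dict."""
    splits = info.splits or {}
    return {
        # Iterating a features mapping yields the column names.
        "columns": list(info.features) if info.features is not None else [],
        # num_examples may be absent, so fall back to None.
        "splits": {
            name: getattr(split, "num_examples", None)
            for name, split in splits.items()
        },
        "dataset_size": info.dataset_size,
    }


# Stand-in object with made-up values; a real DatasetInfo would be
# obtained from get_dataset_config_info(...).
fake_info = SimpleNamespace(
    features={"text": "string", "label": "int64"},
    splits={
        "train": SimpleNamespace(num_examples=100),
        "test": SimpleNamespace(num_examples=20),
    },
    dataset_size=None,
)
print(summarize_info(fake_info))
```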

Usage Examples

Basic Usage

from datasets import get_dataset_config_info

info = get_dataset_config_info("cornell-movie-review-data/rotten_tomatoes")
print(info.features)
# {'label': ClassLabel(names=['neg', 'pos']), 'text': Value('string')}
print(list(info.splits.keys()))
# ['train', 'validation', 'test']

Inspecting a Specific Configuration

from datasets import get_dataset_config_info

info = get_dataset_config_info("nyu-mll/glue", config_name="mrpc")
print(info.features)
# {'sentence1': Value('string'), 'sentence2': Value('string'),
#  'label': ClassLabel(names=['not_equivalent', 'equivalent']),
#  'idx': Value('int32')}
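
Inspecting Every Configuration

For datasets with several configurations, one possible pattern is to list the configuration names and query each one in turn. The sketch below assumes get_dataset_config_names from the same library for the listing step. The helper collect_features and its two injectable parameters are hypothetical, added so the loop can be exercised with stand-ins instead of network calls.

```python
from types import SimpleNamespace


def collect_features(path, list_configs=None, get_info=None):
    """Map each configuration name to the column names of its features."""
    if list_configs is None or get_info is None:
        # Imported lazily so the helper also runs with stand-ins only.
        from datasets import get_dataset_config_info, get_dataset_config_names

        list_configs = list_configs or get_dataset_config_names
        get_info = get_info or get_dataset_config_info
    return {
        name: list(get_info(path, config_name=name).features)
        for name in list_configs(path)
    }


# Stand-ins with made-up configuration names and columns.
def fake_list_configs(path):
    return ["small", "large"]


def fake_get_info(path, config_name=None):
    return SimpleNamespace(features={"text": None, "source_" + config_name: None})


print(collect_features("user/dataset", fake_list_configs, fake_get_info))
```

With the library installed and network access available, calling collect_features("nyu-mll/glue") without the stand-ins would query each configuration of that dataset in sequence.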

Related Pages

Implements Principle
