
Implementation:Huggingface Datasets Get Dataset Config Info For Verification

From Leeroopedia
Revision as of 12:59, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Huggingface_Datasets_Get_Dataset_Config_Info_For_Verification.md)
Knowledge Sources
Domains Data_Engineering, NLP
Last Updated 2026-02-14 18:00 GMT

Overview

A concrete tool, provided by the HuggingFace Datasets library, for verifying a published dataset configuration by inspecting its metadata remotely.

Description

get_dataset_config_info retrieves the DatasetInfo for a particular configuration of a dataset without downloading the data files. It instantiates the dataset builder, reads the metadata (from the dataset card YAML and builder configuration), and returns the info object containing features, splits, version, and size statistics. If split information is not available in the metadata, it falls back to running split generators with a streaming download manager to discover split names. This function is useful for post-upload verification to confirm that configuration metadata is correctly registered on the Hub.
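
As a rough illustration of the fields mentioned above, the following sketch reads features, splits, version, and size statistics from a DatasetInfo-like object. The helper name `summarize_info` and the stand-in object are hypothetical and exist only for this example. The sketch also assumes that splits discovered through the streaming fallback may arrive as plain dictionaries without an example count, so it treats `num_examples` as optional.

```python
from types import SimpleNamespace


def summarize_info(info):
    """Collect features, splits, version, and sizes from a DatasetInfo-like object.

    `summarize_info` is a hypothetical helper, not part of the datasets library.
    """
    features = info.features
    splits = info.splits or {}
    split_sizes = {}
    for name, split in splits.items():
        # Assumption: splits read from metadata expose `num_examples`, while
        # splits discovered by the fallback may be plain dicts without a count.
        if isinstance(split, dict):
            split_sizes[name] = split.get("num_examples")
        else:
            split_sizes[name] = getattr(split, "num_examples", None)
    return {
        "columns": sorted(features) if features else [],
        "splits": split_sizes,
        "version": str(info.version) if info.version is not None else None,
        "dataset_size": info.dataset_size,
        "download_size": info.download_size,
    }


# Offline demonstration with a stand-in object instead of a real DatasetInfo.
stub_info = SimpleNamespace(
    features={"text": "string", "label": "int64"},
    splits={"train": SimpleNamespace(num_examples=1000), "test": {"name": "test"}},
    version="1.0.0",
    dataset_size=None,
    download_size=None,
)
summary = summarize_info(stub_info)
```

With a real dataset, the object returned by `get_dataset_config_info` could be passed to the same helper in place of the stand-in.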

Usage

Use get_dataset_config_info after publishing to quickly verify that the dataset's schema, splits, and configuration parameters are correctly set on the Hub without downloading the full dataset.
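
One possible way to turn this check into an automated step is sketched below. The helpers `find_mismatches` and `verify_on_hub` are hypothetical names introduced for this example, and the expected column and split names are placeholders. The comparison relies only on `info.features` and `info.splits` behaving like mappings keyed by name.

```python
from types import SimpleNamespace


def find_mismatches(info, expected_columns, expected_splits):
    """Return human-readable differences between expected and actual metadata."""
    actual_columns = set(info.features or {})
    actual_splits = set(info.splits or {})
    problems = []
    for name in sorted(set(expected_columns) - actual_columns):
        problems.append(f"missing column: {name}")
    for name in sorted(actual_columns - set(expected_columns)):
        problems.append(f"unexpected column: {name}")
    for name in sorted(set(expected_splits) - actual_splits):
        problems.append(f"missing split: {name}")
    return problems


def verify_on_hub(path, expected_columns, expected_splits, config_name=None):
    """Fetch metadata from the Hub (needs network access) and compare it."""
    # Imported lazily so the comparison helper above stays dependency-free.
    from datasets import get_dataset_config_info

    info = get_dataset_config_info(path, config_name=config_name)
    return find_mismatches(info, expected_columns, expected_splits)


# Offline demonstration with a stand-in object instead of a real DatasetInfo.
stub_info = SimpleNamespace(features={"text": None, "label": None}, splits={"train": None})
problems = find_mismatches(stub_info, ["text", "label", "id"], ["train", "test"])
```

An empty list from `verify_on_hub` would suggest the published metadata matches expectations; any returned strings name the discrepancies.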

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/inspect.py
  • Lines: 237-292

Signature

def get_dataset_config_info(
    path: str,
    config_name: Optional[str] = None,
    data_files: Optional[Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]] = None,
    download_config: Optional[DownloadConfig] = None,
    download_mode: Optional[Union[DownloadMode, str]] = None,
    revision: Optional[Union[str, Version]] = None,
    token: Optional[Union[bool, str]] = None,
    **config_kwargs,
) -> DatasetInfo:

Import

from datasets import get_dataset_config_info

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| path | str | Yes | Path to the dataset repository (e.g., "username/dataset_name" or local directory). |
| config_name | str | No | Configuration name to inspect. None uses the default config. |
| data_files | str or Sequence or Mapping | No | Path(s) to specific data files. |
| download_config | DownloadConfig | No | Specific download configuration parameters. |
| download_mode | DownloadMode or str | No | Download/generate mode. |
| revision | str or Version | No | Version/branch/commit to inspect. |
| token | str or bool | No | Authentication token for private datasets. |
| **config_kwargs | keyword arguments | No | Additional keyword arguments for the builder. |

Outputs

| Name | Type | Description |
|------|------|-------------|
| return | DatasetInfo | Metadata about the dataset configuration including features, splits, and size. |

Usage Examples

Verification After Upload

from datasets import get_dataset_config_info

# Verify default configuration
info = get_dataset_config_info("my-username/my-dataset")
print(info.features)
# {'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['neg', 'pos'], id=None)}
print(info.splits)
# {'train': SplitInfo(name='train', num_bytes=..., num_examples=1000),
#  'test': SplitInfo(name='test', num_bytes=..., num_examples=200)}

# Verify a specific configuration
info_en = get_dataset_config_info("my-username/my-dataset", config_name="en")
print(info_en.features)
print(info_en.splits)
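
When a dataset has several configurations, the per-configuration check above could be repeated in a loop. The sketch below pairs `get_dataset_config_info` with `get_dataset_config_names`, which is also exported by `datasets`. The function name `inspect_all_configs` and its injectable `list_configs` and `get_info` parameters are hypothetical, added only so the loop can be exercised without network access.

```python
from types import SimpleNamespace


def inspect_all_configs(path, list_configs=None, get_info=None):
    """Map each configuration name to its sorted split names, or to an error string."""
    if list_configs is None or get_info is None:
        # Imported lazily; the real functions need network access for Hub datasets.
        from datasets import get_dataset_config_info, get_dataset_config_names

        list_configs = list_configs or get_dataset_config_names
        get_info = get_info or get_dataset_config_info

    report = {}
    for name in list_configs(path):
        try:
            info = get_info(path, config_name=name)
            report[name] = sorted(info.splits or {})
        except Exception as err:
            # Keep going so that one broken configuration does not hide the others.
            report[name] = f"error: {err}"
    return report


# Offline demonstration with stand-in callables instead of the real library functions.
def fake_list_configs(path):
    return ["en", "fr"]


def fake_get_info(path, config_name=None):
    if config_name == "fr":
        raise ValueError("no metadata")
    return SimpleNamespace(splits={"train": None, "test": None})


report = inspect_all_configs("my-username/my-dataset", fake_list_configs, fake_get_info)
```

Calling `inspect_all_configs("my-username/my-dataset")` without the stand-ins would use the library functions directly.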

Related Pages

Implements Principle
