Implementation:Huggingface Datasets Get Dataset Config Info For Verification
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Concrete tool, provided by the Hugging Face Datasets library, for verifying a published dataset's configuration by inspecting its metadata remotely.
Description
get_dataset_config_info retrieves the DatasetInfo for a particular configuration of a dataset without downloading the data files. It instantiates the dataset builder, reads the metadata (from the dataset card YAML and builder configuration), and returns the info object containing features, splits, version, and size statistics. If split information is not available in the metadata, it falls back to running split generators with a streaming download manager to discover split names. This function is useful for post-upload verification to confirm that configuration metadata is correctly registered on the Hub.
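Because split information can come either from stored metadata or from the streaming fallback, code that consumes info.splits may want to be defensive about what each entry contains. The following is a minimal sketch, not part of the library: summarize_splits is a hypothetical helper, and it assumes that a split entry is either a SplitInfo-like object with a num_examples attribute or a plain mapping that may lack size fields (the exact shape of fallback-discovered entries is an assumption here).

```python
from typing import Any, Dict, Mapping, Optional


def summarize_splits(splits: Optional[Mapping[str, Any]]) -> Dict[str, Optional[int]]:
    """Map each split name to its example count, or None when no count is recorded.

    Hypothetical helper: accepts entries that are either objects exposing a
    `num_examples` attribute or plain mappings that may not carry size fields.
    """
    summary: Dict[str, Optional[int]] = {}
    for name, entry in (splits or {}).items():
        if isinstance(entry, Mapping):
            # Mapping-style entry (assumed shape for splits found without stored sizes).
            summary[name] = entry.get("num_examples")
        else:
            # Object-style entry, e.g. something shaped like SplitInfo.
            summary[name] = getattr(entry, "num_examples", None)
    return summary
```

Passing info.splits from a get_dataset_config_info result to this helper gives a compact view of which splits exist and which of them carry size statistics.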
Usage
Use get_dataset_config_info after publishing to quickly verify that the dataset's schema, splits, and configuration parameters are correctly set on the Hub without downloading the full dataset.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/inspect.py
- Lines: 237-292
Signature
def get_dataset_config_info(
    path: str,
    config_name: Optional[str] = None,
    data_files: Optional[Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]]]] = None,
    download_config: Optional[DownloadConfig] = None,
    download_mode: Optional[Union[DownloadMode, str]] = None,
    revision: Optional[Union[str, Version]] = None,
    token: Optional[Union[bool, str]] = None,
    **config_kwargs,
) -> DatasetInfo:
Import
from datasets import get_dataset_config_info
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path | str | Yes | Path to the dataset repository (e.g., "username/dataset_name") or a local directory. |
| config_name | str | No | Configuration name to inspect. None uses the default config. |
| data_files | str or Sequence or Mapping | No | Path(s) to specific data files. |
| download_config | DownloadConfig | No | Specific download configuration parameters. |
| download_mode | DownloadMode or str | No | Download/generate mode. |
| revision | str or Version | No | Version/branch/commit to inspect. |
| token | str or bool | No | Authentication token for private datasets. |
| **config_kwargs | Any | No | Additional keyword arguments for the builder. |
Outputs
| Name | Type | Description |
|---|---|---|
| return | DatasetInfo | Metadata about the dataset configuration including features, splits, and size. |
Usage Examples
Verification After Upload
from datasets import get_dataset_config_info
# Verify default configuration
info = get_dataset_config_info("my-username/my-dataset")
print(info.features)
# {'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['neg', 'pos'], id=None)}
print(info.splits)
# {'train': SplitInfo(name='train', num_bytes=..., num_examples=1000),
# 'test': SplitInfo(name='test', num_bytes=..., num_examples=200)}
# Verify a specific configuration
info_en = get_dataset_config_info("my-username/my-dataset", config_name="en")
print(info_en.features)
print(info_en.splits)
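Automated Check Against Expected Schema

Printing the info object works for a manual check; for a scripted post-upload check, the same fields can be compared against what was expected. The sketch below is one possible approach under stated assumptions: check_config_info and verify_on_hub are hypothetical helpers (not part of the datasets library), and the repository name, column names, and split names are placeholders. Only get_dataset_config_info and its path and config_name parameters come from the signature documented above.

```python
from typing import Any, Iterable, List, Optional


def check_config_info(
    info: Any,
    expected_columns: Iterable[str],
    expected_splits: Iterable[str],
) -> List[str]:
    """Return human-readable problems; an empty list means the check passed.

    Hypothetical helper: `info` only needs `features` and `splits` attributes
    that are dict-like (or None), as on a DatasetInfo object.
    """
    problems: List[str] = []
    # Iterating a dict-like object yields its keys (column or split names).
    actual_columns = set(info.features or {})
    actual_splits = set(info.splits or {})

    missing_columns = set(expected_columns) - actual_columns
    missing_splits = set(expected_splits) - actual_splits
    if missing_columns:
        problems.append(f"missing columns: {sorted(missing_columns)}")
    if missing_splits:
        problems.append(f"missing splits: {sorted(missing_splits)}")
    return problems


def verify_on_hub(
    repo_id: str,
    expected_columns: Iterable[str],
    expected_splits: Iterable[str],
    config_name: Optional[str] = None,
) -> List[str]:
    """Fetch the config info from the Hub and check it (requires network access)."""
    # Imported lazily so the pure check above can be used without `datasets`.
    from datasets import get_dataset_config_info

    info = get_dataset_config_info(repo_id, config_name=config_name)
    return check_config_info(info, expected_columns, expected_splits)
```

A call such as verify_on_hub("my-username/my-dataset", ["text", "label"], ["train", "test"]) would return an empty list when every expected column and split is registered, and a list of messages otherwise. Keeping the comparison separate from the network call makes the check easy to unit-test with stand-in objects.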