Implementation:Huggingface Datasets Dataset From Dict
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
A concrete tool, provided by the HuggingFace Datasets library, for creating a Dataset from a Python dictionary.
Description
Dataset.from_dict is a class method that converts a Python dictionary mapping column names to values into an Apache Arrow-backed Dataset. Each dictionary key becomes a column name and each value (a list or Arrow array) provides the column data. If an explicit Features schema is supplied, columns are encoded and cast accordingly; otherwise, types are inferred from the data. The resulting dataset lives in memory and has no associated cache directory.
Usage
Use Dataset.from_dict when you have data organized as a dictionary of lists (columnar format) and want to create a Dataset without going through file I/O. This is the standard entry point for programmatic dataset construction.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/arrow_dataset.py
- Lines: 973-1034
Signature
@classmethod
def from_dict(
    cls,
    mapping: dict,
    features: Optional[Features] = None,
    info: Optional[DatasetInfo] = None,
    split: Optional[NamedSplit] = None,
) -> "Dataset":
Import
from datasets import Dataset
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| mapping | dict | Yes | Mapping of column names (strings) to Arrays or Python lists of values. |
| features | Features | No | Explicit dataset features schema. If provided, data is cast to match. |
| info | DatasetInfo | No | Dataset metadata such as description, citation, etc. |
| split | NamedSplit | No | Name of the dataset split (e.g., "train", "test"). |
Outputs
| Name | Type | Description |
|---|---|---|
| return | Dataset | A new in-memory Dataset backed by an Arrow table. |
Usage Examples
Basic Usage
from datasets import Dataset

ds = Dataset.from_dict({
    "text": ["Hello world", "Goodbye world"],
    "label": [1, 0],
})
print(ds)
# Dataset({
#     features: ['text', 'label'],
#     num_rows: 2
# })
With Explicit Features
from datasets import Dataset, Features, Value, ClassLabel

features = Features({
    "text": Value("string"),
    "label": ClassLabel(names=["negative", "positive"]),
})
ds = Dataset.from_dict(
    {"text": ["Hello", "World"], "label": [0, 1]},
    features=features,
)