Implementation:ChenghaoMou Text dedup Load Dataset
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
A concrete tool, provided by text-dedup, for loading datasets from local files or pre-built HuggingFace Dataset directories into indexed Dataset objects.
Description
The load_dataset function dispatches to either load_from_disk (for pre-built HuggingFace Dataset directories) or hf_load_dataset (for Parquet/CSV/JSON files), based on the type of the input configuration. After loading, it adds an __INDEX__ column via Dataset.map, which gives each document a stable identifier throughout the pipeline.
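The dispatch-then-index flow described above can be sketched in a few lines. This is a minimal illustration, not the library's source: the two config classes are hypothetical stand-ins whose field names (path, file_type, data_files) are assumptions, the loaders are passed in as plain callables so the sketch runs without the datasets package, and a list of dicts stands in for a Dataset. With the real library, the last step would presumably resemble Dataset.map(lambda row, idx: {"__INDEX__": idx}, with_indices=True).

```python
from dataclasses import dataclass
from typing import Any, Callable

Rows = list[dict[str, Any]]  # stand-in for datasets.Dataset in this sketch


@dataclass
class LocalHFDatasetInputConfig:
    """Hypothetical stand-in: a directory written by Dataset.save_to_disk."""
    path: str


@dataclass
class LocalInputConfig:
    """Hypothetical stand-in: raw Parquet/CSV/JSON files."""
    file_type: str
    data_files: list[str]


def load_rows(
    input_config: Any,
    load_from_disk: Callable[[str], Rows],
    hf_load_dataset: Callable[..., Rows],
) -> Rows:
    """Pick a loader from the config type, then attach a stable index."""
    if isinstance(input_config, LocalHFDatasetInputConfig):
        rows = load_from_disk(input_config.path)
    elif isinstance(input_config, LocalInputConfig):
        rows = hf_load_dataset(
            input_config.file_type, data_files=input_config.data_files
        )
    else:
        raise TypeError(f"Unsupported input config: {type(input_config).__name__}")

    # Record each row's original position so it can still be identified
    # after later filtering or shuffling changes its position.
    return [{**row, "__INDEX__": idx} for idx, row in enumerate(rows)]
```

Because the index is stored as a column rather than implied by position, a row keeps the same __INDEX__ value even after earlier rows are removed, which is what makes it usable for tracking documents across pipeline stages.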
Usage
Import this function when implementing a deduplication pipeline that needs to load text data from any supported source format.
Code Reference
Source Location
- Repository: text-dedup
- File: src/text_dedup/data_sources/io.py
- Lines: L29-63
Signature
def load_dataset(config: Config) -> Dataset:
    """Load dataset based on config.input type.

    Dispatches to load_from_disk or hf_load_dataset, then adds __INDEX__ column.

    Parameters
    ----------
    config : Config
        Configuration with input source settings.

    Returns
    -------
    Dataset
        HuggingFace Dataset with __INDEX__ column added.
    """
Import
from text_dedup.data_sources.io import load_dataset
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config | Config | Yes | Configuration with input source settings (LocalInputConfig or LocalHFDatasetInputConfig) |
Outputs
| Name | Type | Description |
|---|---|---|
| Dataset | datasets.Dataset | HuggingFace Dataset with an added __INDEX__ column for document tracking |
Usage Examples
Loading from Pre-built Dataset
from pathlib import Path

from text_dedup.config.base import load_config_from_toml
from text_dedup.data_sources.io import load_dataset

# The input settings in the TOML config determine which loader is used.
config = load_config_from_toml(Path("configs/minhash.toml"))
ds = load_dataset(config)

print(ds.column_names)  # ['text', '__INDEX__', ...]
print(len(ds))  # Number of documents
Related Pages
Implements Principle
Requires Environment