Implementation:ChenghaoMou Text dedup Load Dataset

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, NLP
Last Updated 2026-02-14 21:00 GMT

Overview

Concrete tool, provided by text-dedup, for loading datasets from local files or pre-built HuggingFace Dataset directories into indexed Dataset objects.

Description

The load_dataset function dispatches to either load_from_disk (for pre-built HuggingFace Dataset directories) or hf_load_dataset (for Parquet/CSV/JSON files), depending on the type of the input configuration. After loading, it adds an __INDEX__ column via Dataset.map so that each document keeps a stable identifier throughout the pipeline.
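The dispatch-then-index behaviour described above can be sketched in plain Python. This is a minimal illustration, not the text-dedup source: the two config classes and their fields, the load_rows helper, and the fake loaders are all hypothetical stand-ins, and rows are plain dictionaries rather than a HuggingFace Dataset. Only the __INDEX__ column name and the two config class names come from this page.

```python
from dataclasses import dataclass
from typing import Any, Callable

Row = dict[str, Any]
INDEX_COLUMN = "__INDEX__"  # column name taken from the docstring on this page


@dataclass
class LocalHFDatasetInputConfig:
    """Hypothetical stand-in: points at a pre-built dataset directory."""
    path: str


@dataclass
class LocalInputConfig:
    """Hypothetical stand-in: points at raw Parquet/CSV/JSON files."""
    file_type: str
    data_files: list[str]


def load_rows(
    input_config: Any,
    from_disk: Callable[[str], list[Row]],
    from_files: Callable[[str, list[str]], list[Row]],
) -> list[Row]:
    """Pick a loader from the config type, then attach a positional index."""
    if isinstance(input_config, LocalHFDatasetInputConfig):
        rows = from_disk(input_config.path)
    elif isinstance(input_config, LocalInputConfig):
        rows = from_files(input_config.file_type, input_config.data_files)
    else:
        raise TypeError(f"unsupported input config: {type(input_config).__name__}")
    # Mirrors the indexing step: each row records its original position.
    return [{**row, INDEX_COLUMN: i} for i, row in enumerate(rows)]


# Fake loaders (assumptions) so the sketch runs without any files on disk.
def fake_from_disk(path: str) -> list[Row]:
    return [{"text": f"doc from {path}"}]


def fake_from_files(file_type: str, data_files: list[str]) -> list[Row]:
    return [{"text": f"{file_type}:{name}"} for name in data_files]


rows = load_rows(
    LocalInputConfig("parquet", ["a.parquet", "b.parquet"]),
    fake_from_disk,
    fake_from_files,
)
print(rows)
```

Passing the loaders in as arguments is a choice made for this sketch only, so the branching can be exercised without the datasets package installed. With a real HuggingFace Dataset, the indexing step would typically be a Dataset.map call with with_indices=True.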

Usage

Import this function when implementing a deduplication pipeline that needs to load text data from any supported source format.

Code Reference

Source Location

  • Repository: text-dedup
  • File: src/text_dedup/data_sources/io.py
  • Lines: L29-63

Signature

def load_dataset(config: Config) -> Dataset:
    """Load dataset based on config.input type.

    Dispatches to load_from_disk or hf_load_dataset, then adds __INDEX__ column.

    Parameters
    ----------
    config : Config
        Configuration with input source settings.

    Returns
    -------
    Dataset
        HuggingFace Dataset with __INDEX__ column added.
    """

Import

from text_dedup.data_sources.io import load_dataset

I/O Contract

Inputs

Name    Type    Required  Description
config  Config  Yes       Configuration with input source settings (LocalInputConfig or LocalHFDatasetInputConfig)

Outputs

Name     Type              Description
Dataset  datasets.Dataset  HuggingFace Dataset with an added __INDEX__ column for document tracking
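To illustrate why a persistent index column helps with document tracking, the sketch below runs a toy exact-match filter over indexed rows. It is an assumption-laden illustration: the rows are plain dictionaries standing in for a Dataset, and drop_exact_duplicates is a hypothetical helper, not one of text-dedup's deduplication algorithms.

```python
from typing import Any

Row = dict[str, Any]
INDEX_COLUMN = "__INDEX__"  # column name taken from the docstring on this page


def drop_exact_duplicates(rows: list[Row]) -> list[Row]:
    """Toy filter (hypothetical): keep the first row for each distinct text."""
    seen: set[str] = set()
    kept: list[Row] = []
    for row in rows:
        if row["text"] not in seen:
            seen.add(row["text"])
            kept.append(row)
    return kept


docs = ["a cat", "a cat", "a dog", "a bird"]
# Index the rows once, right after loading.
rows = [{"text": text, INDEX_COLUMN: i} for i, text in enumerate(docs)]

kept = drop_exact_duplicates(rows)

# Row positions shift after filtering, but the index column still names the
# original documents, so the removed ones can be reported precisely.
kept_ids = [row[INDEX_COLUMN] for row in kept]
removed_ids = sorted(set(range(len(rows))) - set(kept_ids))
print("kept:", kept_ids, "removed:", removed_ids)
```

After filtering, "a dog" sits at position 1 in the kept list but still carries index 2, which is what lets later stages refer back to the original input.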

Usage Examples

Loading from Pre-built Dataset

from pathlib import Path

from text_dedup.config.base import load_config_from_toml
from text_dedup.data_sources.io import load_dataset

# Read the pipeline configuration, including the input source settings.
config = load_config_from_toml(Path("configs/minhash.toml"))

# Load the data; the returned Dataset carries the added __INDEX__ column.
ds = load_dataset(config)
print(ds.column_names)  # ['text', '__INDEX__', ...]
print(len(ds))          # Number of documents

Related Pages

Implements Principle

Requires Environment
