Implementation:ChenghaoMou Text dedup Config Loading

From Leeroopedia
Knowledge Sources
Domains Configuration, Data_Engineering
Last Updated 2026-02-14 21:00 GMT

Overview

Concrete tool, built on pydantic-settings, for loading and validating pipeline configuration from TOML files.

Description

The Config class is a Pydantic BaseSettings subclass that serves as the top-level configuration hub for all deduplication algorithms. It reads from TOML files and CLI arguments, then dispatches to algorithm-specific config subclasses via a discriminated union on the algorithm field. The load_config_from_toml helper allows programmatic loading from arbitrary TOML paths. Post-initialization hooks compute derived parameters (e.g., LSH bands/rows for MinHash, bit permutations for SimHash).

Usage

Import this when starting any deduplication pipeline. Use CliApp.run(Config) for CLI invocation or load_config_from_toml(path) for programmatic loading in benchmarks and tests.

Code Reference

Source Location

  • Repository: text-dedup
  • File: src/text_dedup/config/base.py
  • Lines: L16-65

Signature

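from pathlib import Path
from typing import Any

from pydantic_settings import BaseSettings, PydanticBaseSettingsSource, SettingsConfigDict

# InputConfigType, AlgoConfig, OutputConfigType, and DebugConfig are defined
# elsewhere in the text-dedup package (imports omitted here).


class Config(BaseSettings):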
    input: InputConfigType
    algorithm: AlgoConfig
    output: OutputConfigType
    debug: DebugConfig

    model_config = SettingsConfigDict(toml_file="config.toml")

    def model_post_init(self, context: Any) -> None:
        """Post-init hook: disables cluster saving for SuffixArray."""
        ...

    @classmethod
    def settings_customise_sources(
        cls,
        settings_cls: type[BaseSettings],
        init_settings: PydanticBaseSettingsSource,
        env_settings: PydanticBaseSettingsSource,
        dotenv_settings: PydanticBaseSettingsSource,
        file_secret_settings: PydanticBaseSettingsSource,
    ) -> tuple[PydanticBaseSettingsSource, ...]:
        ...


def load_config_from_toml(toml_path: Path) -> Config:
    """Load Config from a TOML file."""
    ...

Import

from text_dedup.config import Config
from text_dedup.config.base import load_config_from_toml

I/O Contract

Inputs

Name           Type         Required  Description
toml_file      str or Path  Yes       Path to TOML configuration file
CLI arguments  str          No        Command-line overrides via pydantic-settings CliApp

Outputs

Name    Type    Description
Config  Config  Fully validated configuration with computed derived parameters
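
As an illustration of what a computed derived parameter can look like, the sketch below derives LSH bands and rows from a similarity threshold and a permutation count by minimizing a weighted sum of false-positive and false-negative area under the banding S-curve. This is one common tuning approach, and whether text-dedup's post-init hook uses exactly this procedure is an assumption. The function names, the weights, and the integration step count are all hypothetical.

```python
def collision_probability(s: float, bands: int, rows: int) -> float:
    """Probability that two items with Jaccard similarity s share at least one band."""
    return 1.0 - (1.0 - s**rows) ** bands


def integrate(f, a: float, b: float, steps: int = 200) -> float:
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


def optimal_bands_rows(threshold: float, num_perm: int,
                       fp_weight: float = 0.5, fn_weight: float = 0.5):
    """Search all (bands, rows) with bands * rows <= num_perm for the lowest error."""
    best_error, best_bands, best_rows = float("inf"), 1, 1
    for bands in range(1, num_perm + 1):
        for rows in range(1, num_perm // bands + 1):
            # False positives: pairs below the threshold that still collide.
            fp = integrate(lambda s: collision_probability(s, bands, rows),
                           0.0, threshold)
            # False negatives: pairs above the threshold that never collide.
            fn = integrate(lambda s: 1.0 - collision_probability(s, bands, rows),
                           threshold, 1.0)
            error = fp_weight * fp + fn_weight * fn
            if error < best_error:
                best_error, best_bands, best_rows = error, bands, rows
    return best_bands, best_rows


bands, rows = optimal_bands_rows(threshold=0.7, num_perm=128)
print(bands, rows)
```

Computing such values once during validation means downstream pipeline code can read them from the config object instead of recomputing them.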

Usage Examples

CLI Invocation

from pydantic_settings import CliApp
from text_dedup.config import Config

# Parse from default config.toml + CLI overrides
config = CliApp.run(Config)

Programmatic Loading

from pathlib import Path
from text_dedup.config.base import load_config_from_toml

# Load from a specific TOML file
config = load_config_from_toml(Path("configs/minhash.toml"))
print(config.algorithm.algo_name)  # "minhash"
