Implementation:ChenghaoMou Text dedup Config Loading
| Knowledge Sources | |
|---|---|
| Domains | Configuration, Data_Engineering |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
Concrete tool, built on pydantic-settings, for loading and validating pipeline configuration from TOML files.
Description
The Config class is a Pydantic BaseSettings subclass that serves as the top-level configuration hub for all deduplication algorithms. It reads from TOML files and CLI arguments, then dispatches to algorithm-specific config subclasses via a discriminated union on the algorithm field. The load_config_from_toml helper allows programmatic loading from arbitrary TOML paths. Post-initialization hooks compute derived parameters (e.g., LSH bands/rows for MinHash, bit permutations for SimHash).
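The dispatch described above can be illustrated with a minimal sketch of a Pydantic discriminated union. The class names and the `num_perm` and `bit_width` fields are hypothetical and exist only for illustration; using `algo_name` as the tag is an assumption, and the real config classes in text-dedup may differ.

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field, ValidationError


class MinHashSketchConfig(BaseModel):
    # The Literal value is the tag Pydantic matches against.
    algo_name: Literal["minhash"] = "minhash"
    num_perm: int = 128  # hypothetical field


class SimHashSketchConfig(BaseModel):
    algo_name: Literal["simhash"] = "simhash"
    bit_width: int = 64  # hypothetical field


# The discriminator tells Pydantic which subclass to build from the tag value.
AlgoSketch = Annotated[
    Union[MinHashSketchConfig, SimHashSketchConfig],
    Field(discriminator="algo_name"),
]


class PipelineSketch(BaseModel):
    algorithm: AlgoSketch


cfg = PipelineSketch.model_validate(
    {"algorithm": {"algo_name": "simhash", "bit_width": 32}}
)

# An unknown tag is rejected during validation instead of being guessed.
try:
    PipelineSketch.model_validate({"algorithm": {"algo_name": "unknown"}})
    rejected = False
except ValidationError:
    rejected = True
```

Because the tag selects exactly one subclass, validation errors refer to the fields of the chosen algorithm only, rather than reporting failures against every member of the union.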
Usage
Import this when starting any deduplication pipeline. Use CliApp.run(Config) for CLI invocation or load_config_from_toml(path) for programmatic loading in benchmarks and tests.
Code Reference
Source Location
- Repository: text-dedup
- File: src/text_dedup/config/base.py
- Lines: L16-65
Signature
class Config(BaseSettings):
    input: InputConfigType
    algorithm: AlgoConfig
    output: OutputConfigType
    debug: DebugConfig
    model_config = SettingsConfigDict(toml_file="config.toml")

    def model_post_init(self, context: Any) -> None:
        """Post-init hook: disables cluster saving for SuffixArray."""
        ...

    @classmethod
    def settings_customise_sources(
        cls,
        settings_cls: type[BaseSettings],
        init_settings: PydanticBaseSettingsSource,
        env_settings: PydanticBaseSettingsSource,
        dotenv_settings: PydanticBaseSettingsSource,
        file_secret_settings: PydanticBaseSettingsSource,
    ) -> tuple[PydanticBaseSettingsSource, ...]:
        ...


def load_config_from_toml(toml_path: Path) -> Config:
    """Load Config from a TOML file."""
    ...
Import
from text_dedup.config import Config
from text_dedup.config.base import load_config_from_toml
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| toml_file | str or Path | Yes | Path to TOML configuration file |
| CLI arguments | str | No | Command-line overrides via pydantic-settings CliApp |
Outputs
| Name | Type | Description |
|---|---|---|
| Config | Config | Fully validated configuration with computed derived parameters |
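To make "computed derived parameters" concrete, here is a minimal sketch of a post-init hook that fills in a derived value after validation. The model and its `num_perm`, `bands`, and `rows` fields are hypothetical, and the integer division is a deliberately simple stand-in; the library may derive its LSH parameters differently.

```python
from typing import Any, Optional

from pydantic import BaseModel


class MinHashParamsSketch(BaseModel):
    num_perm: int = 128         # hypothetical: number of hash permutations
    bands: int = 32             # hypothetical: number of LSH bands
    rows: Optional[int] = None  # derived in model_post_init, not set by users

    def model_post_init(self, context: Any) -> None:
        # Runs after field validation, so both inputs are already integers.
        self.rows = self.num_perm // self.bands


params = MinHashParamsSketch(num_perm=128, bands=16)
```

Callers then read `params.rows` like any other field, without needing to know it was computed rather than configured.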
Usage Examples
CLI Invocation
from pydantic_settings import CliApp
from text_dedup.config import Config
# Parse from default config.toml + CLI overrides
config = CliApp.run(Config)
Programmatic Loading
from pathlib import Path
from text_dedup.config.base import load_config_from_toml
# Load from a specific TOML file
config = load_config_from_toml(Path("configs/minhash.toml"))
print(config.algorithm.algo_name) # "minhash"