Implementation:Datajuicer Data juicer Init Configs
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Configuration_Management |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
init_configs is the Data-Juicer function that parses and validates YAML-based pipeline configurations.
Description
The init_configs function is the central configuration entry point for all Data-Juicer pipelines. It uses jsonargparse to define a typed argument schema, merges CLI arguments with YAML config files, validates the result, and returns a Namespace object containing all pipeline settings. It handles executor type selection, operator process lists, dataset paths, export settings, and logging setup.
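The merge-then-validate flow described above can be sketched with the standard library. This is a minimal illustration, not Data-Juicer's implementation: it uses argparse as a stand-in for jsonargparse, takes the config file as an already-loaded dict instead of reading YAML, and the helper name parse_sketch is hypothetical. In this sketch explicit CLI flags always override file values; the precedence in the real function is decided by jsonargparse.

```python
import argparse
from typing import Any, Dict, List, Optional


def parse_sketch(file_cfg: Dict[str, Any], args: Optional[List[str]] = None) -> argparse.Namespace:
    """Merge already-loaded config-file values with CLI arguments, then validate."""
    parser = argparse.ArgumentParser()
    # Typed schema with hard-coded defaults (field names mirror settings named on this page).
    parser.add_argument("--dataset_path", type=str, default=None)
    parser.add_argument("--export_path", type=str, default=None)
    parser.add_argument("--executor_type", type=str, default="default")
    parser.add_argument("--np", type=int, default=1)
    # File values replace the hard-coded defaults; explicit CLI flags still win.
    parser.set_defaults(**file_cfg)
    # With args=None, argparse falls back to sys.argv, as init_configs is described to do.
    cfg = parser.parse_args(args)
    # Validate the merged result (argparse does not type-check non-string defaults).
    if cfg.dataset_path is None:
        raise ValueError("dataset_path must be set in the config file or on the CLI")
    if not isinstance(cfg.np, int) or cfg.np < 1:
        raise ValueError("np must be a positive integer")
    return cfg


# Assumed file contents, standing in for a parsed YAML config.
file_cfg = {"dataset_path": "/data/input.jsonl", "np": 2}
cfg = parse_sketch(file_cfg, ["--np", "4", "--export_path", "/data/output.jsonl"])
print(cfg.dataset_path, cfg.export_path, cfg.np, cfg.executor_type)
```

Here np comes from the CLI, dataset_path from the file values, and executor_type from the hard-coded default, which shows the three sources being combined into one Namespace.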
Usage
Import and call this function as the first step in any Data-Juicer pipeline. Pass the CLI arguments as a list, typically including --config followed by a YAML config path; when args is None, the arguments are read from sys.argv. Use the which_entry parameter to customize defaults for different entry points (DefaultExecutor, Analyzer, RayExecutor).
Code Reference
Source Location
- Repository: data-juicer
- File: data_juicer/config/config.py
- Lines: L101-220
Signature
def init_configs(
    args: Optional[List[str]] = None,
    which_entry: object = None,
    load_configs_only=False
) -> Namespace:
    """
    Initialize and validate pipeline configurations.

    Args:
        args: CLI argument list (e.g. ['--config', 'cfg.yaml']).
            If None, reads from sys.argv.
        which_entry: Executor or Analyzer instance for entry-specific defaults.
        load_configs_only: If True, skip logger/backup setup (for testing).

    Returns:
        Namespace object with all pipeline settings.
    """
Import
from data_juicer.config import init_configs
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| args | Optional[List[str]] | No | CLI argument list; defaults to None, in which case sys.argv is read |
| which_entry | object | No | Executor/Analyzer instance for entry-specific defaults |
| load_configs_only | bool | No | When True, skip side-effect setup (logger and config backup); intended for testing |
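The pattern behind load_configs_only can be sketched as a guard around the side-effect step. This is a hypothetical, simplified stand-in rather than Data-Juicer's code: setup_sketch, the backup file name, and the use of SimpleNamespace in place of the real Namespace are all assumptions made for illustration.

```python
import os
import tempfile
from types import SimpleNamespace


def setup_sketch(cfg, load_configs_only=False):
    """Hypothetical stand-in for the setup step that follows parsing."""
    if load_configs_only:
        # Pure path: return the settings untouched, nothing is written to disk.
        return cfg
    # Side-effect path: create the work directory and back up the settings.
    os.makedirs(cfg.work_dir, exist_ok=True)
    backup_path = os.path.join(cfg.work_dir, "config_backup.txt")
    with open(backup_path, "w", encoding="utf-8") as f:
        f.write(repr(vars(cfg)))
    return cfg


base = tempfile.mkdtemp()
dry = setup_sketch(SimpleNamespace(work_dir=os.path.join(base, "dry")), load_configs_only=True)
full = setup_sketch(SimpleNamespace(work_dir=os.path.join(base, "full")))
print(os.path.exists(dry.work_dir), os.path.exists(full.work_dir))  # False True
```

Keeping the side effects behind a single flag lets tests obtain a config object without creating directories or log files.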
Outputs
| Name | Type | Description |
|---|---|---|
| cfg | Namespace | Validated config with dataset_path, process (op list), export_path, executor_type, work_dir, np, etc. |
Usage Examples
Basic YAML Config Loading
from data_juicer.config import init_configs
# Load from a YAML config file
cfg = init_configs(args=['--config', 'my_pipeline.yaml'])
# Access settings
print(cfg.dataset_path) # e.g. '/data/input.jsonl'
print(cfg.export_path) # e.g. '/data/output.jsonl'
print(cfg.process) # List of operator dicts
print(cfg.executor_type) # e.g. 'default'
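As a rough illustration of working with the operator list, the sketch below walks a process value shaped as a list of single-key dicts mapping an operator name to its parameters. That shape, the helper name iter_ops, and the example operators and parameters are assumptions for illustration; check them against the operators configured in your own YAML file.

```python
from typing import Any, Dict, List, Tuple


def iter_ops(process: List[Dict[str, Any]]) -> List[Tuple[str, Dict[str, Any]]]:
    """Flatten a list of single-key operator dicts into (name, params) pairs."""
    ops = []
    for entry in process:
        if len(entry) != 1:
            raise ValueError(f"expected one operator per entry, got {list(entry)}")
        ((name, params),) = entry.items()
        # An operator listed without parameters is treated as having none.
        ops.append((name, params or {}))
    return ops


# Illustrative stand-in for cfg.process.
example_process = [
    {"text_length_filter": {"min_len": 10, "max_len": 1000}},
    {"whitespace_normalization_mapper": None},
]
for name, params in iter_ops(example_process):
    print(name, params)
```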
Programmatic Configuration
from data_juicer.config import init_configs
from data_juicer.core.executor import DefaultExecutor
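# Pass the entry point so that entry-specific defaults apply.
# Note: the docstring describes which_entry as an instance; the class is
# passed here because the executor is only constructed afterwards.
cfg = init_configs(
    args=['--config', 'process.yaml'],
    which_entry=DefaultExecutor
)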
# Use config to create executor
executor = DefaultExecutor(cfg)
executor.run()