Implementation:Datajuicer Data juicer Init Configs

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, Configuration_Management
Last Updated 2026-02-14 17:00 GMT

Overview

A concrete tool, provided by the Data-Juicer framework, for parsing and validating YAML-based pipeline configurations.

Description

The init_configs function is the central configuration entry point for all Data-Juicer pipelines. It uses jsonargparse to define a typed argument schema, merges CLI arguments with YAML config files, validates the result, and returns a Namespace object containing all pipeline settings. It handles executor type selection, operator process lists, dataset paths, export settings, and logging setup.
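
The merge-then-validate flow described above can be illustrated with a standard-library sketch. It is not Data-Juicer's implementation, which uses jsonargparse. The function name merge_settings, the default values, and the precedence order (built-in defaults, then file values, then CLI flags) are all assumptions made for illustration, and a plain dict stands in for the parsed YAML file.

```python
import argparse
from typing import Any, Dict, List, Optional


def merge_settings(file_cfg: Dict[str, Any],
                   cli_args: Optional[List[str]] = None) -> argparse.Namespace:
    """Merge defaults, file values, and CLI flags, then validate (illustrative only)."""
    # Hypothetical defaults; the real ones live in Data-Juicer's config.py.
    defaults: Dict[str, Any] = {"dataset_path": None, "export_path": None,
                                "np": 4, "process": []}

    # Reject keys that this toy schema does not know about.
    unknown = set(file_cfg) - set(defaults)
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")

    # default=SUPPRESS keeps flags the caller did not pass out of the result,
    # so they cannot overwrite values that came from the file.
    parser = argparse.ArgumentParser(prog="merge-sketch")
    parser.add_argument("--dataset_path", type=str, default=argparse.SUPPRESS)
    parser.add_argument("--export_path", type=str, default=argparse.SUPPRESS)
    parser.add_argument("--np", type=int, default=argparse.SUPPRESS)
    # Unlike init_configs, this sketch never falls back to sys.argv.
    cli_values = vars(parser.parse_args(cli_args if cli_args is not None else []))

    # Assumed precedence: defaults < file < CLI.
    merged = {**defaults, **file_cfg, **cli_values}

    # Minimal validation of the merged result.
    if not merged["dataset_path"]:
        raise ValueError("dataset_path is required")
    if not isinstance(merged["np"], int) or merged["np"] < 1:
        raise ValueError("np must be a positive integer")
    return argparse.Namespace(**merged)


cfg = merge_settings({"dataset_path": "in.jsonl", "np": 2}, ["--np", "8"])
print(cfg.np, cfg.dataset_path)  # 8 in.jsonl
```

In this sketch the CLI value for np wins over the file value, while dataset_path is taken from the file because no CLI flag overrides it.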

Usage

Import and call this function as the first step in any Data-Juicer pipeline. Pass a CLI-style argument list (e.g. ['--config', 'cfg.yaml']), or leave args as None to read from sys.argv. Use the which_entry parameter to customize defaults for different entry points (DefaultExecutor, Analyzer, RayExecutor).
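
As a rough picture of what "entry-specific defaults" could mean, the sketch below switches one default based on the type of the object passed in. Everything here is hypothetical: the stand-in classes, the entry_defaults function, the keep_stats key, and the values are invented for illustration and do not describe how init_configs actually inspects which_entry.

```python
from typing import Any, Dict, Optional


class ExecutorStandIn:
    """Hypothetical stand-in for an executor entry point."""


class AnalyzerStandIn:
    """Hypothetical stand-in for an analyzer entry point."""


def entry_defaults(which_entry: Optional[object] = None) -> Dict[str, Any]:
    """Return illustrative defaults that depend on the calling entry point."""
    # Made-up baseline defaults shared by every entry point.
    defaults: Dict[str, Any] = {"keep_stats": False}
    # isinstance matches instances only, not the class object itself.
    if isinstance(which_entry, AnalyzerStandIn):
        defaults["keep_stats"] = True
    return defaults


print(entry_defaults(AnalyzerStandIn()))  # {'keep_stats': True}
print(entry_defaults(ExecutorStandIn()))  # {'keep_stats': False}
print(entry_defaults(AnalyzerStandIn))    # {'keep_stats': False}
```

The last call shows that, with isinstance-style dispatch, passing a class rather than an instance falls through to the baseline defaults.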

Code Reference

Source Location

  • Repository: data-juicer
  • File: data_juicer/config/config.py
  • Lines: L101-220

Signature

def init_configs(
    args: Optional[List[str]] = None,
    which_entry: object = None,
    load_configs_only=False
) -> Namespace:
    """
    Initialize and validate pipeline configurations.

    Args:
        args: CLI argument list (e.g. ['--config', 'cfg.yaml']).
              If None, reads from sys.argv.
        which_entry: Executor or Analyzer instance for entry-specific defaults.
        load_configs_only: If True, skip logger/backup setup (for testing).

    Returns:
        Namespace object with all pipeline settings.
    """

Import

from data_juicer.config import init_configs

I/O Contract

Inputs

Name | Type | Required | Description
args | Optional[List[str]] | No | CLI argument list; if None, reads from sys.argv
which_entry | object | No | Executor/Analyzer instance for entry-specific defaults
load_configs_only | bool | No | Skip logger/backup setup when True (for testing)

Outputs

Name | Type | Description
cfg | Namespace | Validated config with dataset_path, process (op list), export_path, executor_type, work_dir, np, etc.
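
A small helper can make the returned object easier to inspect. The sketch below collects the fields named in the Outputs table into a dict and reports which ones are unset. The helper summarize_cfg is hypothetical, and a SimpleNamespace with made-up values stands in for the Namespace that init_configs would return, on the assumption that it supports ordinary attribute access.

```python
from types import SimpleNamespace
from typing import Any, Dict

# Field names taken from the Outputs table above.
EXPECTED_FIELDS = ("dataset_path", "process", "export_path",
                   "executor_type", "work_dir", "np")


def summarize_cfg(cfg: Any) -> Dict[str, Any]:
    """Return the listed fields as a dict, using None for any that are absent."""
    return {name: getattr(cfg, name, None) for name in EXPECTED_FIELDS}


# Stand-in object with made-up values; a real cfg would come from init_configs().
fake_cfg = SimpleNamespace(dataset_path="in.jsonl", export_path="out.jsonl",
                           process=[], executor_type="default", np=4)

summary = summarize_cfg(fake_cfg)
missing = [name for name, value in summary.items() if value is None]
print(missing)  # ['work_dir']
```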

Usage Examples

Basic YAML Config Loading

from data_juicer.config import init_configs

# Load from a YAML config file
cfg = init_configs(args=['--config', 'my_pipeline.yaml'])

# Access settings
print(cfg.dataset_path)    # e.g. '/data/input.jsonl'
print(cfg.export_path)     # e.g. '/data/output.jsonl'
print(cfg.process)         # List of operator dicts
print(cfg.executor_type)   # e.g. 'default'

Programmatic Configuration

from data_juicer.config import init_configs
from data_juicer.core.executor import DefaultExecutor

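# which_entry is documented above as an Executor/Analyzer instance.
# This example passes the DefaultExecutor class, since no executor exists
# until cfg has been built; the argument is optional and can be omitted.
cfg = init_configs(
    args=['--config', 'process.yaml'],
    which_entry=DefaultExecutor
)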

# Use config to create executor
executor = DefaultExecutor(cfg)
executor.run()

Related Pages

Implements Principle

Requires Environment
