Implementation:Datajuicer Data juicer Load Ops

Knowledge Sources
Domains: Data_Engineering, Design_Patterns
Last Updated: 2026-02-14 17:00 GMT

Overview

Concrete tool, provided by the Data-Juicer framework, for dynamically instantiating data processing operators from a configuration list.

Description

The load_ops function iterates over a process list (a list of dicts, each mapping an operator name to a dict of parameters). For each entry it looks up the operator class in the OPERATORS registry, instantiates it with the configured arguments, and attaches the raw config entry to the instance as _op_cfg for checkpoint tracking. It also accepts an optional environment manager, used in Ray mode to run operators in isolated environments.
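The lookup-and-instantiate pattern described above can be sketched without Data-Juicer installed. This is a minimal illustration, not the framework's source: TOY_REGISTRY, register, ToyTextLengthFilter, and toy_load_ops are hypothetical stand-ins, and the min_len/max_len parameters are assumed for the example. Only the _op_cfg attribute and the {op_name: {param: value}} entry shape come from the description.

```python
# Hypothetical stand-in for the framework's OPERATORS registry.
TOY_REGISTRY = {}


def register(name):
    """Class decorator that records an operator class under `name`."""
    def decorator(cls):
        cls._name = name
        TOY_REGISTRY[name] = cls
        return cls
    return decorator


@register("text_length_filter")
class ToyTextLengthFilter:
    # Parameter names are assumptions made for this sketch.
    def __init__(self, min_len=0, max_len=10_000):
        self.min_len = min_len
        self.max_len = max_len


def toy_load_ops(process_list):
    """Instantiate one operator per entry of `process_list`."""
    ops = []
    for process in process_list:
        # Each entry is expected to hold one pair: {op_name: {param: value}}.
        op_name, args = next(iter(process.items()))
        op_cls = TOY_REGISTRY[op_name]   # raises KeyError for unknown names
        op = op_cls(**(args or {}))      # tolerate an empty/None parameter dict
        op._op_cfg = process             # keep the raw config on the instance
        ops.append(op)
    return ops


process_list = [{"text_length_filter": {"min_len": 10, "max_len": 200}}]
ops = toy_load_ops(process_list)
```

Keeping the raw entry on each instance means a later stage can compare the configuration an operator was built from against a saved one, which is the kind of bookkeeping checkpoint tracking relies on.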

Usage

Call this function after init_configs to convert cfg.process into a list of executable operator instances. The returned list is passed to dataset processing or analysis functions.

Code Reference

Source Location

  • Repository: data-juicer
  • File: data_juicer/ops/load.py
  • Lines: L1-42

Signature

def load_ops(process_list: list, op_env_manager=None) -> list:
    """
    Load and instantiate operators from a process list.

    Args:
        process_list: List of dicts [{op_name: {param: value}}, ...] from config.
        op_env_manager: Optional OPEnvManager for Ray isolated environments.

    Returns:
        List of instantiated OP objects (Filter, Mapper, Deduplicator, Selector).
    """

Import

from data_juicer.ops import load_ops

I/O Contract

Inputs

Name | Type | Required | Description
process_list | list | Yes | List of dicts from cfg.process, each mapping op name to kwargs
op_env_manager | OPEnvManager | No | Environment spec manager for Ray mode
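To make the expected shape of process_list concrete, the sketch below builds one by hand and validates it with a small pre-flight check. The helper check_process_list is hypothetical (it is not part of Data-Juicer), and the parameter names min_len and max_len are assumed for illustration; the operator name text_length_filter is taken from the usage example on this page.

```python
def check_process_list(process_list):
    """Hypothetical pre-flight check: return operator names in order.

    Raises ValueError if an entry is not a single-key dict whose value
    is a dict of keyword arguments.
    """
    names = []
    for i, entry in enumerate(process_list):
        if not isinstance(entry, dict) or len(entry) != 1:
            raise ValueError(f"entry {i} must be a single-key dict, got {entry!r}")
        # Unpack the only (name, kwargs) pair in the entry.
        (op_name, kwargs), = entry.items()
        if not isinstance(kwargs, dict):
            raise ValueError(f"kwargs for {op_name!r} must be a dict, got {kwargs!r}")
        names.append(op_name)
    return names


# One entry per operator, in execution order.
process_list = [
    {"text_length_filter": {"min_len": 10, "max_len": 200}},
]
op_names = check_process_list(process_list)

# A malformed entry (two operators in one dict) is rejected.
try:
    check_process_list([{"op_a": {}, "op_b": {}}])
    rejected = False
except ValueError:
    rejected = True
```

A check like this fails early with a readable message instead of surfacing later as a lookup or constructor error.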

Outputs

Name | Type | Description
operators | List[OP] | Instantiated operator objects with _op_cfg attached

Usage Examples

Standard Operator Loading

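from data_juicer.config import init_configs
from data_juicer.ops import load_ops

# Parse the config file into a config object.
cfg = init_configs(args=['--config', 'pipeline.yaml'])
# Turn the configured process list into operator instances.
operators = load_ops(cfg.process)

# Print each operator's registered name and its class.
for op in operators:
    print(f"{op._name}: {type(op).__name__}")
# e.g. "text_length_filter: TextLengthFilter"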
