Implementation:Datajuicer Data juicer Load Ops
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Design_Patterns |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
A concrete tool, provided by the Data-Juicer framework, for dynamically instantiating data processing operators from a configuration list.
Description
The load_ops function iterates over a process list (a list of dicts mapping operator names to parameter dicts), looks up each operator in the OPERATORS registry, instantiates it with the configured arguments, and attaches the raw config as _op_cfg for checkpoint tracking. It optionally supports an environment manager for Ray-mode isolated environments.
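The registry-driven loop described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in data_juicer/ops/load.py: the Registry class, the stand-in TextLengthFilter class with its min_len and max_len parameters, and the load_ops_sketch name are assumptions made for the example, and the optional environment manager is left out.

```python
# Illustrative stand-in for a registry-driven loader; NOT Data-Juicer's source.


class Registry:
    """Minimal name -> class registry (hypothetical stand-in for OPERATORS)."""

    def __init__(self):
        self.modules = {}

    def register(self, name):
        # Return a class decorator that stores the class under `name`.
        def decorator(cls):
            self.modules[name] = cls
            return cls

        return decorator


OPERATORS = Registry()


@OPERATORS.register("text_length_filter")
class TextLengthFilter:
    """Toy operator; the parameters are assumptions for this example."""

    def __init__(self, min_len=0, max_len=10_000):
        self.min_len = min_len
        self.max_len = max_len


def load_ops_sketch(process_list):
    """Instantiate one operator per entry of `process_list`."""
    ops = []
    for entry in process_list:
        # Each entry is a single-key dict: {op_name: {param: value}}.
        op_name, args = next(iter(entry.items()))
        if op_name not in OPERATORS.modules:
            raise KeyError(f"Unknown operator: {op_name}")
        op = OPERATORS.modules[op_name](**(args or {}))
        # Keep the raw config on the instance, as the description says
        # is done for checkpoint tracking.
        op._op_cfg = {op_name: args}
        ops.append(op)
    return ops


ops = load_ops_sketch([{"text_length_filter": {"min_len": 10, "max_len": 500}}])
print(type(ops[0]).__name__, ops[0]._op_cfg)
```

Keeping the lookup in a registry means a new operator only needs to register itself under a name; the loader does not change.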
Usage
Call this function after init_configs to convert cfg.process into a list of executable operator instances. The returned list is passed to dataset processing or analysis functions.
Code Reference
Source Location
- Repository: data-juicer
- File: data_juicer/ops/load.py
- Lines: L1-42
Signature
def load_ops(process_list: list, op_env_manager=None) -> list:
    """
    Load and instantiate operators from a process list.

    Args:
        process_list: List of dicts [{op_name: {param: value}}, ...] from config.
        op_env_manager: Optional OPEnvManager for Ray isolated environments.

    Returns:
        List of instantiated OP objects (Filter, Mapper, Deduplicator, Selector).
    """
Import
from data_juicer.ops import load_ops
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| process_list | list | Yes | List of dicts from cfg.process, each mapping op name to kwargs |
| op_env_manager | OPEnvManager | No | Environment spec manager for Ray mode |
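To make the expected shape of process_list concrete, the sketch below builds one by hand and checks that every entry is a single-key dict. The operator names and parameters are illustrative assumptions, and check_process_list is a hypothetical helper written for this example, not part of Data-Juicer. An entry whose value is None stands for an operator configured without parameters.

```python
# Hand-written equivalent of a `process` section; names and values are
# illustrative assumptions, not taken from a real pipeline.
process_list = [
    {"text_length_filter": {"min_len": 10, "max_len": 500}},
    {"whitespace_normalization_mapper": None},  # no parameters given
]


def check_process_list(process_list):
    """Return (op_name, kwargs) pairs, raising on malformed entries.

    Hypothetical helper for illustration only.
    """
    pairs = []
    for i, entry in enumerate(process_list):
        if not isinstance(entry, dict) or len(entry) != 1:
            raise ValueError(f"entry {i} must be a dict with exactly one key")
        op_name, kwargs = next(iter(entry.items()))
        # Treat a missing parameter dict as "use the operator's defaults".
        pairs.append((op_name, dict(kwargs or {})))
    return pairs


pairs = check_process_list(process_list)
for op_name, kwargs in pairs:
    print(op_name, kwargs)
```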
Outputs
| Name | Type | Description |
|---|---|---|
| operators | List[OP] | Instantiated operator objects with _op_cfg attached |
Usage Examples
Standard Operator Loading
from data_juicer.config import init_configs
from data_juicer.ops import load_ops
cfg = init_configs(args=['--config', 'pipeline.yaml'])
operators = load_ops(cfg.process)
for op in operators:
    print(f"{op._name}: {type(op).__name__}")
    # e.g. "text_length_filter: TextLengthFilter"