Implementation:Datajuicer Data juicer OPEnvSpec And LazyLoader

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, DevOps
Last Updated 2026-02-14 17:00 GMT

Overview

Concrete tools, provided by the Data-Juicer framework, for managing operator dependencies through lazy loading and isolated environments.

Description

OPEnvSpec defines a per-operator environment specification: pip package requirements, environment variables, and a working directory. OPEnvManager uses it to create an isolated, uv-based virtual environment for each operator in Ray distributed mode.
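To make the shape of such a specification concrete, here is a minimal sketch of a container with the same fields as the OPEnvSpec signature on this page. The class name SketchEnvSpec and the to_runtime_env method are hypothetical and are not part of Data-Juicer. The output keys (pip, uv, env_vars, working_dir) follow the field names used by Ray's runtime_env; how OPEnvManager actually translates a spec is an assumption here.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class SketchEnvSpec:
    """Hypothetical stand-in for OPEnvSpec (not the Data-Juicer class)."""

    pip_pkgs: List[str] = field(default_factory=list)
    env_vars: Dict[str, str] = field(default_factory=dict)
    working_dir: Optional[str] = None
    backend: str = "uv"  # 'uv' or 'pip', as in the signature on this page

    def to_runtime_env(self) -> Dict[str, Any]:
        """Build a dict shaped like a Ray runtime_env (assumed mapping)."""
        if self.backend not in ("uv", "pip"):
            raise ValueError(f"unsupported backend: {self.backend!r}")
        env: Dict[str, Any] = {}
        if self.pip_pkgs:
            # The package list is stored under the key named after the backend.
            env[self.backend] = list(self.pip_pkgs)
        if self.env_vars:
            env["env_vars"] = dict(self.env_vars)
        if self.working_dir is not None:
            env["working_dir"] = self.working_dir
        return env


spec = SketchEnvSpec(pip_pkgs=["torch>=2.0"], env_vars={"HF_HOME": "/tmp/hf"})
runtime_env = spec.to_runtime_env()
```

Empty fields are omitted from the resulting dict, so an operator with no declared packages yields an empty environment description rather than an empty package list.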

LazyLoader provides deferred module imports: the actual import is delayed until the first attribute access. If the module is not installed, LazyLoader can install it automatically via pip or uv.
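The deferred-import idea can be illustrated with a small proxy built on importlib. This is a sketch of the general technique only: SketchLazyModule is a hypothetical name, it is not Data-Juicer's LazyLoader, and it leaves out the auto-install step.

```python
import importlib


class SketchLazyModule:
    """Hypothetical lazy proxy: imports the real module on first attribute access."""

    def __init__(self, module_name: str):
        self._module_name = module_name
        self._module = None  # filled in on first use

    def _load(self):
        if self._module is None:
            self._module = importlib.import_module(self._module_name)
        return self._module

    def __getattr__(self, name):
        # __getattr__ runs only when normal lookup fails, so _module_name and
        # _module are served directly and everything else is forwarded.
        return getattr(self._load(), name)


colorsys = SketchLazyModule("colorsys")
loaded_before = colorsys._module is not None  # nothing imported yet

rgb = colorsys.hsv_to_rgb(0.0, 1.0, 1.0)  # first access triggers the import
loaded_after = colorsys._module is not None
```

Creating the proxy costs almost nothing, so a heavy dependency is only paid for by the code paths that actually touch it.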

Usage

Set the _requirements class attribute on an operator to declare the packages its Ray isolated environment needs. Use LazyLoader for expensive or optional imports in operator files.

Code Reference

Source Location

  • Repository: data-juicer
  • File: data_juicer/ops/op_env.py (OPEnvSpec), data_juicer/utils/lazy_loader.py (LazyLoader)
  • Lines: op_env.py:L129-159, lazy_loader.py:L41-469

Signature

class OPEnvSpec:
    def __init__(
        self,
        pip_pkgs=None,
        env_vars=None,
        working_dir=None,
        backend='uv',
        **kwargs
    ):
        """
        Args:
            pip_pkgs: List of pip requirements (e.g. ['torch>=2.0']).
            env_vars: Dict of environment variables.
            working_dir: Working directory for the operator.
            backend: Package manager ('uv' or 'pip').
        """

class LazyLoader:
    def __init__(self, module_name: str):
        """
        Create a lazy module proxy.

        Args:
            module_name: Module to import on first access.
        """

Import

from data_juicer.ops.op_env import OPEnvSpec
from data_juicer.utils.lazy_loader import LazyLoader

I/O Contract

Inputs

Name         Type       Required          Description
pip_pkgs     List[str]  No                Pip requirement strings
module_name  str        Yes (LazyLoader)  Python module name to lazy-import

Outputs

Name          Type        Description
env_spec      OPEnvSpec   Environment specification for Ray isolated env
module_proxy  LazyLoader  Proxy that imports module on first use

Usage Examples

LazyLoader Usage

from data_juicer.utils.lazy_loader import LazyLoader

# Deferred import - does not actually import torch
torch = LazyLoader('torch')

# Later, first attribute access triggers real import
tensor = torch.zeros(3, 3)  # torch is imported here
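The Description says a missing module can be installed via pip or uv. The sketch below shows one way such a fallback could be planned: check whether the module is importable, and if not, build the install command. The function names install_command and plan_import are hypothetical, the command is only constructed and never executed, and LazyLoader's real logic may differ.

```python
import importlib.util
import sys
from typing import List, Optional


def install_command(package: str, backend: str = "uv") -> List[str]:
    """Build (but do not run) an install command for the current interpreter."""
    if backend == "uv":
        return ["uv", "pip", "install", "--python", sys.executable, package]
    if backend == "pip":
        return [sys.executable, "-m", "pip", "install", package]
    raise ValueError(f"unsupported backend: {backend!r}")


def plan_import(module_name: str, backend: str = "uv") -> Optional[List[str]]:
    """Return None if the module is importable, else the command to install it.

    Assumption: the package name equals the module name, which is not true
    for every library.
    """
    if importlib.util.find_spec(module_name) is not None:
        return None
    return install_command(module_name, backend)


already_there = plan_import("json")
missing = plan_import("definitely_not_a_real_module_xyz", backend="pip")
```

A caller could pass the returned list to subprocess.run and then retry the import. Keeping the planning step separate makes it easy to log or confirm the command first.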

OPEnvSpec for Ray Mode

from data_juicer.ops.base_op import OPERATORS, Mapper

@OPERATORS.register_module('my_ml_mapper')
class MyMLMapper(Mapper):
    # Declare dependencies for Ray isolated environment
    _requirements = ['torch>=2.0', 'transformers>=4.30', 'scipy']

    def process_single(self, sample):
        import torch  # Available in isolated env
        # ... process
        return sample
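As an illustration of how declared requirements might be consumed, the sketch below reads _requirements from operator classes and groups operators whose requirement sets are identical, so that one isolated environment could serve each group. BaseOp, PlainMapper, and group_by_requirements are hypothetical stand-ins. Whether OPEnvManager shares environments this way is an assumption, not something this page states.

```python
from typing import Dict, List, Tuple, Type


class BaseOp:
    """Hypothetical stand-in for a Data-Juicer operator base class."""

    _requirements: List[str] = []


class MyMLMapper(BaseOp):
    _requirements = ["torch>=2.0", "transformers>=4.30", "scipy"]


class AnotherMLMapper(BaseOp):
    # Same packages in a different order: treated as the same environment.
    _requirements = ["scipy", "torch>=2.0", "transformers>=4.30"]


class PlainMapper(BaseOp):
    pass  # no extra dependencies declared


def group_by_requirements(
    op_classes: List[Type[BaseOp]],
) -> Dict[Tuple[str, ...], List[str]]:
    """Map each distinct requirement set to the operator names that declare it."""
    groups: Dict[Tuple[str, ...], List[str]] = {}
    for cls in op_classes:
        # Sort and de-duplicate so ordering differences do not split groups.
        key = tuple(sorted(set(getattr(cls, "_requirements", []))))
        groups.setdefault(key, []).append(cls.__name__)
    return groups


groups = group_by_requirements([MyMLMapper, AnotherMLMapper, PlainMapper])
```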

Related Pages

Implements Principle

Requires Environment
