Implementation:Datajuicer Data juicer OPEnvSpec And LazyLoader
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, DevOps |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Concrete tools, provided by the Data-Juicer framework, for managing operator dependencies through lazy loading and isolated environments.
Description
OPEnvSpec defines per-operator environment specifications including pip package requirements, environment variables, and working directory. It is used by the OPEnvManager to create isolated uv-based virtual environments for each operator in Ray distributed mode.
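To make the shape of such a specification concrete, here is a minimal, self-contained sketch. EnvSpecSketch and to_runtime_env are hypothetical names and are not part of Data-Juicer. The fields mirror the OPEnvSpec arguments listed under Signature, and the output keys (pip, uv, env_vars, working_dir) follow Ray's runtime_env conventions. How OPEnvManager actually turns a spec into an environment is not documented here and may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class EnvSpecSketch:
    """Hypothetical stand-in for OPEnvSpec; not the real class."""

    pip_pkgs: List[str] = field(default_factory=list)
    env_vars: Dict[str, str] = field(default_factory=dict)
    working_dir: Optional[str] = None
    backend: str = "uv"  # 'uv' or 'pip', as in the documented signature

    def to_runtime_env(self) -> dict:
        """Build a dict shaped like a Ray runtime_env (assumed mapping)."""
        if self.backend not in ("uv", "pip"):
            raise ValueError(f"unsupported backend: {self.backend!r}")
        runtime_env: dict = {}
        if self.pip_pkgs:
            # Assumption: the backend name doubles as the runtime_env key.
            runtime_env[self.backend] = list(self.pip_pkgs)
        if self.env_vars:
            runtime_env["env_vars"] = dict(self.env_vars)
        if self.working_dir is not None:
            runtime_env["working_dir"] = self.working_dir
        return runtime_env


# Example: an operator that needs torch and scipy plus one env variable.
spec = EnvSpecSketch(
    pip_pkgs=["torch>=2.0", "scipy"],
    env_vars={"TOKENIZERS_PARALLELISM": "false"},
)
runtime_env = spec.to_runtime_env()
print(runtime_env)
```

Keeping the specification as plain data, separate from the code that builds the environment, means it can be compared, logged, or serialized before any packages are installed.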
LazyLoader provides deferred module imports: the actual import is delayed until the first attribute access. If the module is not installed, LazyLoader can auto-install it via pip or uv.
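The deferred-import idea can be sketched with the standard library alone. LazyModuleSketch below is a hypothetical, simplified proxy and not the Data-Juicer implementation: it covers only the "import on first attribute access" behavior and deliberately omits the auto-install step.

```python
import importlib
from types import ModuleType
from typing import Optional


class LazyModuleSketch:
    """Hypothetical proxy that imports a module on first attribute access."""

    def __init__(self, module_name: str):
        self._module_name = module_name
        self._module: Optional[ModuleType] = None  # nothing imported yet

    def _load(self) -> ModuleType:
        if self._module is None:
            # The real import happens here, not at construction time.
            self._module = importlib.import_module(self._module_name)
        return self._module

    def __getattr__(self, name: str):
        # Python calls __getattr__ only when normal lookup fails, so the
        # proxy's own attributes (_module_name, _module) never reach it.
        return getattr(self._load(), name)


# Creating the proxy does not import the module.
fractions = LazyModuleSketch("fractions")
loaded_before = fractions._module is not None

# The first attribute access triggers the import.
half = fractions.Fraction(1, 2)
loaded_after = fractions._module is not None

print(loaded_before, loaded_after, half)
```

Because the proxy forwards every unknown attribute to the loaded module, call sites look the same as with a regular import; only the moment at which the import cost is paid changes.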
Usage
Set the _requirements class attribute on operators that should run in Ray isolated environments. Use LazyLoader for expensive or optional imports in operator files.
Code Reference
Source Location
- Repository: data-juicer
- File: data_juicer/ops/op_env.py (OPEnvSpec), data_juicer/utils/lazy_loader.py (LazyLoader)
- Lines: op_env.py:L129-159, lazy_loader.py:L41-469
Signature
class OPEnvSpec:
    def __init__(
        self,
        pip_pkgs=None,
        env_vars=None,
        working_dir=None,
        backend='uv',
        **kwargs
    ):
        """
        Args:
            pip_pkgs: List of pip requirements (e.g. ['torch>=2.0']).
            env_vars: Dict of environment variables.
            working_dir: Working directory for the operator.
            backend: Package manager ('uv' or 'pip').
        """


class LazyLoader:
    def __init__(self, module_name: str):
        """
        Create a lazy module proxy.

        Args:
            module_name: Module to import on first access.
        """
Import
from data_juicer.ops.op_env import OPEnvSpec
from data_juicer.utils.lazy_loader import LazyLoader
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| pip_pkgs | List[str] | No | Pip requirement strings |
| module_name | str | Yes (LazyLoader) | Python module name to lazy-import |
Outputs
| Name | Type | Description |
|---|---|---|
| env_spec | OPEnvSpec | Environment specification for Ray isolated env |
| module_proxy | LazyLoader | Proxy that imports module on first use |
Usage Examples
LazyLoader Usage
from data_juicer.utils.lazy_loader import LazyLoader
# Deferred import - does not actually import torch
torch = LazyLoader('torch')
# Later, first attribute access triggers real import
tensor = torch.zeros(3, 3) # torch is imported here
OPEnvSpec for Ray Mode
from data_juicer.ops.base_op import OPERATORS, Mapper


@OPERATORS.register_module('my_ml_mapper')
class MyMLMapper(Mapper):
    # Declare dependencies for Ray isolated environment
    _requirements = ['torch>=2.0', 'transformers>=4.30', 'scipy']

    def process_single(self, sample):
        import torch  # Available in isolated env
        # ... process
        return sample