
Heuristic: Data-Juicer Batch Size Adaptation

From Leeroopedia
Domains Optimization, Resource_Management, Data_Processing
Last Updated 2026-02-14 17:00 GMT

Overview

An adaptive batch-sizing strategy that probes resource utilization to calculate per-operator batch sizes from CPU/memory load, with a 100x gap between the CPU default (1,000) and the CUDA default (10).

Description

Data-Juicer uses a two-phase approach to batch size optimization. First, a small batch probe executes each operator while monitoring resource utilization at 0.5-second intervals. Then, a load factor is computed from the bottleneck resource (highest utilization excluding GPU) to scale the batch size. The system enforces a global maximum of 10,000 samples per batch and uses a 90% utilization threshold to prevent system overload. GPU operators default to batch_size=10 (vs 1000 for CPU) due to VRAM constraints.
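The two phases can be condensed into a small sketch; the function and variable names here are illustrative, not Data-Juicer's API:

```python
def adapt_batch_size(probe_peaks, idle, util_th=0.9, base_bs=1000, max_bs=10_000):
    """Sketch of the two-phase scheme: phase 1 supplies per-resource peak
    utilizations from a probe run; phase 2 scales the base batch size by
    the headroom left at the bottleneck (non-GPU) resource."""
    # utilization attributable to the operator itself, GPU excluded
    used = {k: max(0.0, probe_peaks[k] - idle.get(k, 0.0))
            for k in probe_peaks if "GPU" not in k}
    bottleneck = max(used, key=used.get)
    headroom = max(0.0, util_th - idle.get(bottleneck, 0.0))
    load_factor = headroom / max(used[bottleneck], 1e-5)  # avoid div-by-zero
    return min(max(int(base_bs * load_factor), 1), max_bs)
```

For example, with an idle CPU utilization of 20% and a probed peak of 60%, the operator's own usage is 0.4, the headroom below the 90% threshold is 0.7, and the batch size scales by 1.75.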

Usage

Use this heuristic when tuning data processing pipeline performance. The adaptive system runs automatically when the Adapter is enabled, but understanding the defaults helps with manual tuning. Increase batch size for simple text operators; decrease for GPU-heavy operators or when encountering OOM errors.

The Insight (Rule of Thumb)

  • Default Batch Sizes:
    • CPU operators: 1,000 samples (DEFAULT_BATCH_SIZE)
    • CUDA operators: 10 samples (100x smaller due to GPU memory)
    • Maximum: 10,000 samples (hard cap)
    • Minimum: 1 sample
  • Adaptive Probing:
    • Small batch executed per operator with resource monitoring
    • Resource sampling every 0.5 seconds
    • Dataset expanded by `num_proc` (worker count) during probing to simulate real load
    • Probe batch size set to full sample count for peak utilization measurement
  • Load Factor Calculation:
    • Utilization threshold: 90% (default)
    • Bottleneck resource: highest non-GPU utilization metric
    • Load factor = available_headroom / bottleneck_utilization
    • Final batch size = base_batch_size * load_factor, bounded to [1, 10000]
  • Statistical Insight Mining:
    • t-test (p-value < 0.05) to detect significant distribution changes between operators
    • Helps identify which operators materially transform the data
  • Multiprocessing Context:
    • CUDA operators: Must use `forkserver` or `spawn` (fork unsafe with CUDA)
    • Unforkable operators: Also use `forkserver` or `spawn`
    • All others: Default `fork` method
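The start-method rules in the last bullet can be sketched with the standard library's `multiprocessing` module; the helper name and the `is_forkable` flag are illustrative, not Data-Juicer's API:

```python
import multiprocessing as mp

def pick_mp_context(accelerator: str, is_forkable: bool = True):
    """Choose a multiprocessing start method for an operator.

    Forking a process that holds an initialized CUDA context is unsafe,
    so CUDA operators (and operators flagged as unforkable) must start
    workers from a clean interpreter via 'forkserver' or 'spawn'.
    """
    available = mp.get_all_start_methods()
    if accelerator == "cuda" or not is_forkable:
        method = "forkserver" if "forkserver" in available else "spawn"
    else:
        # e.g. Windows offers no 'fork' at all
        method = "fork" if "fork" in available else "spawn"
    return mp.get_context(method)
```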

Reasoning

The 100x difference between CPU and CUDA defaults reflects the fundamental asymmetry between CPU and GPU memory architectures. CPUs have large addressable RAM (often 64GB+) while GPUs have limited VRAM (8-24GB typical). Processing 1000 samples simultaneously on GPU would cause OOM for most ML operators.

The 90% utilization threshold prevents system instability. At 100% utilization, the system has no headroom for spikes, garbage collection, or OS processes. The 10% buffer acts as a safety valve.

The bottleneck resource approach (ignoring GPU utilization for batch sizing) separates concerns: batch size controls data flow through CPU/memory, while `num_proc` controls GPU parallelism. This prevents conflating two independent dimensions of resource management.
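A worked example of the formula, with illustrative numbers: suppose the machine idles at 10% CPU utilization and the probe peaks at 55%.

```python
idle, peak, util_th = 0.10, 0.55, 0.90   # illustrative utilizations

used_util = max(0.0, peak - idle)        # 0.45: usage attributable to the op
headroom = max(0.0, util_th - idle)      # 0.80: room below the 90% threshold
load_factor = headroom / used_util       # ~1.78: the op can grow its batch

base_bs = 1000                           # CPU default batch size
bs = min(max(int(base_bs * load_factor), 1), 10_000)
print(bs)                                # 1777
```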

The probing phase expands the dataset by `num_proc` to simulate real multi-process execution, ensuring the measured utilization reflects actual production behavior rather than artificially low single-process metrics.

Code Evidence

Default batch sizes from `base_op.py:27,378-381`:

DEFAULT_BATCH_SIZE = 1000

if self.accelerator == "cuda":
    self.batch_size = kwargs.get("batch_size", 10)
else:
    self.batch_size = kwargs.get("batch_size", DEFAULT_BATCH_SIZE)

Adaptive batch size calculation from `adapter.py:142-173`:

def batch_size_strategy(self, load_analysis_res, base_bs=1, util_th=0.9):
    batch_size_per_op = []
    # headroom left under the utilization threshold, per non-GPU resource
    left_utils = {}
    for key in self.idle_resources:
        if "util." not in key or "GPU" in key:
            continue
        left_utils[key] = max(0, util_th - self.idle_resources[key])

    for item in load_analysis_res:
        max_util = 1e-5  # avoid division by zero
        max_key = min(left_utils.items(), key=lambda it: it[1])[0]  # fallback bottleneck
        analysis_res = item["resource_analysis"]
        for key in analysis_res:
            if "util." not in key or "GPU" in key:
                continue
            used_util = max(0, analysis_res[key]["max"] - self.idle_resources[key])
            if used_util > max_util:
                max_util = used_util
                max_key = key
        load_factor = left_utils[max_key] / max_util
        bs_this_op = min(max(int(base_bs * load_factor), 1), self.MAX_BATCH_SIZE)
        batch_size_per_op.append(bs_this_op)

    return batch_size_per_op

Probing with dataset expansion from `adapter.py:54-72`:

# expand dataset by num_proc to simulate real load
expanded_dataset = concatenate_datasets([dataset] * op.runtime_np())

# set probe batch size to full sample count
if op.is_batched_op():
    old_batch_size = op.batch_size
    op.batch_size = sample_num

_, resource_util_per_op = Monitor.monitor_func(
    op.run, args=(expanded_dataset,), sample_interval=sample_interval
)
resource_util_per_op["speed"] = sample_num / resource_util_per_op["time"]
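A minimal stand-in for the `Monitor.monitor_func` wrapper used above might look like the sketch below. `read_util` is a hypothetical callable standing in for a real utilization probe; only the structure (a background thread sampling at a fixed interval while the operator runs) is taken from the source.

```python
import threading
import time

def monitor_func(func, args=(), sample_interval=0.5, read_util=lambda: 0.0):
    """Run func(*args) while a background thread samples a utilization
    reading every sample_interval seconds (0.5 s in Data-Juicer)."""
    samples, done = [], threading.Event()

    def sampler():
        while not done.is_set():
            samples.append(read_util())
            done.wait(sample_interval)   # sleep, but wake early once done

    t = threading.Thread(target=sampler)
    t.start()
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    done.set()
    t.join()
    return result, {"util.samples": samples, "time": elapsed}
```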

MAX_BATCH_SIZE constant from `adapter.py:18`:

MAX_BATCH_SIZE = 10000
