Heuristic: Data-Juicer Partition Size Tuning
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, Optimization, Resource_Management |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
A resource-aware heuristic that dynamically adjusts partition sizes based on data modality (text/image/audio/video), available memory, and cluster resources to maximize distributed processing throughput.
Description
The partition size optimizer uses a multi-factor calculation to determine optimal partition sizes for Ray-based distributed processing. It considers data modality (with memory multipliers ranging from 1x for text to 20x for video), available system memory (with tiered target sizes from 32MB to 256MB), processing complexity, data skew, and parallelism requirements. The optimizer uses P90 memory estimation for robustness against outliers and applies a 2x safety margin for memory calculations.
Usage
Use this heuristic when configuring distributed Ray processing pipelines to avoid out-of-memory errors and maximize throughput. It is especially critical for mixed-modality datasets (text + images + video) where partition sizes must account for vastly different memory footprints per sample. The optimizer runs automatically when using `RayExecutorPartitioned`.
The Insight (Rule of Thumb)
- Modality Memory Multipliers:
- TEXT: 1.0x memory, default 10,000 samples/partition
- IMAGE: 5.0x memory, default 2,000 samples/partition
- AUDIO: 8.0x memory, default 1,000 samples/partition
- VIDEO: 20.0x memory, default 400 samples/partition
- MULTIMODAL: 10.0x memory, default 1,600 samples/partition
- Memory Target Tiers (dynamic based on available RAM):
- < 16 GB available: 32 MB target per partition
- 16-64 GB available: 64 MB target per partition
- 64-256 GB available: 128 MB target per partition
- >= 256 GB available: 256 MB target per partition
- Key Formulas (composed end to end in the sketch after this list):
- Target samples = target_memory_mb / (memory_per_sample_mb * complexity_multiplier)
- Max partition memory = (available_memory_gb * 1024 * 0.8) / 4 concurrent partitions
- Text bytes per sample = avg_text_length * 2.0 (conservative 2 bytes/char)
- Minimum partitions for large datasets (>10K samples) = CPU cores * 1.5
- Safety Rules:
- 2x safety margin on memory estimates
- 80% of available memory allocated for processing (20% reserved)
- 4 concurrent partitions assumed per node
- Partition size bounded: 32 MB minimum, 512 MB maximum
- Do not exceed 25% of available memory per single partition
- If data skew > 0.7, reduce partition size by 20%
- Sampling Strategy:
- < 1,000 samples: analyze all samples
- 1,000-100,000 samples: sample 1% (min 1,000)
- > 100,000 samples: sample 0.1% (cap at 10,000)
- Use P90 (90th percentile) memory, not mean, for conservative sizing
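To make these rules concrete, the sketch below composes them end to end. It is a minimal illustration, not the optimizer's actual API: the function names, the argument list, and the way the modality multiplier is applied to raw sampled sizes are all assumptions layered on the rules above.
```python
import math

# Illustrative constants mirroring the rules above; the real values live
# in partition_size_optimizer.py.
MEMORY_MULTIPLIERS = {
    "text": 1.0, "image": 5.0, "audio": 8.0, "video": 20.0, "multimodal": 10.0,
}
SAFETY_MARGIN = 2.0                   # 2x headroom for intermediate buffers
MIN_PARTITION_MB, MAX_PARTITION_MB = 32, 512

def target_memory_mb(available_memory_gb: float) -> int:
    """Tiered target partition size based on available RAM."""
    if available_memory_gb < 16:
        return 32
    elif available_memory_gb < 64:
        return 64
    elif available_memory_gb < 256:
        return 128
    return 256

def p90_sample_memory_mb(sample_sizes_mb: list[float]) -> float:
    """Conservative per-sample memory: P90 of the (non-empty) sampled sizes."""
    sorted_sizes = sorted(sample_sizes_mb)
    return sorted_sizes[int(len(sorted_sizes) * 0.9)]

def estimate_partition_size(
    sample_sizes_mb: list[float],      # raw sampled sizes, in MB
    modality: str,
    available_memory_gb: float,
    complexity_multiplier: float = 1.0,
    data_skew_factor: float = 0.0,
    cpu_cores: int = 8,
    total_samples: int = 100_000,
) -> int:
    """Samples per partition, following the rules of thumb above."""
    # Assumption: the modality multiplier converts raw data size into an
    # in-memory processing footprint, then the 2x safety margin applies.
    per_sample_mb = (
        p90_sample_memory_mb(sample_sizes_mb)
        * MEMORY_MULTIPLIERS[modality]
        * SAFETY_MARGIN
    )
    # Cap: a single partition may not exceed 25% of available memory, and
    # 80% of memory is shared by 4 assumed concurrent partitions per node.
    max_partition_mb = min(
        MAX_PARTITION_MB,
        available_memory_gb * 1024 * 0.25,
        (available_memory_gb * 1024 * 0.8) / 4,
    )
    target_mb = max(MIN_PARTITION_MB,
                    min(target_memory_mb(available_memory_gb), max_partition_mb))
    target_size = max(1, int(target_mb / (per_sample_mb * complexity_multiplier)))
    # Skew rule: heavily skewed data gets 20% smaller partitions.
    if data_skew_factor > 0.7:
        target_size = int(target_size * 0.8)
    # Parallelism floor: datasets over 10K samples need >= cores * 1.5 partitions.
    if total_samples > 10_000:
        min_partitions = int(cpu_cores * 1.5)
        target_size = min(target_size, math.ceil(total_samples / min_partitions))
    return target_size
```
For example, a text dataset with a raw P90 of 0.004 MB/sample (about 2,000 characters at 2 bytes/char) on a 64 GB node lands in the 128 MB tier, giving 128 / (0.004 × 1.0 × 2) ≈ 16,000 samples per partition before the parallelism floor applies.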
Reasoning
Partition sizing is a balancing act between memory efficiency, processing throughput, and load balancing. Too-large partitions cause OOM errors; too-small partitions increase scheduling overhead. The modality multipliers reflect real-world memory characteristics: a single video sample can consume 20x more memory than a text sample. The P90 approach prevents partition sizes from being tuned to the average case when outlier samples cause memory spikes. The 2x safety margin accounts for intermediate processing buffers (e.g., tokenization, model inference activations) that amplify memory usage beyond raw data size.
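A toy illustration (invented numbers) of why P90 beats the mean for sizing:
```python
# 90 small samples plus a 10% heavy tail, e.g., a few very long documents.
sizes_mb = [0.5] * 90 + [8.0] * 10

mean_mb = sum(sizes_mb) / len(sizes_mb)               # 1.25 MB/sample
p90_mb = sorted(sizes_mb)[int(len(sizes_mb) * 0.9)]   # 8.0 MB/sample

# Sizing by the mean packs ~6.4x too many samples into any partition that
# happens to collect the heavy tail; sizing by P90 stays safe.
```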
The complexity weight system (0.3 for model/embedding ops, 0.2 for filters, 0.1 for text cleaning) adjusts partition sizes based on operator pipeline complexity. Heavy operators need smaller partitions to avoid memory pressure.
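How the weights combine into the complexity multiplier is not shown in the excerpts below; one plausible reading, summing per-operator weights onto a 1.0 baseline, is sketched here with hypothetical operator-type names:
```python
# Hypothetical weights from the reasoning above; the aggregation (summing
# into a multiplier) is an assumption, not confirmed from the source.
OP_COMPLEXITY_WEIGHTS = {
    "model": 0.3, "embedding": 0.3, "filter": 0.2, "text_cleaning": 0.1,
}

def pipeline_complexity(op_types: list[str]) -> float:
    # Baseline 1.0; heavier pipelines raise the multiplier, which shrinks
    # the target partition size in the formula above.
    return 1.0 + sum(OP_COMPLEXITY_WEIGHTS.get(op, 0.0) for op in op_types)

# e.g., a filter + embedding pipeline: 1.0 + 0.2 + 0.3 = 1.5x
```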
Worker count is set to 75% of available CPU cores to leave headroom for system processes and Ray overhead. For many-partition workloads, workers can scale up to 120% of the base count.
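A sketch of the worker-count rule; the exact trigger for the 120% scale-up is not given above, so the condition used here is an assumption:
```python
def worker_count(available_cores: int, num_partitions: int) -> int:
    # 75% of cores leaves headroom for system processes and Ray overhead.
    base_workers = max(1, int(available_cores * 0.75))
    # Assumed trigger: treat "many partitions" as >2 partitions per worker,
    # then scale up to at most 120% of the base count.
    if num_partitions > base_workers * 2:
        return int(base_workers * 1.2)
    return base_workers
```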
Code Evidence
Modality configuration from `partition_size_optimizer.py:257-302`:
```python
MODALITY_CONFIGS = {
    ModalityType.TEXT: ModalityConfig(
        default_partition_size=10000, max_partition_size=50000,
        max_partition_size_mb=256, memory_multiplier=1.0, complexity_multiplier=1.0,
    ),
    # ... IMAGE, AUDIO, and MULTIMODAL entries elided ...
    ModalityType.VIDEO: ModalityConfig(
        default_partition_size=400, max_partition_size=2000,
        max_partition_size_mb=256, memory_multiplier=20.0, complexity_multiplier=15.0,
    ),
}
```
Memory target calculation from `partition_size_optimizer.py:247-254`:
```python
if available_memory_gb < 16:
    return 32
elif available_memory_gb < 64:
    return 64
elif available_memory_gb < 256:
    return 128
else:
    return 256
```
P90 memory estimation from `partition_size_optimizer.py:453-460`:
```python
p90_idx = int(len(sorted_sizes) * 0.9)
p90_memory = sorted_sizes[p90_idx]
avg_memory_per_sample_mb = p90_memory  # Use p90 for conservative sizing
```
Data skew adjustment from `partition_size_optimizer.py:674-679`:
```python
if characteristics.data_skew_factor > 0.7:
    skew_adjusted_size = int(target_size * 0.8)
    logger.info(f"Data skew adjustment: reducing partition size from {target_size} to {skew_adjusted_size}")
    target_size = skew_adjusted_size
```
Worker count heuristic from `partition_size_optimizer.py:194`:
```python
base_workers = max(1, int(available_cores * 0.75))  # 75% of cores
```