Implementation:Datajuicer Data juicer PartitionSizeOptimizer Calculate
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, Performance_Optimization |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
A concrete tool, provided by the Data-Juicer framework, for calculating optimal data partition sizes in distributed Ray pipelines.
Description
The PartitionSizeOptimizer class uses system resource information (available memory, read via psutil) and the characteristics of each data modality to determine optimal partition sizes. Its MODALITY_CONFIGS mapping defines, for text, image, audio, video, and multimodal data, a default partition size, maximum partition sizes (max_partition_size and max_partition_size_mb), a memory multiplier, and a complexity multiplier.
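To make the shape of these per-modality settings concrete, here is a minimal, self-contained sketch. The `ModalityType` and `ModalityConfig` definitions are local stand-ins whose bodies are assumed (only the field names and values come from this page's Signature section), and `clamp_partition_samples` is a hypothetical helper, not part of Data-Juicer. It also assumes that `max_partition_size` counts samples.

```python
from dataclasses import dataclass
from enum import Enum


class ModalityType(Enum):
    # Stand-in for the enum referenced by MODALITY_CONFIGS (assumed shape).
    TEXT = "text"
    IMAGE = "image"
    VIDEO = "video"


@dataclass(frozen=True)
class ModalityConfig:
    # Field names follow the MODALITY_CONFIGS excerpt; the class body is assumed.
    default_partition_size: int
    max_partition_size: int
    max_partition_size_mb: int
    memory_multiplier: float
    complexity_multiplier: float


# Values copied from the excerpt on this page (AUDIO and MULTIMODAL are elided there).
MODALITY_CONFIGS = {
    ModalityType.TEXT: ModalityConfig(10000, 50000, 256, 1.0, 1.0),
    ModalityType.IMAGE: ModalityConfig(2000, 10000, 256, 5.0, 3.0),
    ModalityType.VIDEO: ModalityConfig(400, 2000, 256, 20.0, 15.0),
}


def clamp_partition_samples(modality: ModalityType, requested: int) -> int:
    """Hypothetical helper: keep a requested sample count within the modality's cap."""
    config = MODALITY_CONFIGS[modality]
    return max(1, min(requested, config.max_partition_size))
```

With these values, a request for 5000 video samples per partition would be capped at 2000, while the same request for text would pass through unchanged.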
Usage
Used internally by PartitionedRayExecutor to determine partition count before distributing data across Ray workers. Can also be used directly for custom partitioning strategies.
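As a rough illustration of how a target partition size could translate into a partition count, the sketch below divides the dataset size by the target and rounds up. The helper `estimate_partition_count` and its parameters are hypothetical; the logic PartitionedRayExecutor actually uses may differ.

```python
import math


def estimate_partition_count(dataset_size_mb: float, target_partition_mb: int) -> int:
    """Hypothetical helper: number of partitions needed to hold the dataset."""
    if target_partition_mb <= 0:
        raise ValueError("target_partition_mb must be positive")
    if dataset_size_mb <= 0:
        # An empty dataset still gets a single (empty) partition in this sketch.
        return 1
    # Round up so the final, partially filled partition is counted.
    return max(1, math.ceil(dataset_size_mb / target_partition_mb))
```

Rounding up rather than down keeps every partition at or below the target size, at the cost of one smaller trailing partition.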
Code Reference
Source Location
- Repository: data-juicer
- File: data_juicer/core/executor/partition_size_optimizer.py
- Lines: L231-855
Signature
class PartitionSizeOptimizer:
    """Automatically optimizes partition sizes based on data characteristics
    and available resources."""

    # Modality-specific configurations
    MODALITY_CONFIGS = {
        ModalityType.TEXT: ModalityConfig(
            default_partition_size=10000,
            max_partition_size=50000,
            max_partition_size_mb=256,
            memory_multiplier=1.0,
            complexity_multiplier=1.0,
        ),
        ModalityType.IMAGE: ModalityConfig(
            default_partition_size=2000,
            max_partition_size=10000,
            max_partition_size_mb=256,
            memory_multiplier=5.0,
            complexity_multiplier=3.0,
        ),
        ModalityType.VIDEO: ModalityConfig(
            default_partition_size=400,
            max_partition_size=2000,
            max_partition_size_mb=256,
            memory_multiplier=20.0,
            complexity_multiplier=15.0,
        ),
        # ... AUDIO, MULTIMODAL configs
    }

    def calculate_target_partition_mb(
        self,
        available_memory_gb: float
    ) -> int:
        """
        Calculate target partition size in MB.

        Args:
            available_memory_gb: Available system memory in GB.

        Returns:
            Target partition size in MB (32-256 range).
        """
Import
from data_juicer.core.executor.partition_size_optimizer import PartitionSizeOptimizer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| available_memory_gb | float | Yes | Available system memory in GB |
| cfg.partition.target_size_mb | int | No | Configured override for partition size |
Outputs
| Name | Type | Description |
|---|---|---|
| target_size_mb | int | Optimal partition size in MB (32-256 range) |
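The contract above can be sketched as a small resolution step: use the configured value when one is set, otherwise keep the computed value inside the documented 32-256 MB range. The helper `resolve_target_size_mb` is hypothetical, and two points are assumptions rather than documented behavior: that the override takes precedence unchanged, and that the bounds are enforced by simple clamping.

```python
from typing import Optional

# Bounds taken from the documented output range on this page.
MIN_TARGET_MB = 32
MAX_TARGET_MB = 256


def resolve_target_size_mb(computed_mb: float, override_mb: Optional[int] = None) -> int:
    """Hypothetical helper: pick the partition size to use, in MB."""
    if override_mb is not None:
        # Assumption: a configured cfg.partition.target_size_mb wins as-is.
        return int(override_mb)
    # Otherwise clamp the computed estimate into the documented range.
    return int(min(MAX_TARGET_MB, max(MIN_TARGET_MB, computed_mb)))
```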
Usage Examples
Automatic Partition Sizing
from data_juicer.core.executor.partition_size_optimizer import PartitionSizeOptimizer
optimizer = PartitionSizeOptimizer()
# For a machine with 128 GB of available memory
target_mb = optimizer.calculate_target_partition_mb(128.0)
print(f"Target partition size: {target_mb} MB") # 128 MB
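Instead of hard-coding the memory figure, the available memory can be measured at run time. The sketch below uses `psutil.virtual_memory().available` when psutil is installed and falls back to `os.sysconf` on POSIX systems. Both helper functions are hypothetical, and treating 1 GB as 1024**3 bytes is an assumption about the unit the optimizer expects.

```python
import os


def available_memory_gb() -> float:
    """Return the currently available system memory in GB (1 GB = 1024**3 bytes here)."""
    try:
        import psutil  # third-party package; the class description mentions it

        available_bytes = psutil.virtual_memory().available
    except ImportError:
        # Fallback for POSIX systems without psutil installed.
        available_bytes = os.sysconf("SC_AVPHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    return available_bytes / (1024 ** 3)


def target_mb_for_this_machine(optimizer) -> int:
    """Pass the measured available memory to calculate_target_partition_mb."""
    return optimizer.calculate_target_partition_mb(available_memory_gb())
```

Given a `PartitionSizeOptimizer` instance named `optimizer`, calling `target_mb_for_this_machine(optimizer)` would return the target for the current host rather than for a fixed memory figure.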