Principle:Datajuicer Data juicer Partition Size Optimization
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, Performance_Optimization |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
A resource-aware heuristic that chooses data partition sizes for distributed processing based on available memory and data modality.
Description
Partition Size Optimization determines how large each data partition should be when a dataset is split across distributed workers. Partitions that are too small create excessive scheduling overhead; partitions that are too large cause out-of-memory errors. The optimizer considers available system memory, data modality (text is lightweight; video is memory-intensive), and configurable overrides, and selects partition sizes in the range of 32 MB to 256 MB. Each modality (text, image, audio, video, multimodal) has its own memory multiplier and default partition size.
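To make the trade-off concrete, the short sketch below counts how many partitions a dataset produces at each end of the size range. The helper name `partition_count` and the 100 GB dataset size are illustrative assumptions, not part of Data-Juicer.

```python
import math


def partition_count(dataset_size_mb: float, partition_size_mb: float) -> int:
    """Return how many partitions are needed to cover the whole dataset."""
    if partition_size_mb <= 0:
        raise ValueError("partition_size_mb must be positive")
    return math.ceil(dataset_size_mb / partition_size_mb)


# Hypothetical 100 GB dataset, expressed in MB.
dataset_mb = 100 * 1024

for size_mb in (32, 64, 128, 256):
    # Smaller partitions mean more tasks to schedule; larger partitions
    # mean each task holds more data in memory at once.
    print(f"{size_mb:>3} MB partitions -> {partition_count(dataset_mb, size_mb)} tasks")
```

At 32 MB the scheduler has eight times as many tasks to manage as at 256 MB, while each 256 MB task must hold eight times as much raw data in memory.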
Usage
Use this principle when running partitioned Ray pipelines. It is automatically applied by the PartitionedRayExecutor to determine how to split the dataset before distributed processing.
Theoretical Basis
# Abstract algorithm (NOT real implementation)
def calculate_partition_size(available_memory_gb, modality,
                             target_size_mb=None, min_size=32, max_size=256):
    # Use the configured target (config.partition.target_size_mb) if available
    if target_size_mb:
        return target_size_mb
    # Dynamic sizing based on available memory
    if available_memory_gb < 16:
        base_size = 32  # MB
    elif available_memory_gb < 64:
        base_size = 64
    elif available_memory_gb < 256:
        base_size = 128
    else:
        base_size = 256
    # Adjust by modality memory multiplier (heavier modalities get smaller partitions)
    partition_size = base_size / modality.memory_multiplier
    # Clamp the result to the allowed range, in MB
    return max(min_size, min(partition_size, max_size))
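The abstract algorithm above can be exercised with a small, self-contained sketch. The multiplier values in `MEMORY_MULTIPLIER` are hypothetical placeholders chosen only to illustrate the arithmetic; the real per-modality multipliers are not given here. The function name `partition_size_mb` and the table-driven tier lookup are likewise assumptions for this example.

```python
from typing import Optional

# Hypothetical multipliers for illustration only; NOT Data-Juicer's actual values.
MEMORY_MULTIPLIER = {
    "text": 1.0,
    "image": 2.0,
    "audio": 2.0,
    "multimodal": 3.0,
    "video": 4.0,
}

# (exclusive upper bound on available memory in GB, base partition size in MB),
# mirroring the tiers in the abstract algorithm.
MEMORY_TIERS = [(16, 32.0), (64, 64.0), (256, 128.0)]
LARGEST_BASE_MB = 256.0
MIN_SIZE_MB, MAX_SIZE_MB = 32.0, 256.0


def partition_size_mb(available_memory_gb: float, modality: str,
                      target_size_mb: Optional[float] = None) -> float:
    """Pick a partition size in MB from memory, modality, and an optional override."""
    # A configured target takes precedence over dynamic sizing.
    if target_size_mb:
        return float(target_size_mb)
    # Find the first tier whose bound exceeds the available memory.
    base = LARGEST_BASE_MB
    for limit_gb, size_mb in MEMORY_TIERS:
        if available_memory_gb < limit_gb:
            base = size_mb
            break
    # Heavier modalities get proportionally smaller partitions.
    size = base / MEMORY_MULTIPLIER[modality]
    # Keep the result inside the allowed range.
    return max(MIN_SIZE_MB, min(size, MAX_SIZE_MB))


for mem_gb, modality in [(8, "video"), (128, "text"), (128, "video"), (512, "image")]:
    print(f"{mem_gb:>3} GB, {modality:<5} -> {partition_size_mb(mem_gb, modality)} MB")
```

Under these assumed multipliers, a low-memory machine processing video computes 32 / 4 = 8 MB, which the clamp raises to the 32 MB floor. The floor therefore decides the result whenever the multiplier would push the size below the minimum.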