Implementation:Datajuicer Data juicer PartitionSizeOptimizer Calculate

From Leeroopedia
Knowledge Sources
Domains Distributed_Computing, Performance_Optimization
Last Updated 2026-02-14 17:00 GMT

Overview

A concrete tool, provided by the Data-Juicer framework, for calculating optimal data partition sizes in distributed Ray pipelines.

Description

The PartitionSizeOptimizer class uses system resource information (available memory, read via psutil) and data modality characteristics to determine optimal partition sizes. It defines modality-specific configurations (MODALITY_CONFIGS) for text, image, audio, video, and multimodal data. Each configuration holds a default partition size, a maximum partition size, a maximum partition size in MB, a memory multiplier, and a complexity multiplier.
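As a rough illustration of how such per-modality settings can be organized, the sketch below mirrors the field names and the text, image, and video values listed in the Signature section. The dataclass layout, the enum values, the field comments, and the `cap_rows` helper are assumptions made for illustration; they are not taken from the Data-Juicer source.

```python
from dataclasses import dataclass
from enum import Enum


class ModalityType(Enum):
    # Enum values are illustrative; only the member names appear on this page.
    TEXT = "text"
    IMAGE = "image"
    VIDEO = "video"


@dataclass(frozen=True)
class ModalityConfig:
    # Field meanings below are our reading of the names, not documented facts.
    default_partition_size: int   # items per partition used by default
    max_partition_size: int       # upper bound on items per partition
    max_partition_size_mb: int    # upper bound on partition size in MB
    memory_multiplier: float      # relative memory cost
    complexity_multiplier: float  # relative processing cost


# Values copied from the Signature section of this page.
MODALITY_CONFIGS = {
    ModalityType.TEXT: ModalityConfig(10000, 50000, 256, 1.0, 1.0),
    ModalityType.IMAGE: ModalityConfig(2000, 10000, 256, 5.0, 3.0),
    ModalityType.VIDEO: ModalityConfig(400, 2000, 256, 20.0, 15.0),
}


def cap_rows(modality: ModalityType, requested_rows: int) -> int:
    """Hypothetical helper: limit a requested count to the modality maximum."""
    return min(requested_rows, MODALITY_CONFIGS[modality].max_partition_size)


print(cap_rows(ModalityType.VIDEO, 5000))  # 2000
print(cap_rows(ModalityType.TEXT, 5000))   # 5000
```

Keeping the settings in one frozen dataclass per modality makes the heavier modalities (smaller defaults, larger multipliers) easy to compare side by side.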

Usage

Used internally by PartitionedRayExecutor to determine the partition count before data is distributed across Ray workers. It can also be called directly when implementing a custom partitioning strategy.
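One simple way to turn a target partition size into a partition count is a ceiling division of the dataset size by that target. The sketch below shows this idea only; `partition_count` and its arguments are hypothetical names, and how PartitionedRayExecutor actually derives the count is not shown on this page.

```python
import math


def partition_count(total_size_mb: float, target_partition_mb: int) -> int:
    """Hypothetical helper: number of partitions needed to cover the dataset."""
    if target_partition_mb <= 0:
        raise ValueError("target_partition_mb must be positive")
    # Always return at least one partition, even for an empty or tiny dataset.
    return max(1, math.ceil(total_size_mb / target_partition_mb))


print(partition_count(10_000, 128))  # 79
```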

Code Reference

Source Location

  • Repository: data-juicer
  • File: data_juicer/core/executor/partition_size_optimizer.py
  • Lines: L231-855

Signature

class PartitionSizeOptimizer:
    """Automatically optimizes partition sizes based on data characteristics
    and available resources."""

    # Modality-specific configurations
    MODALITY_CONFIGS = {
        ModalityType.TEXT: ModalityConfig(
            default_partition_size=10000,
            max_partition_size=50000,
            max_partition_size_mb=256,
            memory_multiplier=1.0,
            complexity_multiplier=1.0,
        ),
        ModalityType.IMAGE: ModalityConfig(
            default_partition_size=2000,
            max_partition_size=10000,
            max_partition_size_mb=256,
            memory_multiplier=5.0,
            complexity_multiplier=3.0,
        ),
        ModalityType.VIDEO: ModalityConfig(
            default_partition_size=400,
            max_partition_size=2000,
            max_partition_size_mb=256,
            memory_multiplier=20.0,
            complexity_multiplier=15.0,
        ),
        # ... AUDIO, MULTIMODAL configs
    }

    def calculate_target_partition_mb(
        self,
        available_memory_gb: float
    ) -> int:
        """
        Calculate target partition size in MB.

        Args:
            available_memory_gb: Available system memory in GB.

        Returns:
            Target partition size in MB (32-256 range).
        """

Import

from data_juicer.core.executor.partition_size_optimizer import PartitionSizeOptimizer

I/O Contract

Inputs

Name | Type | Required | Description
available_memory_gb | float | Yes | Available system memory in GB
cfg.partition.target_size_mb | int | No | Configured override for partition size
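One plausible reading of these two inputs is that a configured `cfg.partition.target_size_mb` takes precedence and the memory-based calculation is the fallback. The sketch below illustrates that precedence only: `resolve_target_size_mb` is a hypothetical helper, the `compute` callback stands in for `calculate_target_partition_mb`, and treating `None` as "not configured" is an assumption.

```python
from types import SimpleNamespace
from typing import Callable, Optional


def resolve_target_size_mb(
    configured_mb: Optional[int],
    available_memory_gb: float,
    compute: Callable[[float], int],
) -> int:
    """Hypothetical helper: prefer the configured value, else compute one."""
    if configured_mb is not None:
        return configured_mb
    return compute(available_memory_gb)


def placeholder_compute(available_memory_gb: float) -> int:
    # Stand-in for the optimizer's calculation; returns a fixed value here.
    return 128


# Stand-in config object; the real cfg structure may differ.
cfg = SimpleNamespace(partition=SimpleNamespace(target_size_mb=64))

print(resolve_target_size_mb(cfg.partition.target_size_mb, 16.0, placeholder_compute))  # 64
print(resolve_target_size_mb(None, 16.0, placeholder_compute))  # 128
```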

Outputs

Name | Type | Description
target_size_mb | int | Optimal partition size in MB (32-256 range)
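The output is documented only as falling in the 32-256 MB range; the formula itself is not reproduced on this page. As a minimal sketch of the clamping step alone, the function below assumes a made-up rule of 1 MB of partition per GB of available memory. That rule and the names `clamp_partition_mb`, `MIN_PARTITION_MB`, and `MAX_PARTITION_MB` are illustrative assumptions, not the library's heuristic.

```python
MIN_PARTITION_MB = 32   # lower bound stated in the I/O contract
MAX_PARTITION_MB = 256  # upper bound stated in the I/O contract


def clamp_partition_mb(available_memory_gb: float, mb_per_gb: float = 1.0) -> int:
    """Hypothetical rule: scale with memory, then clamp to the documented range."""
    raw = int(available_memory_gb * mb_per_gb)
    return max(MIN_PARTITION_MB, min(MAX_PARTITION_MB, raw))


print(clamp_partition_mb(8.0))    # 32
print(clamp_partition_mb(128.0))  # 128
print(clamp_partition_mb(512.0))  # 256
```

Whatever scaling rule is used, clamping keeps partitions from becoming too small on low-memory machines or too large on high-memory ones.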

Usage Examples

Automatic Partition Sizing

from data_juicer.core.executor.partition_size_optimizer import PartitionSizeOptimizer

optimizer = PartitionSizeOptimizer()
# For a machine with 128 GB of available memory
target_mb = optimizer.calculate_target_partition_mb(128.0)
print(f"Target partition size: {target_mb} MB")  # 128 MB

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
