Implementation:EvolvingLMMs Lab Lmms eval Get Split Zip

From Leeroopedia

File: `/tmp/kapso_repo_sslb_59s/tools/get_split_zip.py`

Principle: Data_Tooling

Overview

The Get Split Zip tool splits large ZIP files into smaller parts to comply with hosting limits (e.g., Hugging Face's 5GB per-file limit). It reads files from a source ZIP and distributes them across multiple output ZIPs based on a configurable size threshold.

Key Components

Size Parsing

def parse_size(size_str: str) -> int:
    """Parse human-readable size string to bytes.

    Args:
        size_str: Size string like "5GB", "500MB", "1024" (bytes)

    Returns:
        Size in bytes
    """
    size_str = size_str.strip().upper()
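    # Longer suffixes come first: every unit ends in "B", so testing "B"
    # first would match "5GB" and then fail on float("5G").
    units = {"TB": 1024**4, "GB": 1024**3, "MB": 1024**2, "KB": 1024, "B": 1}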

    for unit, multiplier in units.items():
        if size_str.endswith(unit):
            return int(float(size_str[: -len(unit)]) * multiplier)

    return int(size_str)

The parser:

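  • Supports units: B, KB, MB, GB, TB (binary multiples, so 1 KB = 1024 bytes)
  • Handles both integer and float values (e.g., "2.5GB"); the result is truncated to a whole number of bytes
  • Falls back to treating a bare number as bytes
  • Matches units case-insensitively (the input is upper-cased first)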
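The behaviour listed above can be sketched as a standalone function. This is only an illustration: `parse_size_sketch` and `UNITS` are hypothetical names, and the sketch assumes two-letter suffixes should be tried before the bare "B" unit so that "5GB" is not matched as "5G" plus "B".

```python
# Hypothetical helper illustrating suffix-based size parsing; not the tool's own code.
UNITS = {"TB": 1024**4, "GB": 1024**3, "MB": 1024**2, "KB": 1024, "B": 1}


def parse_size_sketch(size_str: str) -> int:
    """Convert strings such as "5GB", "2.5gb" or "1024" to a byte count."""
    text = size_str.strip().upper()
    # Two-letter units precede "B" in UNITS, so "5GB" matches "GB" first.
    for unit, multiplier in UNITS.items():
        if text.endswith(unit):
            return int(float(text[: -len(unit)]) * multiplier)
    return int(text)  # bare number: already a byte count


print(parse_size_sketch("5GB"))    # 5368709120
print(parse_size_sketch("2.5gb"))  # 2684354560
print(parse_size_sketch("1024"))   # 1024
```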

ZIP Splitting Logic

def split_zip(input_zip: str, output_dir: str, max_size: int = 5 * 1024**3) -> int:
    """Split a ZIP file into multiple smaller ZIP files.

    Args:
        input_zip: Path to the input ZIP file
        output_dir: Directory to write the split ZIP files
        max_size: Maximum size per output file in bytes (default: 5GB)

    Returns:
        Number of parts created
    """
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    part = 1
    current_size = 0
    prefix_name = os.path.splitext(os.path.basename(input_zip))[0]
    output_zip = zipfile.ZipFile(
        os.path.join(output_dir, f"{prefix_name}_part_{part}.zip"),
        "w",
        zipfile.ZIP_DEFLATED,
    )

    with zipfile.ZipFile(input_zip, "r") as zip_ref:
        for file in zip_ref.namelist():
            file_data = zip_ref.read(file)
            file_size = len(file_data)

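            # Roll over only when the current part already holds data, so a
            # file larger than max_size never leaves an empty part behind.
            if current_size > 0 and current_size + file_size > max_size: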
                output_zip.close()
                part += 1
                current_size = 0
                output_zip = zipfile.ZipFile(
                    os.path.join(output_dir, f"{prefix_name}_part_{part}.zip"),
                    "w",
                    zipfile.ZIP_DEFLATED,
                )

            output_zip.writestr(file, file_data)
            current_size += file_size

    output_zip.close()
    return part

The splitting algorithm:

  1. Creates the output directory if it doesn't exist
  2. Extracts the base name from the input ZIP for naming parts
  3. Iterates through files in the source ZIP
  4. Tracks cumulative size in the current output ZIP
  5. Creates a new part when adding a file would exceed `max_size`
  6. Uses `ZIP_DEFLATED` compression for output files
  7. Returns the total number of parts created
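The rollover rule at the heart of these steps can be isolated from the ZIP handling. The following is a minimal sketch that works on a plain list of sizes; `plan_parts` is a hypothetical name, and the non-empty check on the current part is an assumption of this sketch rather than something taken from the tool.

```python
from typing import List


def plan_parts(sizes: List[int], max_size: int) -> List[List[int]]:
    """Group item indices into consecutive parts using the greedy rule."""
    parts: List[List[int]] = [[]]
    current_size = 0
    for index, size in enumerate(sizes):
        # Start a new part when this item would push the current one over
        # the limit. Requiring a non-empty current part (sketch assumption)
        # keeps an oversized first item from leaving an empty part behind.
        if parts[-1] and current_size + size > max_size:
            parts.append([])
            current_size = 0
        parts[-1].append(index)
        current_size += size
    return parts


print(plan_parts([4, 3, 5, 2, 9, 1], max_size=8))
# [[0, 1], [2, 3], [4], [5]]
```

Item 4 (size 9) is larger than the limit of 8, so it ends up alone in a part that exceeds the limit.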

Command-Line Interface

def main() -> None:
    parser = argparse.ArgumentParser(
        description="Split large ZIP files into smaller parts.",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
    %(prog)s dataset.zip ./split_output/
    %(prog)s dataset.zip ./split_output/ --max-size 2GB
    %(prog)s large_archive.zip ./parts/ --max-size 500MB
        """,
    )
    parser.add_argument("input_zip", help="Path to the input ZIP file")
    parser.add_argument("output_dir", help="Directory to write split ZIP files")
    parser.add_argument(
        "--max-size",
        default="5GB",
        help="Maximum size per output file (default: 5GB). "
        "Supports units: B, KB, MB, GB, TB",
    )

    args = parser.parse_args()

    if not os.path.exists(args.input_zip):
        parser.error(f"Input file not found: {args.input_zip}")

    max_bytes = parse_size(args.max_size)
    num_parts = split_zip(args.input_zip, args.output_dir, max_bytes)
    print(f"Split into {num_parts} part(s) in {args.output_dir}")


if __name__ == "__main__":
    main()

The CLI:

  • Provides clear usage examples in the epilog
  • Validates input file existence
  • Uses human-readable size specification
  • Reports completion status with part count
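As a design alternative, the size string could be converted by `argparse` itself through the `type` argument, so that a malformed value is reported as a usage error. This is a sketch under stated assumptions, not how the tool does it: `parse_size_sketch` and `build_parser` are hypothetical names.

```python
import argparse


def parse_size_sketch(size_str: str) -> int:
    """Hypothetical converter: "2GB" -> bytes, bare numbers -> bytes."""
    text = size_str.strip().upper()
    units = {"TB": 1024**4, "GB": 1024**3, "MB": 1024**2, "KB": 1024, "B": 1}
    for unit, multiplier in units.items():
        if text.endswith(unit):
            return int(float(text[: -len(unit)]) * multiplier)
    return int(text)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Split large ZIP files into smaller parts.")
    parser.add_argument("input_zip", help="Path to the input ZIP file")
    parser.add_argument("output_dir", help="Directory to write split ZIP files")
    # argparse applies `type` to the raw string (and to a string default);
    # a ValueError raised by the converter becomes a usage error.
    parser.add_argument("--max-size", type=parse_size_sketch, default="5GB")
    return parser


args = build_parser().parse_args(["dataset.zip", "./split_output/", "--max-size", "2GB"])
print(args.max_size)  # 2147483648
```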

Usage Examples

# Split with default 5GB limit
python get_split_zip.py dataset.zip ./split_output/

# Split with custom 2GB limit
python get_split_zip.py dataset.zip ./split_output/ --max-size 2GB

# Split with 500MB limit
python get_split_zip.py large_archive.zip ./parts/ --max-size 500MB

Design Patterns

File-at-a-Time Splitting

The tool copies each archive member whole into exactly one part and never splits an individual file across parts. This ensures:

  • Each file appears complete in one part
  • No need for reassembly logic beyond extracting all parts
  • Simpler implementation and usage
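Since every file lives in exactly one part, reassembly amounts to extracting all parts into the same directory. A minimal sketch, assuming the `<prefix>_part_<n>.zip` naming produced by `split_zip`; the function name `extract_parts` is hypothetical.

```python
import glob
import os
import zipfile
from typing import List


def extract_parts(parts_dir: str, prefix: str, dest_dir: str) -> List[str]:
    """Extract every `<prefix>_part_<n>.zip` found in parts_dir into dest_dir."""
    pattern = os.path.join(glob.escape(parts_dir), f"{glob.escape(prefix)}_part_*.zip")
    # Extraction order does not matter: no member spans more than one part.
    part_paths = sorted(glob.glob(pattern))
    for path in part_paths:
        with zipfile.ZipFile(path, "r") as part:
            part.extractall(dest_dir)
    return part_paths
```

For example, `extract_parts("./split_output", "dataset", "./dataset")` would restore the contents of `dataset.zip` under `./dataset`.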

Size-Based Partitioning

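The partitioning strategy is purely size-based: a new part starts whenever adding the next file would exceed the threshold. The size counted is `len(file_data)`, i.e. the uncompressed size of each file, so the compressed parts written to disk are usually smaller than `max_size`.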
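Because `split_zip` adds up `len(file_data)`, the threshold applies to uncompressed bytes rather than to the size of the part on disk. The sketch below compares the two totals for an archive using `ZipInfo.file_size` and `ZipInfo.compress_size`; the helper name `archive_sizes` and the in-memory demo archive are illustrative assumptions.

```python
import io
import zipfile
from typing import Tuple


def archive_sizes(zip_file) -> Tuple[int, int]:
    """Return (total uncompressed bytes, total compressed bytes) for an archive."""
    with zipfile.ZipFile(zip_file, "r") as archive:
        infos = archive.infolist()
        uncompressed = sum(info.file_size for info in infos)
        compressed = sum(info.compress_size for info in infos)
    return uncompressed, compressed


# Demo: an in-memory archive holding highly compressible data.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("zeros.bin", b"\0" * 100_000)
buffer.seek(0)

uncompressed, compressed = archive_sizes(buffer)
print(uncompressed, compressed)  # the second number is far smaller than the first
```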

Greedy Algorithm

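Files are added to the current part, in source order, until the next file would exceed the limit; then a new part is started and earlier parts are never revisited. This greedy approach needs only a single pass and gives predictable output, though it does not try to minimize the number of parts.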

In-Memory Processing

Files are read entirely into memory (`zip_ref.read(file)`), which is suitable for most dataset files but may not work for extremely large individual files exceeding available RAM.
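If memory becomes a concern, a member can be copied in chunks instead of being read whole. This is a sketch of one possible alternative, not the tool's behaviour; `copy_member_streaming` is a hypothetical name, and it relies on `ZipFile.open(..., "w")`, available since Python 3.6.

```python
import shutil
import zipfile


def copy_member_streaming(
    src: zipfile.ZipFile,
    dst: zipfile.ZipFile,
    name: str,
    chunk_size: int = 1024 * 1024,
) -> None:
    """Copy one member from src to dst in fixed-size chunks."""
    # force_zip64=True allows the entry to grow past 2 GiB, because its
    # final size is not known when the entry is opened for writing.
    with src.open(name, "r") as reader, dst.open(name, "w", force_zip64=True) as writer:
        shutil.copyfileobj(reader, writer, chunk_size)
```

Only one chunk is held in memory at a time, and the destination archive's own compression setting is applied to the copied member.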

Compression Preservation

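Every member is rewritten with `ZIP_DEFLATED` compression, matching typical dataset ZIP file conventions, regardless of how it was stored in the source archive. Because `writestr` is called with a plain file name, the original timestamps and permission bits are not carried over.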

Limitations

  • Individual file size: A single file larger than `max_size` is still written whole, so the part that holds it exceeds the limit
  • Memory usage: Each file is loaded entirely into memory during processing
  • No parallelization: Files are processed sequentially

Related Components

  • Principle: Data_Tooling
  • Use case: Preparing datasets for Hugging Face Hub upload with 5GB limit

Dependencies

  • argparse: Command-line argument parsing
  • os: File system operations
  • zipfile: ZIP file reading and writing
