Implementation:EvolvingLMMs Lab Lmms eval Get Split Zip

File: `/tmp/kapso_repo_sslb_59s/tools/get_split_zip.py`

Overview

The Get Split Zip tool splits large ZIP files into smaller parts to comply with hosting limits (e.g., Hugging Face's 5GB per-file limit). It reads files from a source ZIP and distributes them across multiple output ZIPs based on a configurable size threshold.

Key Components

Size Parsing

def parse_size(size_str: str) -> int:
    """Parse human-readable size string to bytes.

    Args:
        size_str: Size string like "5GB", "500MB", "1024" (bytes)

    Returns:
        Size in bytes
    """
    size_str = size_str.strip().upper()
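    # Test longer suffixes first: every unit ends in "B", so checking "B"
    # before "GB" would strip only the "B" and fail on float("5G").
    units = {"TB": 1024**4, "GB": 1024**3, "MB": 1024**2, "KB": 1024, "B": 1}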

    for unit, multiplier in units.items():
        if size_str.endswith(unit):
            return int(float(size_str[: -len(unit)]) * multiplier)

    return int(size_str)

The parser:

Supports units: B, KB, MB, GB, TB
Handles both integer and float values (e.g., "2.5GB")
Falls back to treating raw numbers as bytes
Case-insensitive unit parsing
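
As a quick check of the behavior listed above, here is a minimal standalone sketch. It assumes the suffix table is scanned longest-first (every unit ends in "B", so a bare "B" test would otherwise also match "5GB"). The sample inputs are illustrative and do not come from the tool's own tests.

```python
def parse_size(size_str: str) -> int:
    """Standalone copy of the parser; longer suffixes are tested first."""
    text = size_str.strip().upper()
    units = {"TB": 1024**4, "GB": 1024**3, "MB": 1024**2, "KB": 1024, "B": 1}
    for unit, multiplier in units.items():
        if text.endswith(unit):
            return int(float(text[: -len(unit)]) * multiplier)
    # No recognized suffix: treat the whole string as a byte count.
    return int(text)


# Illustrative inputs covering raw bytes, mixed case, floats, and whitespace.
samples = ["1024", "512b", "1KB", "500MB", "2.5GB", " 5gb "]
for sample in samples:
    print(f"{sample!r:>10} -> {parse_size(sample)} bytes")
```

Strings that are neither a number nor a number followed by a known unit raise `ValueError` from `float()` or `int()`.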

ZIP Splitting Logic

import os
import zipfile


def split_zip(input_zip: str, output_dir: str, max_size: int = 5 * 1024**3) -> int:
    """Split a ZIP file into multiple smaller ZIP files.

    Args:
        input_zip: Path to the input ZIP file
        output_dir: Directory to write the split ZIP files
        max_size: Maximum size per output file in bytes (default: 5GB)

    Returns:
        Number of parts created
    """
    # Create the output directory if needed; no error when it already exists.
    os.makedirs(output_dir, exist_ok=True)

    part = 1
    current_size = 0
    prefix_name = os.path.splitext(os.path.basename(input_zip))[0]
    output_zip = zipfile.ZipFile(
        os.path.join(output_dir, f"{prefix_name}_part_{part}.zip"),
        "w",
        zipfile.ZIP_DEFLATED,
    )

    with zipfile.ZipFile(input_zip, "r") as zip_ref:
        for file in zip_ref.namelist():
            file_data = zip_ref.read(file)
            file_size = len(file_data)

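            # Roll over to a new part only when the current one already holds
            # data; otherwise an oversized first file would leave part 1 empty.
            if current_size > 0 and current_size + file_size > max_size: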
                output_zip.close()
                part += 1
                current_size = 0
                output_zip = zipfile.ZipFile(
                    os.path.join(output_dir, f"{prefix_name}_part_{part}.zip"),
                    "w",
                    zipfile.ZIP_DEFLATED,
                )

            output_zip.writestr(file, file_data)
            current_size += file_size

    output_zip.close()
    return part

The splitting algorithm:

1. Creates the output directory if it doesn't exist
2. Extracts the base name from the input ZIP for naming parts (`{prefix_name}_part_{part}.zip`)
3. Iterates through the members of the source ZIP in archive order
4. Tracks the cumulative uncompressed size of the current output ZIP
5. Starts a new part when adding a file would exceed `max_size`
6. Uses `ZIP_DEFLATED` compression for output files
7. Returns the total number of parts created
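
The steps above can be exercised end to end on a toy archive. The sketch below restates the split loop compactly so it runs on its own. The `demo` helper, the member names, and the 100-byte limit are all hypothetical, and the loop assumes a new part is opened only when the current one is non-empty.

```python
import os
import tempfile
import zipfile


def split_zip(input_zip: str, output_dir: str, max_size: int) -> int:
    """Greedy split, restated for this demo (sizes are uncompressed bytes)."""
    os.makedirs(output_dir, exist_ok=True)
    prefix = os.path.splitext(os.path.basename(input_zip))[0]

    def open_part(n: int) -> zipfile.ZipFile:
        path = os.path.join(output_dir, f"{prefix}_part_{n}.zip")
        return zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED)

    part, current_size = 1, 0
    out = open_part(part)
    with zipfile.ZipFile(input_zip, "r") as src:
        for name in src.namelist():
            data = src.read(name)
            # Start a new part only if the current one already holds something.
            if current_size > 0 and current_size + len(data) > max_size:
                out.close()
                part, current_size = part + 1, 0
                out = open_part(part)
            out.writestr(name, data)
            current_size += len(data)
    out.close()
    return part


def demo() -> dict:
    """Build a toy archive, split it at 100 bytes, map part name -> members."""
    members = [("a.bin", 60), ("b.bin", 50), ("c.bin", 30), ("d.bin", 90)]
    with tempfile.TemporaryDirectory() as tmp:
        src_path = os.path.join(tmp, "toy.zip")
        with zipfile.ZipFile(src_path, "w") as zf:
            for name, size in members:
                zf.writestr(name, b"x" * size)
        out_dir = os.path.join(tmp, "parts")
        split_zip(src_path, out_dir, max_size=100)
        layout = {}
        for part_name in sorted(os.listdir(out_dir)):
            with zipfile.ZipFile(os.path.join(out_dir, part_name)) as zf:
                layout[part_name] = zf.namelist()
        return layout


if __name__ == "__main__":
    for part_name, names in demo().items():
        print(part_name, names)
```

With these sizes, `a.bin` fills the first part alone, `b.bin` and `c.bin` share the second, and `d.bin` lands in a third.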

Command-Line Interface

import argparse
import os


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Split large ZIP files into smaller parts.",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
    %(prog)s dataset.zip ./split_output/
    %(prog)s dataset.zip ./split_output/ --max-size 2GB
    %(prog)s large_archive.zip ./parts/ --max-size 500MB
        """,
    )
    parser.add_argument("input_zip", help="Path to the input ZIP file")
    parser.add_argument("output_dir", help="Directory to write split ZIP files")
    parser.add_argument(
        "--max-size",
        default="5GB",
        help="Maximum size per output file (default: 5GB). "
        "Supports units: B, KB, MB, GB, TB",
    )

    args = parser.parse_args()

    if not os.path.exists(args.input_zip):
        parser.error(f"Input file not found: {args.input_zip}")

    max_bytes = parse_size(args.max_size)
    num_parts = split_zip(args.input_zip, args.output_dir, max_bytes)
    print(f"Split into {num_parts} part(s) in {args.output_dir}")


if __name__ == "__main__":
    main()

The CLI:

Provides clear usage examples in the epilog
Validates input file existence
Uses human-readable size specification
Reports completion status with part count
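
One possible variation, not what the tool does, is to hand the size parser to argparse as the `type` of `--max-size`, so that a malformed value produces a usage error rather than a traceback. The sketch below assumes a longest-suffix-first `parse_size`; `build_parser` is a hypothetical helper name.

```python
import argparse


def parse_size(size_str: str) -> int:
    """Parse "5GB", "500MB", or a raw byte count into bytes."""
    text = size_str.strip().upper()
    units = {"TB": 1024**4, "GB": 1024**3, "MB": 1024**2, "KB": 1024, "B": 1}
    for unit, multiplier in units.items():
        if text.endswith(unit):
            return int(float(text[: -len(unit)]) * multiplier)
    return int(text)


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser variant that converts --max-size during parsing."""
    parser = argparse.ArgumentParser(
        description="Split large ZIP files into smaller parts."
    )
    parser.add_argument("input_zip", help="Path to the input ZIP file")
    parser.add_argument("output_dir", help="Directory to write split ZIP files")
    # argparse applies `type` to the raw string (and to a string default) and
    # reports a ValueError raised by it as a normal usage error.
    parser.add_argument("--max-size", type=parse_size, default="5GB")
    return parser


args = build_parser().parse_args(
    ["dataset.zip", "./split_output/", "--max-size", "2GB"]
)
print(args.max_size)
```

With this arrangement `args.max_size` is already an integer byte count, so no separate conversion step is needed after parsing.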

Usage Examples

# Split with default 5GB limit
python get_split_zip.py dataset.zip ./split_output/

# Split with custom 2GB limit
python get_split_zip.py dataset.zip ./split_output/ --max-size 2GB

# Split with 500MB limit
python get_split_zip.py large_archive.zip ./parts/ --max-size 500MB

Design Patterns

File-at-a-Time Splitting

The tool copies each archive member whole into exactly one part; it never splits an individual file across parts. This ensures:

Each file appears complete in one part
No need for reassembly logic beyond extracting all parts
Simpler implementation and usage
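
Since every member lives in exactly one part, reassembly amounts to extracting each part into the same directory. A minimal sketch follows, assuming the parts use the `{prefix}_part_{n}.zip` naming shown earlier; `extract_parts` and `demo` are hypothetical helpers, not part of the tool.

```python
import glob
import os
import tempfile
import zipfile


def extract_parts(parts_dir: str, prefix: str, dest_dir: str) -> list:
    """Extract every `<prefix>_part_<n>.zip` found in parts_dir into dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    pattern = os.path.join(
        glob.escape(parts_dir), f"{glob.escape(prefix)}_part_*.zip"
    )
    extracted = []
    # Extraction order does not matter: the parts hold disjoint members.
    for part_path in sorted(glob.glob(pattern)):
        with zipfile.ZipFile(part_path) as zf:
            zf.extractall(dest_dir)
            extracted.extend(zf.namelist())
    return extracted


def demo() -> list:
    """Create two tiny parts, extract them, and list the restored files."""
    with tempfile.TemporaryDirectory() as tmp:
        parts_dir = os.path.join(tmp, "parts")
        dest_dir = os.path.join(tmp, "restored")
        os.makedirs(parts_dir)
        files = [("a.txt", "alpha"), ("b.txt", "beta")]
        for n, (name, text) in enumerate(files, start=1):
            part_path = os.path.join(parts_dir, f"dataset_part_{n}.zip")
            with zipfile.ZipFile(part_path, "w") as zf:
                zf.writestr(name, text)
        extract_parts(parts_dir, "dataset", dest_dir)
        return sorted(os.listdir(dest_dir))


if __name__ == "__main__":
    print(demo())
```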

Size-Based Partitioning

The partitioning strategy is purely size-based: it starts a new part whenever adding the next file's uncompressed size would push the current part past the threshold.

Greedy Algorithm

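Files are added to the current part in archive order until the next file would exceed the limit, then a new part is started. This greedy approach needs only a single pass and gives a predictable layout, but it does not try to minimize the number of parts.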
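
To make the greedy rule concrete, the sketch below applies it to a plain list of sizes instead of a real archive. `plan_parts` is a hypothetical helper, and it assumes a new part is opened only when the current one is non-empty.

```python
from typing import List


def plan_parts(sizes: List[int], max_size: int) -> List[List[int]]:
    """Group sizes into parts in order, never reordering or backfilling."""
    parts: List[List[int]] = [[]]
    current = 0
    for size in sizes:
        # Roll over when the next item would not fit in a non-empty part.
        if parts[-1] and current + size > max_size:
            parts.append([])
            current = 0
        parts[-1].append(size)
        current += size
    return parts


print(plan_parts([4, 3, 2, 6, 1], max_size=6))
```

For this input the greedy pass yields four parts, `[[4], [3, 2], [6], [1]]`, while a packing that reorders files (`[6]`, `[4, 2]`, `[3, 1]`) would need only three. That is the trade-off for keeping archive order and a single pass.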

In-Memory Processing

Files are read entirely into memory (`zip_ref.read(file)`), which is suitable for most dataset files but may not work for extremely large individual files exceeding available RAM.
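
If memory becomes a concern, one alternative (not what the tool does) is to stream each member in chunks and take its size from the archive metadata instead of `len(file_data)`. The sketch below uses only standard `zipfile` and `shutil` calls; `copy_member_streaming` and `demo` are hypothetical names.

```python
import io
import shutil
import zipfile


def copy_member_streaming(
    src: zipfile.ZipFile,
    dst: zipfile.ZipFile,
    info: zipfile.ZipInfo,
    chunk_size: int = 1024 * 1024,
) -> int:
    """Copy one member in chunks and return its uncompressed size."""
    # force_zip64 allows the entry to grow large, since its final size is
    # not known to the writer up front.
    with src.open(info, "r") as reader, dst.open(
        info.filename, "w", force_zip64=True
    ) as writer:
        shutil.copyfileobj(reader, writer, chunk_size)
    # The size comes from the central directory, not from reading the data.
    return info.file_size


def demo() -> bool:
    """Round-trip one member through the streaming copy using in-memory ZIPs."""
    payload = b"0123456789" * 1000
    src_buf, dst_buf = io.BytesIO(), io.BytesIO()
    with zipfile.ZipFile(src_buf, "w") as zf:
        zf.writestr("big.bin", payload)
    with zipfile.ZipFile(src_buf) as src, zipfile.ZipFile(
        dst_buf, "w", zipfile.ZIP_DEFLATED
    ) as dst:
        for info in src.infolist():
            copy_member_streaming(src, dst, info, chunk_size=4096)
    with zipfile.ZipFile(dst_buf) as check:
        return check.read("big.bin") == payload


if __name__ == "__main__":
    print(demo())
```

Peak memory then depends on `chunk_size` rather than on the largest member.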

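Output Compression

Every output part is written with `ZIP_DEFLATED`, regardless of how members were stored in the source archive. Because the size threshold is checked against uncompressed bytes, a part holding compressible data usually ends up smaller on disk than `max_size`.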

Limitations

Individual file size: A single file larger than `max_size` cannot be split, so it is written to a part of its own that may exceed the limit
Memory usage: Each file is loaded entirely into memory during processing
No parallelization: Files are processed sequentially
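
The first limitation can be detected before splitting by scanning the archive metadata for members larger than the threshold. A minimal sketch, where `find_oversized` is a hypothetical helper and the sizes are illustrative:

```python
import io
import zipfile
from typing import List, Tuple


def find_oversized(zip_file, max_size: int) -> List[Tuple[str, int]]:
    """Return (name, uncompressed_size) for members larger than max_size."""
    with zipfile.ZipFile(zip_file) as zf:
        return [
            (info.filename, info.file_size)
            for info in zf.infolist()
            if info.file_size > max_size
        ]


# Build a small in-memory archive with one member above a 100-byte limit.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("small.bin", b"x" * 40)
    zf.writestr("big.bin", b"x" * 150)

print(find_oversized(buf, max_size=100))
```

This reads only the central directory, so it does not load any member data into memory.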

Related Components

Principle: Data_Tooling
Use case: Preparing datasets for Hugging Face Hub upload with 5GB limit

Dependencies

argparse: Command-line argument parsing
os: File system operations
zipfile: ZIP file reading and writing
