Implementation:EvolvingLMMs Lab Lmms eval Get Split Zip
File: `/tmp/kapso_repo_sslb_59s/tools/get_split_zip.py`
Principle: Data_Tooling
Overview
The Get Split Zip tool splits large ZIP files into smaller parts to comply with hosting limits (e.g., Hugging Face's 5GB per-file limit). It reads files from a source ZIP and distributes them across multiple output ZIPs based on a configurable size threshold.
Key Components
Size Parsing
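import argparse
import os
import zipfile

def parse_size(size_str: str) -> int:
    """Parse human-readable size string to bytes.
    Args:
        size_str: Size string like "5GB", "500MB", "1024" (bytes)
    Returns:
        Size in bytes
    """
    size_str = size_str.strip().upper()
    # Multi-letter units are listed first: every unit ends in "B", so testing
    # the bare "B" suffix first would match "5GB" and fail on float("5G").
    units = {"TB": 1024**4, "GB": 1024**3, "MB": 1024**2, "KB": 1024, "B": 1}
    for unit, multiplier in units.items():
        if size_str.endswith(unit):
            return int(float(size_str[: -len(unit)]) * multiplier)
    # No recognized unit: treat the value as a raw byte count.
    return int(size_str)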
The parser:
- Supports units: B, KB, MB, GB, TB
- Handles both integer and float values (e.g., "2.5GB")
- Falls back to treating raw numbers as bytes
- Case-insensitive unit parsing
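To make the conversion concrete, here is a minimal, self-contained sketch with a few worked values. The name `parse_size_sketch` is hypothetical and the code is an illustration rather than the tool's own source; it assumes binary multiples (1 KB = 1024 bytes) as described above and tries longer suffixes first so that "GB" is not mistaken for a bare "B".

```python
def parse_size_sketch(size_str: str) -> int:
    """Hypothetical helper: convert strings such as "2.5GB" to a byte count."""
    text = size_str.strip().upper()
    units = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}
    # Try longer suffixes first so "GB" wins over the bare "B" suffix.
    for unit in sorted(units, key=len, reverse=True):
        if text.endswith(unit):
            return int(float(text[: -len(unit)]) * units[unit])
    return int(text)  # no unit: treat the value as raw bytes


print(parse_size_sketch("5GB"))    # 5368709120
print(parse_size_sketch("2.5gb"))  # 2684354560
print(parse_size_sketch("1024"))   # 1024
```

A value with no unit must be an integer in this sketch; something like "1.5" would raise `ValueError` because it reaches the plain `int()` fallback.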
ZIP Splitting Logic
def split_zip(input_zip: str, output_dir: str, max_size: int = 5 * 1024**3) -> int:
    """Split a ZIP file into multiple smaller ZIP files.
    Args:
        input_zip: Path to the input ZIP file
        output_dir: Directory to write the split ZIP files
        max_size: Maximum size per output file in bytes (default: 5GB)
    Returns:
        Number of parts created
    """
    os.makedirs(output_dir, exist_ok=True)
    part = 1
    current_size = 0  # uncompressed bytes written to the current part
    prefix_name = os.path.splitext(os.path.basename(input_zip))[0]
    output_zip = zipfile.ZipFile(
        os.path.join(output_dir, f"{prefix_name}_part_{part}.zip"),
        "w",
        zipfile.ZIP_DEFLATED,
    )
    with zipfile.ZipFile(input_zip, "r") as zip_ref:
        for file in zip_ref.namelist():
            file_data = zip_ref.read(file)  # whole member held in memory
            file_size = len(file_data)  # uncompressed size
            # Roll over when this file would push the current part past
            # max_size. The current_size > 0 guard keeps an oversized first
            # file from leaving an empty part behind.
            if current_size > 0 and current_size + file_size > max_size:
                output_zip.close()
                part += 1
                current_size = 0
                output_zip = zipfile.ZipFile(
                    os.path.join(output_dir, f"{prefix_name}_part_{part}.zip"),
                    "w",
                    zipfile.ZIP_DEFLATED,
                )
            output_zip.writestr(file, file_data)
            current_size += file_size
    output_zip.close()
    return part
The splitting algorithm:
1. Creates output directory if it doesn't exist
2. Extracts the base name from input ZIP for naming parts
3. Iterates through files in the source ZIP
4. Tracks cumulative size in current output ZIP
5. Creates a new part when adding a file would exceed max_size
6. Uses `ZIP_DEFLATED` compression for output files
7. Returns the total number of parts created
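The rollover rule in steps 4 and 5 can be tried without touching any files. The sketch below applies it to a plain list of sizes; `plan_parts` is a hypothetical helper, not part of the tool, and it assumes a rollover is skipped while the current part is still empty so that no empty part is produced.

```python
from typing import List


def plan_parts(sizes: List[int], max_size: int) -> List[List[int]]:
    """Hypothetical helper: group file sizes into parts with the greedy rule."""
    parts: List[List[int]] = [[]]
    current = 0  # running total for the part being filled
    for size in sizes:
        # Start a new part when this file would push the current one past
        # the limit (never when the current part is still empty).
        if parts[-1] and current + size > max_size:
            parts.append([])
            current = 0
        parts[-1].append(size)
        current += size
    return parts


print(plan_parts([4, 3, 5, 12, 1], max_size=10))
# → [[4, 3], [5], [12], [1]]
```

The file of size 12 exceeds the limit on its own, so it ends up alone in a part that is larger than `max_size`.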
Command-Line Interface
def main() -> None:
    parser = argparse.ArgumentParser(
        description="Split large ZIP files into smaller parts.",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  %(prog)s dataset.zip ./split_output/
  %(prog)s dataset.zip ./split_output/ --max-size 2GB
  %(prog)s large_archive.zip ./parts/ --max-size 500MB
""",
    )
    parser.add_argument("input_zip", help="Path to the input ZIP file")
    parser.add_argument("output_dir", help="Directory to write split ZIP files")
    parser.add_argument(
        "--max-size",
        default="5GB",
        help="Maximum size per output file (default: 5GB). "
        "Supports units: B, KB, MB, GB, TB",
    )
    args = parser.parse_args()
    if not os.path.exists(args.input_zip):
        parser.error(f"Input file not found: {args.input_zip}")
    max_bytes = parse_size(args.max_size)
    num_parts = split_zip(args.input_zip, args.output_dir, max_bytes)
    print(f"Split into {num_parts} part(s) in {args.output_dir}")

if __name__ == "__main__":
    main()
The CLI:
- Provides clear usage examples in the epilog
- Validates input file existence
- Uses human-readable size specification
- Reports completion status with part count
Usage Examples
# Split with default 5GB limit
python get_split_zip.py dataset.zip ./split_output/
# Split with custom 2GB limit
python get_split_zip.py dataset.zip ./split_output/ --max-size 2GB
# Split with 500MB limit
python get_split_zip.py large_archive.zip ./parts/ --max-size 500MB
Design Patterns
File-at-a-Time Splitting
The tool copies each archive member whole and never splits an individual file across parts. This ensures:
- Each file appears complete in one part
- No need for reassembly logic beyond extracting all parts
- Simpler implementation and usage
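Since every file lives in exactly one part, restoring the original contents amounts to extracting each part into the same directory. A minimal sketch of that step follows; `extract_all_parts` is a hypothetical helper, and the directory and prefix names in the usage lines are placeholders. It relies only on the `{prefix}_part_{n}.zip` naming used by `split_zip`.

```python
import glob
import os
import zipfile


def extract_all_parts(parts_dir: str, prefix: str, dest_dir: str) -> int:
    """Hypothetical helper: extract every <prefix>_part_<n>.zip into dest_dir."""
    pattern = os.path.join(glob.escape(parts_dir), f"{glob.escape(prefix)}_part_*.zip")
    # Extraction order does not matter because no file spans two parts.
    part_paths = sorted(glob.glob(pattern))
    for path in part_paths:
        with zipfile.ZipFile(path, "r") as part:
            part.extractall(dest_dir)
    return len(part_paths)


if __name__ == "__main__":
    # Placeholder paths; returns 0 if no matching parts are found.
    count = extract_all_parts("./split_output", "dataset", "./restored")
    print(f"Extracted {count} part(s)")
```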
Size-Based Partitioning
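The partitioning strategy is purely size-based, starting a new part whenever adding the next file would exceed the threshold. Size is measured as the uncompressed length of each file's data (`len(file_data)`), so the on-disk size of a part can differ from the tracked total: usually smaller for compressible data, and marginally larger for already-compressed data once ZIP headers are counted.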
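Because the threshold is compared against `len(file_data)` rather than the bytes written to disk, it may be worth checking the real size of each part before uploading. The following sketch does that with `os.path.getsize`; `parts_over_limit` is a hypothetical helper and the `*_part_*.zip` pattern simply mirrors the naming used by `split_zip`.

```python
import glob
import os
from typing import List, Tuple


def parts_over_limit(output_dir: str, limit_bytes: int) -> List[Tuple[str, int]]:
    """Hypothetical helper: list (path, size) for parts larger than limit_bytes on disk."""
    pattern = os.path.join(glob.escape(output_dir), "*_part_*.zip")
    oversized = []
    for path in sorted(glob.glob(pattern)):
        size = os.path.getsize(path)  # actual compressed size on disk
        if size > limit_bytes:
            oversized.append((path, size))
    return oversized


if __name__ == "__main__":
    # Placeholder directory; prints nothing if every part is within the limit.
    for path, size in parts_over_limit("./split_output", 5 * 1024**3):
        print(f"{path}: {size} bytes exceeds the limit")
```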
Greedy Algorithm
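Files are added to the current part until the next one would exceed the limit, then a new part is started. This greedy approach needs only a single pass over the archive and preserves the original file order, but it does not try to minimize the number of parts.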
In-Memory Processing
Files are read entirely into memory (`zip_ref.read(file)`), which is suitable for most dataset files but may not work for extremely large individual files exceeding available RAM.
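Where a single member is too large to hold in memory, one possible alternative is to stream it between archives in chunks. The sketch below is an assumption about how that could look, not something the tool does; `copy_member_streaming` is a hypothetical helper built on the standard `ZipFile.open` and `shutil.copyfileobj` calls, and the demo uses in-memory archives.

```python
import io
import shutil
import zipfile


def copy_member_streaming(
    src: zipfile.ZipFile, dst: zipfile.ZipFile, info: zipfile.ZipInfo
) -> int:
    """Hypothetical helper: copy one member in 1 MiB chunks, return its uncompressed size."""
    # force_zip64=True allows members above 2 GiB, whose size the writer
    # cannot know before the data has been streamed.
    with src.open(info, "r") as reader, dst.open(
        info.filename, "w", force_zip64=True
    ) as writer:
        shutil.copyfileobj(reader, writer, length=1024 * 1024)
    return info.file_size


# Small in-memory demonstration.
src_buf = io.BytesIO()
with zipfile.ZipFile(src_buf, "w") as archive:
    archive.writestr("a.txt", b"hello" * 1000)

dst_buf = io.BytesIO()
with zipfile.ZipFile(src_buf) as src, zipfile.ZipFile(
    dst_buf, "w", zipfile.ZIP_DEFLATED
) as dst:
    copied = sum(copy_member_streaming(src, dst, info) for info in src.infolist())

print(copied)  # total uncompressed bytes copied
```

The size used for partitioning can then come from `ZipInfo.file_size`, which is available from `infolist()` without reading the member's data.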
Compression Preservation
Each member is decompressed on read and recompressed with `ZIP_DEFLATED` on write, whatever compression method the source archive used. DEFLATE matches typical dataset ZIP file conventions.
Limitations
- Individual file size: If a single file exceeds `max_size`, it will create an over-sized part
- Memory usage: Each file is loaded entirely into memory during processing
- No parallelization: Files are processed sequentially
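The first two limitations can be anticipated before splitting by inspecting the archive's metadata, which does not require reading any file data. This is a minimal sketch under that assumption; `find_oversized_members` is a hypothetical helper, not part of the tool.

```python
import io
import zipfile
from typing import List, Tuple


def find_oversized_members(zip_source, max_size: int) -> List[Tuple[str, int]]:
    """Hypothetical helper: list (name, uncompressed size) of members above max_size.

    zip_source may be a path or a file-like object, as accepted by zipfile.ZipFile.
    """
    with zipfile.ZipFile(zip_source, "r") as archive:
        return [
            (info.filename, info.file_size)
            for info in archive.infolist()
            if info.file_size > max_size
        ]


# Small in-memory demonstration with a 100-byte limit.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("small.txt", b"x" * 10)
    archive.writestr("large.bin", b"x" * 500)

print(find_oversized_members(demo, max_size=100))
# → [('large.bin', 500)]
```

Any member reported here would end up in a part larger than `max_size` and would also be loaded into memory in full during the split.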
Related Components
- Principle: Data_Tooling
- Use case: Preparing datasets for Hugging Face Hub upload with 5GB limit
Dependencies
- argparse: Command-line argument parsing
- os: File system operations
- zipfile: ZIP file reading and writing