Principle:EvolvingLMMs Lab Lmms eval Data Tooling
Overview
The Data Tooling principle encompasses utilities and scripts for preparing, processing, and managing datasets used in lmms_eval. These tools handle common data operations like splitting large archives, format conversion, and dataset preparation for hosting platforms.
Core Concepts
Dataset Preparation
Tools should facilitate preparing datasets for distribution:
- Splitting large files to comply with hosting limits
- Format conversion for compatibility
- Archive manipulation for efficient storage
- Metadata generation
Size Management
Many hosting platforms impose file size limits, and tools should split or compress data to stay under them:
- Hugging Face Hub: 5GB per file
- GitHub releases: each file must be under 2 GiB (a hard limit, not a recommendation)
Reproducibility
Data tools should:
- Preserve file integrity
- Use deterministic algorithms
- Document transformations
- Support verification
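One way to support verification is to publish a checksum manifest next to the data. The following is a minimal sketch using only the standard library; the function names `sha256_of_file` and `write_manifest` and the manifest layout are assumptions for illustration, not part of lmms_eval.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(paths, manifest_path):
    """Write one '<digest>  <file name>' line per file (hypothetical layout).

    Entries are sorted by file name so repeated runs produce identical output.
    """
    ordered = sorted(paths, key=lambda p: Path(p).name)
    lines = [f"{sha256_of_file(p)}  {Path(p).name}" for p in ordered]
    Path(manifest_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
```

Reading in chunks keeps memory use flat regardless of file size, and sorting the entries keeps the manifest deterministic.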
Design Principles
Human-Readable Configuration
Tools should accept human-friendly inputs:
- Size specifications with units (GB, MB, KB)
- Natural command-line interfaces
- Clear examples in help text
Idempotency
Operations should be safe to retry:
- Check for existing outputs
- Avoid partial writes
- Clean up on failure
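The three points above can be combined in a single helper. This is a sketch under stated assumptions: `write_atomically` is a hypothetical name, and the pattern shown (temporary file plus rename) is one common way to avoid partial writes, not necessarily what the lmms_eval tools do.

```python
import os
import tempfile


def write_atomically(target_path, data, overwrite=False):
    """Write bytes to target_path; return True if written, False if skipped."""
    if os.path.exists(target_path) and not overwrite:
        return False  # Output already exists, so a retry becomes a no-op.

    directory = os.path.dirname(os.path.abspath(target_path))
    os.makedirs(directory, exist_ok=True)

    # The temporary file lives in the target directory so the final rename
    # never crosses a filesystem boundary.
    fd, temp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        # os.replace overwrites the destination; the rename is atomic on POSIX.
        os.replace(temp_path, target_path)
    except BaseException:
        # Remove the partial file on any failure, then re-raise.
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise
    return True
```

Readers of `target_path` see either the old content or the complete new content, never a half-written file.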
Atomic File Operations
Individual files should remain intact:
- Don't split files across archives
- Process complete files at a time
- Maintain file structure within archives
Clear Output Structure
Output should be predictable:
- Consistent naming conventions
- Sequential part numbering
- Preserve original base names
Implementation Guidelines
Size Parsing
Size parsers should:
- Support standard units (B, KB, MB, GB, TB)
- Handle both integer and float values
- Be case-insensitive
- Fall back to bytes for raw numbers
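A size parser meeting these requirements might look like the sketch below. The name `parse_size` and the choice of binary multiples (1 KB = 1024 bytes) are assumptions of this example; the actual tool may use a different convention, which is why the unit convention should be documented.

```python
import re

# Assumption of this sketch: units are binary multiples (1 KB = 1024 B).
_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}
_SIZE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([A-Za-z]*)\s*$")


def parse_size(text):
    """Convert strings such as '5GB', '2.5gb', or '1024' to a byte count."""
    match = _SIZE_RE.match(str(text))
    if not match:
        raise ValueError(f"invalid size: {text!r}")
    number, unit = match.groups()
    unit = unit.upper() or "B"  # A bare number is treated as bytes.
    if unit not in _UNITS:
        raise ValueError(f"unknown unit in size: {text!r}")
    return int(float(number) * _UNITS[unit])
```

For example, `parse_size("500MB")` yields `500 * 1024**2` under the binary convention assumed here.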
Archive Splitting
Splitting tools should:
- Track cumulative size per output
- Start a new part when adding the next file would exceed the threshold
- Use consistent compression
- Return part counts for verification
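The splitting loop can be sketched with the standard `zipfile` module. This is an illustration, not the code of `get_split_zip.py`: the function name `split_zip`, the `_partN` naming scheme, and the use of each entry's compressed size as the running total are all assumptions. Because ZIP headers are not counted, a part can end up slightly larger than `max_bytes`, so leave some headroom below a hosting limit.

```python
import os
import zipfile


def split_zip(source_path, output_dir, max_bytes):
    """Distribute whole entries of a ZIP across numbered part archives.

    Returns the number of parts written. A single entry larger than
    max_bytes gets a part of its own rather than being cut in two.
    """
    os.makedirs(output_dir, exist_ok=True)
    base = os.path.splitext(os.path.basename(source_path))[0]
    part, part_bytes, part_count = None, 0, 0
    try:
        with zipfile.ZipFile(source_path) as source:
            for info in source.infolist():
                size = info.compress_size  # Captured before writestr updates info.
                # Open a part for the first entry, or when this entry would
                # push a non-empty part over the limit.
                if part is None or (part_bytes > 0 and part_bytes + size > max_bytes):
                    if part is not None:
                        part.close()
                    part_count += 1
                    part_path = os.path.join(output_dir, f"{base}_part{part_count}.zip")
                    part = zipfile.ZipFile(part_path, "w")
                    part_bytes = 0
                # Passing the ZipInfo keeps the entry's name, timestamp, and
                # compression method; the whole entry is read into memory.
                part.writestr(info, source.read(info.filename))
                part_bytes += size
    finally:
        if part is not None:
            part.close()
    return part_count
```

The returned count can be compared against the number of files found in the output directory as a quick verification step.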
Command-Line Interface
CLI tools should:
- Use argparse for consistency
- Provide clear help text with examples
- Validate inputs before processing
- Report progress and completion status
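A parser following these guidelines could be set up as below. The positional layout mirrors the usage shown elsewhere in this page (`dataset.zip ./hf_upload/ --max-size 5GB`), but the argument names `input_zip` and `output_dir` and the `5GB` default are assumptions of this sketch.

```python
import argparse


def build_parser():
    """Build a command-line parser for a hypothetical ZIP splitting tool."""
    parser = argparse.ArgumentParser(
        description="Split a ZIP archive into size-limited parts.",
        # The epilog is printed after the argument help, a natural place for examples.
        epilog="example:\n  python tools/get_split_zip.py dataset.zip ./hf_upload/ --max-size 5GB",
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument("input_zip", help="path to the source ZIP archive")
    parser.add_argument("output_dir", help="directory that receives the parts")
    parser.add_argument(
        "--max-size",
        default="5GB",  # Assumed default, chosen to match the example above.
        help="maximum size per part, e.g. 500MB or 2.5GB (default: %(default)s)",
    )
    return parser
```

`RawDescriptionHelpFormatter` keeps the line breaks in the epilog so the example command is shown exactly as written.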
Error Handling
Tools should:
- Validate input files exist
- Create output directories as needed
- Provide clear error messages
- Exit with appropriate status codes
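The checks above can be gathered in one function that returns a process exit status. This is a sketch: the name `validate_and_prepare` and the specific status codes (2 for bad input, 1 for an environment failure) are assumptions, not a convention taken from lmms_eval.

```python
import os
import sys
import zipfile


def validate_and_prepare(input_zip, output_dir):
    """Check inputs, create the output directory, and return an exit status."""
    if not os.path.isfile(input_zip):
        print(f"error: input file not found: {input_zip}", file=sys.stderr)
        return 2
    if not zipfile.is_zipfile(input_zip):
        print(f"error: not a ZIP archive: {input_zip}", file=sys.stderr)
        return 2
    try:
        os.makedirs(output_dir, exist_ok=True)
    except OSError as exc:
        print(f"error: cannot create {output_dir}: {exc}", file=sys.stderr)
        return 1
    return 0  # Safe to start processing.
```

A script entry point can then call `sys.exit(validate_and_prepare(...))` so that shell scripts and CI jobs can react to a non-zero status.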
Key Operations
ZIP File Splitting
Split large ZIP archives into size-limited parts:
- Read source ZIP
- Track file sizes
- Distribute files across output ZIPs
- Maintain compression settings
Format Conversion
Convert between dataset formats:
- Preserve data integrity
- Handle metadata appropriately
- Support common formats (JSON, Parquet, CSV)
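As a small illustration of a conversion step, the sketch below turns a JSON array of flat records into CSV using only the standard library. The function name `json_to_csv` and the assumed input shape (a list of flat dictionaries) are hypothetical; Parquet is left out because it needs a third-party library.

```python
import csv
import json


def json_to_csv(json_path, csv_path):
    """Convert a JSON array of flat objects to CSV; return the record count."""
    with open(json_path, encoding="utf-8") as handle:
        records = json.load(handle)

    # Collect column names in first-seen order so the header is deterministic.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        # restval fills in columns that a given record does not have.
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(records)
    return len(records)
```

Comparing the returned count with the number of rows read back from the CSV is a simple integrity check.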
Archive Extraction
Selectively extract archive contents:
- Filter by pattern
- Preserve directory structure
- Handle nested archives
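Pattern-based extraction can be sketched with `zipfile` and `fnmatch`. The function name `extract_matching` is an assumption, and this example does not descend into nested archives; those would need a separate pass over any extracted `.zip` members.

```python
import fnmatch
import zipfile


def extract_matching(archive_path, pattern, dest_dir):
    """Extract members whose names match a glob pattern; return their names."""
    extracted = []
    with zipfile.ZipFile(archive_path) as archive:
        for name in archive.namelist():
            # fnmatchcase is case-sensitive on every platform; note that "*"
            # also matches "/" here, unlike shell globbing.
            if fnmatch.fnmatchcase(name, pattern):
                # extract() recreates the member's directory path under dest_dir.
                archive.extract(name, dest_dir)
                extracted.append(name)
    return extracted
```

For instance, a pattern such as `images/*.png` selects PNG members under `images/` at any depth while keeping their relative paths on disk.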
Usage Patterns
Splitting for Hugging Face
# Split dataset for HF Hub (5GB limit)
python tools/get_split_zip.py dataset.zip ./hf_upload/ --max-size 5GB
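# Upload each part (REPO_ID is a placeholder; set it to your dataset repository)
REPO_ID="your-username/your-dataset"
for part in ./hf_upload/*.zip; do
    huggingface-cli upload "$REPO_ID" "$part" "$(basename "$part")" --repo-type dataset
done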
Size Specification
# Various size specifications
python tools/get_split_zip.py large.zip out/ --max-size 2GB
python tools/get_split_zip.py medium.zip out/ --max-size 500MB
python tools/get_split_zip.py small.zip out/ --max-size 1024 # bytes
Best Practices
For Archive Tools
- Read files into memory only when necessary
- Use streaming for very large files
- Preserve original compression settings
- Generate checksums for verification
For CLI Design
- Provide sensible defaults
- Include example usage
- Validate inputs early
- Show progress for long operations
For Output Management
- Create directories automatically
- Use predictable naming schemes
- Avoid overwriting without confirmation
- Report what was created
For Size Handling
- Support fractional values (e.g., "2.5GB")
- Document unit conventions
- Handle edge cases (empty files, single large file)
Common Use Cases
Preparing Dataset for Upload
1. Split large dataset ZIP into parts
2. Verify each part is under limit
3. Upload parts to hosting platform
4. Document part assembly process
Dataset Format Conversion
1. Load source format
2. Transform to target schema
3. Write target format
4. Verify data integrity
Selective Dataset Extraction
1. Identify needed subset
2. Extract relevant files
3. Repackage in smaller archive
4. Verify completeness
Error Handling Strategies
Input Validation
- Check file existence before processing
- Verify file format/type
- Validate size limits are reasonable
Processing Errors
- Handle corrupt archives gracefully
- Report specific file causing issues
- Clean up partial outputs on failure
Output Validation
- Verify all files processed
- Check output sizes
- Validate archive integrity
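These three checks can be run together after a split. The sketch below assumes the parts are ZIP files produced from a single source ZIP; `validate_parts` is a hypothetical name, and returning a list of problem strings is a design choice of this example.

```python
import os
import zipfile


def validate_parts(source_path, part_paths, max_bytes):
    """Return a list of problem descriptions; an empty list means all checks passed."""
    problems = []
    with zipfile.ZipFile(source_path) as source:
        expected = sorted(source.namelist())

    found = []
    for part_path in part_paths:
        # Check the on-disk size, which includes ZIP headers.
        if os.path.getsize(part_path) > max_bytes:
            problems.append(f"{part_path}: larger than {max_bytes} bytes")
        with zipfile.ZipFile(part_path) as part:
            # testzip() reads every member and returns the first one with a
            # bad CRC or header, or None if all members are fine.
            bad_member = part.testzip()
            if bad_member is not None:
                problems.append(f"{part_path}: corrupt member {bad_member}")
            found.extend(part.namelist())

    # Every source member should appear exactly once across the parts.
    if sorted(found) != expected:
        problems.append("member names in parts do not match the source archive")
    return problems
```

Collecting all problems before reporting lets a user fix several issues in one pass instead of rerunning after each failure.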
Performance Considerations
Memory Usage
- Process files incrementally when possible
- Monitor memory for large file operations
- Consider streaming for very large datasets
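For archive members too large to hold in memory, an entry can be copied between archives in fixed-size buffers. This is a sketch with a hypothetical helper name, `copy_member_streaming`; note that the copied entry is recompressed with the destination archive's compression setting rather than the original entry's.

```python
import shutil
import zipfile


def copy_member_streaming(source, info, dest, buffer_size=1024 * 1024):
    """Copy one member between two open ZipFile objects in buffer-sized chunks.

    source must be open for reading and dest for writing.
    """
    # force_zip64 is needed because the final size is unknown when the
    # destination entry is opened, and it may exceed 2 GiB.
    with source.open(info) as reader, \
            dest.open(info.filename, "w", force_zip64=True) as writer:
        shutil.copyfileobj(reader, writer, buffer_size)
```

Peak memory stays close to `buffer_size` regardless of the member's size, at the cost of decompressing and recompressing the data.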
Disk I/O
- Minimize read/write operations
- Use appropriate buffer sizes
- Consider compression trade-offs
Parallelization
- Process independent files concurrently
- Balance CPU and I/O utilization
- Consider diminishing returns
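Independent archives are a natural unit for concurrency. The sketch below checks several ZIP parts in parallel with a thread pool; the function names and the default of four workers are assumptions, and the best worker count depends on the storage device and CPU.

```python
import zipfile
from concurrent.futures import ThreadPoolExecutor


def first_bad_member(path):
    """Return the name of the first corrupt member in a ZIP, or None."""
    with zipfile.ZipFile(path) as archive:
        return archive.testzip()


def check_parts(paths, max_workers=4):
    """Check several archives concurrently; map each path to its result."""
    paths = list(paths)  # Materialize so the input can be iterated twice.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with paths.
        return dict(zip(paths, pool.map(first_bad_member, paths)))
```

Threads suit this workload because most of the time is spent in file reads and decompression; past a handful of workers, disk throughput usually becomes the limit.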
Related Principles
- Dataset_Preparation: Preparing datasets for evaluation
- Environment_Setup: Setting up data dependencies
Implementations
- Get_Split_Zip: ZIP file splitting utility
- Implementation:EvolvingLMMs_Lab_Lmms_eval_Get_Split_Zip