Principle:EvolvingLMMs Lab Lmms eval Data Tooling

From Leeroopedia

Overview

The Data Tooling principle encompasses utilities and scripts for preparing, processing, and managing datasets used in lmms_eval. These tools handle common data operations like splitting large archives, format conversion, and dataset preparation for hosting platforms.

Core Concepts

Dataset Preparation

Tools should facilitate preparing datasets for distribution:

  • Splitting large files to comply with hosting limits
  • Format conversion for compatibility
  • Archive manipulation for efficient storage
  • Metadata generation

Size Management

Many hosting platforms impose file size limits, and tools should split or compress data to stay under them:

  • Hugging Face Hub: 5GB per file
  • GitHub releases: each file must be under 2 GiB

Reproducibility

Data tools should:

  • Preserve file integrity
  • Use deterministic algorithms
  • Document transformations
  • Support verification

Design Principles

Human-Readable Configuration

Tools should accept human-friendly inputs:

  • Size specifications with units (GB, MB, KB)
  • Natural command-line interfaces
  • Clear examples in help text

Idempotency

Operations should be safe to retry:

  • Check for existing outputs
  • Avoid partial writes
  • Clean up on failure
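
One common way to meet these three points is to write to a temporary file and rename it into place only once the write has succeeded. The sketch below is illustrative; `write_atomically` is a hypothetical helper, not part of lmms_eval.

```python
import os
import tempfile


def write_atomically(path, data, overwrite=False):
    """Write bytes to `path` so that a retry never sees a partial file.

    Returns True if the file was written, False if it already existed.
    """
    if os.path.exists(path) and not overwrite:
        return False  # existing output: nothing to do on retry
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # The temporary file lives in the target directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.replace(tmp_path, path)  # atomic when source and target share a filesystem
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return True
```

Calling the helper twice with the same path leaves the first result untouched unless `overwrite=True` is passed.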

Atomic File Operations

Individual files should remain intact:

  • Don't split files across archives
  • Process complete files at a time
  • Maintain file structure within archives

Clear Output Structure

Output should be predictable:

  • Consistent naming conventions
  • Sequential part numbering
  • Preserve original base names
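
As a small illustration of these conventions, the helper below derives part names from the source file name. The `_partNNN` pattern and the `part_name` function are assumptions made for this sketch; the source does not specify the naming scheme the real tool uses.

```python
import os


def part_name(source, index, width=3):
    """Build a part file name that keeps the original base name and extension."""
    base, ext = os.path.splitext(os.path.basename(source))
    # Zero-padded numbers keep alphabetical order equal to numeric order.
    return f"{base}_part{index:0{width}d}{ext}"
```

With this scheme, `part_name("data/dataset.zip", 1)` yields `dataset_part001.zip`.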

Implementation Guidelines

Size Parsing

Size parsers should:

  • Support standard units (B, KB, MB, GB, TB)
  • Handle both integer and float values
  • Be case-insensitive
  • Fall back to bytes for raw numbers
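
A size parser that follows these points might look like the sketch below. It assumes binary multiples (1 KB = 1024 bytes); the source does not say which convention the real tool uses, and `parse_size` is a hypothetical name.

```python
import re

# Assumption: binary multiples (1 KB = 1024 bytes).
_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}
_SIZE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([A-Za-z]*)\s*$")


def parse_size(text):
    """Convert a string such as '2.5GB', '500mb' or '1024' to a byte count."""
    match = _SIZE_RE.match(text)
    if match is None:
        raise ValueError(f"invalid size: {text!r}")
    number, unit = match.groups()
    unit = unit.upper() or "B"  # raw numbers fall back to bytes
    if unit not in _UNITS:
        raise ValueError(f"unknown unit in size: {text!r}")
    return int(float(number) * _UNITS[unit])
```

Under the binary-unit assumption, `parse_size("2.5GB")` returns 2684354560 and `parse_size("1024")` returns 1024.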

Archive Splitting

Splitting tools should:

  • Track cumulative size per output
  • Start a new part when adding the next file would exceed the threshold
  • Use consistent compression
  • Return part counts for verification
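
The guidelines above can be sketched with the standard `zipfile` module. This is not the implementation of `tools/get_split_zip.py`; the function name and the `_partNNN` naming are assumptions. The sketch counts only compressed payload bytes, so ZIP headers can push a part slightly over the threshold, and it reads each member fully into memory.

```python
import os
import zipfile


def split_zip(source, out_dir, max_bytes):
    """Split `source` into numbered ZIP parts; return the number of parts."""
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.splitext(os.path.basename(source))[0]
    part_count = 0
    current = None
    current_size = 0
    try:
        with zipfile.ZipFile(source) as src:
            for info in src.infolist():
                size = info.compress_size
                # Start a new part when this member would push the current
                # one past the limit. An oversized member gets its own part.
                if current is None or (current_size and current_size + size > max_bytes):
                    if current is not None:
                        current.close()
                    part_count += 1
                    name = f"{base}_part{part_count:03d}.zip"
                    current = zipfile.ZipFile(os.path.join(out_dir, name), "w")
                    current_size = 0
                # Passing the ZipInfo keeps each member's compression method.
                current.writestr(info, src.read(info.filename))
                current_size += size
    finally:
        if current is not None:
            current.close()
    return part_count
```

Each member is copied whole, so no file is ever divided between two parts.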

Command-Line Interface

CLI tools should:

  • Use argparse for consistency
  • Provide clear help text with examples
  • Validate inputs before processing
  • Report progress and completion status

Error Handling

Tools should:

  • Validate input files exist
  • Create output directories as needed
  • Provide clear error messages
  • Exit with appropriate status codes
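
A minimal `argparse` skeleton that follows the CLI and error-handling guidelines could look like this. The positional arguments and `--max-size` mirror the usage shown on this page; the default value, help text, and function names are assumptions, and the actual splitting step is left out.

```python
import argparse
import os
import sys


def build_parser():
    parser = argparse.ArgumentParser(
        description="Split a ZIP archive into size-limited parts.",
        epilog="example: get_split_zip.py dataset.zip ./out/ --max-size 5GB",
    )
    parser.add_argument("source", help="path to the source ZIP file")
    parser.add_argument("out_dir", help="directory for the output parts")
    parser.add_argument(
        "--max-size",
        default="5GB",  # assumed default, not confirmed by the source
        help="maximum size per part, e.g. 500MB or 2.5GB (default: %(default)s)",
    )
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    # Validate inputs before doing any work.
    if not os.path.isfile(args.source):
        print(f"error: input file not found: {args.source}", file=sys.stderr)
        return 1
    os.makedirs(args.out_dir, exist_ok=True)
    print(f"splitting {args.source} into {args.out_dir} (max {args.max_size} per part)")
    return 0
```

A script built on this would typically end with `sys.exit(main())` so that the return value becomes the process exit status.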

Key Operations

ZIP File Splitting

Split large ZIP archives into size-limited parts:

  • Read source ZIP
  • Track file sizes
  • Distribute files across output ZIPs
  • Maintain compression settings

Format Conversion

Convert between dataset formats:

  • Preserve data integrity
  • Handle metadata appropriately
  • Support common formats (JSON, Parquet, CSV)
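
As one example of such a conversion, the sketch below turns a JSON array of flat objects into CSV using only the standard library. The function name is hypothetical, nested values are not handled, and Parquet would need a third-party library that is not shown here.

```python
import csv
import json


def json_to_csv(json_path, csv_path):
    """Convert a JSON array of flat objects to CSV; return the row count."""
    with open(json_path, encoding="utf-8") as src:
        records = json.load(src)
    # Collect columns in first-seen order so the output is deterministic.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    with open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)  # missing keys become empty cells
    return len(records)
```

Comparing the returned row count with the number of source records is a cheap first integrity check.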

Archive Extraction

Selectively extract archive contents:

  • Filter by pattern
  • Preserve directory structure
  • Handle nested archives
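
Pattern-based extraction that keeps the directory structure can be sketched as follows. `extract_matching` is a hypothetical helper; it uses glob-style patterns in which `*` also matches `/`, and it does not descend into nested archives.

```python
import fnmatch
import zipfile


def extract_matching(archive, pattern, dest):
    """Extract members whose names match `pattern`; return their names."""
    with zipfile.ZipFile(archive) as zf:
        # fnmatchcase keeps matching case-sensitive on every platform.
        names = [n for n in zf.namelist() if fnmatch.fnmatchcase(n, pattern)]
        # extractall recreates each member's directory path under dest.
        zf.extractall(dest, members=names)
    return names
```

For instance, a pattern such as `images/*.png` selects only PNG files stored under `images/`.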

Usage Patterns

Splitting for Hugging Face

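# Split dataset for HF Hub (5GB limit)
python tools/get_split_zip.py dataset.zip ./hf_upload/ --max-size 5GB

# Upload each part (REPO_ID is a placeholder for your dataset repository)
REPO_ID="your-username/your-dataset"
for part in ./hf_upload/*.zip; do
    huggingface-cli upload "$REPO_ID" "$part" "$(basename "$part")" --repo-type dataset
done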

Size Specification

# Various size specifications
python tools/get_split_zip.py large.zip out/ --max-size 2GB
python tools/get_split_zip.py medium.zip out/ --max-size 500MB
python tools/get_split_zip.py small.zip out/ --max-size 1024  # bytes

Best Practices

For Archive Tools

  • Read files into memory only when necessary
  • Use streaming for very large files
  • Preserve original compression settings
  • Generate checksums for verification
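
Streaming and checksum generation combine naturally: the file is hashed in fixed-size chunks, so memory use stays constant regardless of file size. The helper name and the 1 MiB chunk size are assumptions for illustration.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops at the empty bytes object at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Publishing these digests next to the parts lets downstream users confirm that their downloads are intact.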

For CLI Design

  • Provide sensible defaults
  • Include example usage
  • Validate inputs early
  • Show progress for long operations

For Output Management

  • Create directories automatically
  • Use predictable naming schemes
  • Avoid overwriting without confirmation
  • Report what was created

For Size Handling

  • Support fractional values (e.g., "2.5GB")
  • Document unit conventions
  • Handle edge cases (empty files, single large file)

Common Use Cases

Preparing Dataset for Upload

  1. Split large dataset ZIP into parts
  2. Verify each part is under limit
  3. Upload parts to hosting platform
  4. Document part assembly process

Dataset Format Conversion

  1. Load source format
  2. Transform to target schema
  3. Write target format
  4. Verify data integrity

Selective Dataset Extraction

  1. Identify needed subset
  2. Extract relevant files
  3. Repackage in smaller archive
  4. Verify completeness

Error Handling Strategies

Input Validation

  • Check file existence before processing
  • Verify file format/type
  • Validate size limits are reasonable

Processing Errors

  • Handle corrupt archives gracefully
  • Report specific file causing issues
  • Clean up partial outputs on failure

Output Validation

  • Verify all files processed
  • Check output sizes
  • Validate archive integrity
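
These three checks can be combined into a single verification pass over the output parts. The sketch assumes the parts are ZIP files produced from one source ZIP; `verify_parts` is a hypothetical name.

```python
import os
import zipfile


def verify_parts(source, parts, max_bytes):
    """Return a list of problem descriptions; an empty list means success."""
    problems = []
    with zipfile.ZipFile(source) as src:
        expected = sorted(src.namelist())
    found = []
    for part in parts:
        size = os.path.getsize(part)
        if size > max_bytes:
            problems.append(f"{part}: {size} bytes exceeds limit of {max_bytes}")
        with zipfile.ZipFile(part) as zf:
            corrupt = zf.testzip()  # name of the first bad member, or None
            if corrupt is not None:
                problems.append(f"{part}: corrupt member {corrupt}")
            found.extend(zf.namelist())
    # Every source member should appear exactly once across all parts.
    if sorted(found) != expected:
        problems.append("members in the parts do not match the source archive")
    return problems
```

A CLI tool could print the returned problems and exit with a non-zero status when the list is not empty.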

Performance Considerations

Memory Usage

  • Process files incrementally when possible
  • Monitor memory for large file operations
  • Consider streaming for very large datasets

Disk I/O

  • Minimize read/write operations
  • Use appropriate buffer sizes
  • Consider compression trade-offs

Parallelization

  • Process independent files concurrently
  • Balance CPU and I/O utilization
  • Consider diminishing returns
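
For I/O-bound work on independent files, a thread pool is often enough. The sketch below hashes several files concurrently; the function names and the default of four workers are assumptions, and the best worker count depends on the storage and CPU involved.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def _sha256(path):
    """Hash one file in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_many(paths, max_workers=4):
    """Hash independent files concurrently; return {path: hex digest}."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each path with its digest.
        return dict(zip(paths, pool.map(_sha256, paths)))
```

Raising `max_workers` beyond what the disk can serve usually yields little further speedup, which is the diminishing-returns point noted above.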
