Principle:EvolvingLMMs Lab Lmms eval Regression Testing Principle
Overview
The Regression Testing principle defines the methodology for automated performance comparison across code changes in lmms_eval. It lets developers verify that a change does not degrade model accuracy, and shows whether it improves speed or capabilities.
Core Concepts
Baseline Comparison
All regression tests compare against a baseline:
- Current branch serves as baseline
- Comparison branches are evaluated in sequence
- Differences are computed and highlighted
- Both improvements and regressions are detected
Multi-Dimensional Testing
Regression tests evaluate multiple aspects:
- Accuracy: Model performance on tasks
- Runtime: Execution speed
- Memory: Resource utilization (a potential additional dimension)
- Consistency: Result reproducibility
Branch-Based Testing
Tests operate on git branches:
- Automatic branch switching
- Clean state between tests
- Return to original branch after completion
- Suitable for CI/CD integration
Statistical Reporting
Results are presented in structured format:
- Markdown tables for readability
- Percentage values for interpretability
- Difference highlighting (bold for improvements)
- Runtime comparison as percentage of baseline
Design Principles
Automated Workflow
The testing process should:
- Be fully automated from command invocation to result reporting
- Require minimal manual intervention
- Handle branch switching automatically
- Clean up state after completion
Comprehensive Coverage
Tests should cover:
- Representative task selection (single-image, multi-image, video)
- Multiple models when appropriate
- Key performance metrics for each task
- Both accuracy and speed
Clear Communication
Results should:
- Be easy to interpret at a glance
- Be suitable for GitHub issues/PRs
- Highlight significant changes
- Include both absolute values and differences
Reproducible Execution
Test runs should:
- Be deterministic when possible (temperature=0)
- Use a consistent hardware configuration
- Document all parameters
- Save full results for analysis
Implementation Guidelines
Baseline Establishment
The baseline should:
- Run before any branch switching
- Use current branch state
- Capture full result metadata
- Measure runtime accurately
Branch Evaluation
For each comparison branch:
1. Switch to the branch
2. Run identical evaluation
3. Capture results and runtime
4. Return to baseline branch
5. Compare against baseline
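The branch evaluation steps above can be sketched as a context manager that always restores the starting branch. This is a minimal sketch, not the script's actual code: `run_git` and `on_branch` are hypothetical helper names, and the `git` parameter exists only so the logic can be exercised without a real repository.

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator, List


def run_git(args: List[str]) -> str:
    """Run a git command and return its stripped stdout (raises on failure)."""
    result = subprocess.run(
        ["git", *args], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


@contextmanager
def on_branch(
    branch: str, git: Callable[[List[str]], str] = run_git
) -> Iterator[None]:
    """Check out `branch`, then always return to the starting branch."""
    original = git(["rev-parse", "--abbrev-ref", "HEAD"])
    git(["checkout", branch])
    try:
        yield
    finally:
        # Runs even if the evaluation inside the `with` block raises.
        git(["checkout", original])
```

Wrapping each evaluation in `with on_branch(name):` keeps the restore step in a `finally` clause, so a failed evaluation still leaves the repository on the original branch.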
Metric Selection
Choose metrics that:
- Represent task performance accurately
- Are consistent across runs
- Align with published benchmarks
- Can be extracted automatically
Result Formatting
Format results as:
- Markdown tables for GitHub
- Percentage values for clarity
- Bold formatting for positive changes
- Clear branch identification
Key Operations
Model Evaluation
Run evaluation with:
- Distributed inference (accelerate)
- Fixed random seed for reproducibility
- Consistent generation parameters
- Timestamped output paths
Metric Extraction
Extract task-specific metrics:
- Map tasks to primary metrics
- Handle missing results gracefully
- Support multiple metric types
- Return zero for unavailable metrics
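A minimal sketch of the extraction rules above. The mapping contents, the metric key names, and the top-level `results` key are illustrative assumptions; the actual names depend on each task's configuration and on the result file layout.

```python
from typing import Any, Dict

# Hypothetical task -> primary metric mapping (names are placeholders).
PRIMARY_METRIC: Dict[str, str] = {
    "ocrbench": "ocrbench_accuracy",
    "mmmu_val": "mmmu_acc",
    "ai2d": "exact_match",
}


def extract_metric(results: Dict[str, Any], task: str) -> float:
    """Return the task's primary metric, or 0.0 if it is unavailable."""
    task_results = results.get("results", {}).get(task, {})
    metric_name = PRIMARY_METRIC.get(task)
    value = task_results.get(metric_name) if metric_name else None
    # Anything that is not a number (missing, None, a string) counts as unavailable.
    return float(value) if isinstance(value, (int, float)) else 0.0
```

Returning zero keeps the summary table complete, but it also makes a missing result look like a real score of zero, so a failed run is worth flagging separately.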
Difference Calculation
Compare results:
- Compute absolute differences
- Convert to percentages
- Determine improvement vs. regression
- Format with appropriate precision
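The difference calculation can be sketched as follows, assuming scores are stored as fractions in the range 0 to 1 and differences are reported in percentage points. The function name `format_diff` is hypothetical.

```python
def format_diff(baseline: float, branch: float, precision: int = 2) -> str:
    """Format branch-minus-baseline in percentage points; bold if improved."""
    diff_points = (branch - baseline) * 100
    text = f"{diff_points:+.{precision}f}"
    # Positive differences are improvements and get Markdown bold.
    return f"**{text}**" if diff_points > 0 else text
```

For example, `format_diff(0.50, 0.525)` yields `**+2.50**`, while `format_diff(0.50, 0.40)` yields `-10.00`.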
Runtime Analysis
Analyze execution time:
- Measure total runtime per branch
- Compute percentage relative to baseline
- Identify performance optimizations
- Detect performance regressions
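As a sketch of the runtime analysis, the function below turns per-branch runtimes into Markdown rows with a percentage relative to the baseline. The function name and the `baseline` dictionary key are assumptions made for illustration.

```python
from typing import Dict, List


def runtime_rows(runtimes: Dict[str, float], baseline: str = "baseline") -> List[str]:
    """Build Markdown table rows: branch, runtime in seconds, % of baseline."""
    base = runtimes[baseline]
    if base <= 0:
        raise ValueError("baseline runtime must be positive")
    rows = ["|branch|runtime|%|", "|--|--|--|"]
    # Dicts preserve insertion order, so rows follow execution order.
    for branch, seconds in runtimes.items():
        rows.append(f"|{branch}|{seconds:.1f}s|{seconds / base * 100:.2f}%|")
    return rows
```

A value below 100% means the branch ran faster than the baseline; a value above 100% signals a runtime regression.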
Usage Patterns
Pre-Merge Regression Testing
```
# Test feature branch before merging
git checkout main
python tools/regression.py --branches feature-branch
# Review output tables in terminal
# Post to PR for review
```
Multi-Branch Comparison
```
# Compare several optimization approaches
python tools/regression.py \
    --branches opt-v1,opt-v2,opt-v3 \
    --tasks ocrbench,mmmu_val \
    --limit 100
```
Quick Smoke Test
```
# Fast test with small limit
python tools/regression.py \
    --branches experimental \
    --tasks ai2d \
    --limit 10
```
Evaluation Configuration
Distributed Inference
Use accelerate for parallelization:
- Multiple processes (typically 8)
- Fixed port to avoid conflicts
- Proper GPU allocation
- Synchronization between processes
Task Selection
Include diverse tasks:
- Single-image: OCRBench, MMMU, AI2D
- Multi-image: MUIRBench
- Video: VideoMME
- Representative of use cases
Generation Parameters
Use consistent settings:
- temperature=0 for deterministic output
- Fixed batch size
- Consistent max_new_tokens
- Same model arguments
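The configuration above could be assembled into a launch command along these lines. This is a sketch under assumptions: `build_eval_command`, the default port, and the output directory are hypothetical, and the `accelerate launch` and `lmms_eval` flag names should be checked against the installed versions.

```python
import datetime
from typing import List, Optional


def build_eval_command(
    model: str,
    model_args: str,
    tasks: List[str],
    limit: Optional[int] = None,
    num_processes: int = 8,
    port: int = 12345,  # fixed port (assumed value) to avoid conflicts
    output_root: str = "./logs",  # assumed output directory
) -> List[str]:
    """Build an argument list for a distributed, deterministic evaluation."""
    stamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    cmd = [
        "accelerate", "launch",
        f"--num_processes={num_processes}",
        f"--main_process_port={port}",
        "-m", "lmms_eval",
        "--model", model,
        "--model_args", model_args,
        "--tasks", ",".join(tasks),
        "--batch_size", "1",
        "--gen_kwargs", "temperature=0",
        "--output_path", f"{output_root}/{stamp}",  # timestamped output path
    ]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    return cmd
```

Building the command as a list rather than a shell string avoids quoting problems when it is passed to `subprocess.run`, and it guarantees every branch is evaluated with the same arguments.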
Output Format
Performance Table
Structure:
```
|task|model1|model2|...|
|--|--|--|...|
|task_name (baseline)|score1|score2|...|
|task_name (branch)|score1|score2|...|
|task_name (diff)|diff1|diff2|...|
```
Characteristics:
- One row per task per branch
- Difference row after each branch
- Bold for positive differences
- Clear branch identification
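The row layout described above can be sketched as a function that emits the baseline, branch, and difference rows for one task. It assumes scores are fractions in the range 0 to 1, keyed by model name; `task_rows` is a hypothetical name.

```python
from typing import Dict, List


def task_rows(
    task: str,
    models: List[str],
    baseline: Dict[str, float],
    branch_name: str,
    branch: Dict[str, float],
) -> List[str]:
    """Return three Markdown rows: baseline scores, branch scores, differences."""

    def row(label: str, cells: List[str]) -> str:
        return f"|{label}|" + "|".join(cells) + "|"

    diffs = []
    for model in models:
        points = (branch[model] - baseline[model]) * 100
        cell = f"{points:+.2f}"
        diffs.append(f"**{cell}**" if points > 0 else cell)  # bold improvements

    return [
        row(f"{task} (baseline)", [f"{baseline[m] * 100:.2f}" for m in models]),
        row(f"{task} ({branch_name})", [f"{branch[m] * 100:.2f}" for m in models]),
        row(f"{task} (diff)", diffs),
    ]
```

Iterating over an explicit `models` list fixes the column order, so every row lines up with the header regardless of dictionary ordering.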
Runtime Table
Structure:
```
|branch|runtime|%|
|--|--|--|
|baseline|1234.5s|100%|
|branch1|1189.2s|96.33%|
```
Characteristics:
- Absolute runtime in seconds
- Percentage relative to baseline
- Sorted by execution order
Best Practices
For Test Design
- Include representative task mix
- Use meaningful sample sizes
- Balance speed and coverage
- Document expected behaviors
For Result Interpretation
- Consider statistical significance
- Look for patterns across tasks
- Investigate unexpected regressions
- Verify improvements are real
For CI Integration
- Set appropriate thresholds
- Allow for small variations
- Cache models/datasets
- Fail on significant regressions
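A CI gate following these practices might look like the sketch below. The tolerance value and the function name are assumptions; a suitable threshold depends on the task and the sample size.

```python
from typing import Dict


def find_regressions(
    baseline: Dict[str, float],
    branch: Dict[str, float],
    tolerance: float = 0.005,  # assumed allowance for run-to-run variation
) -> Dict[str, float]:
    """Return {task: score change} for tasks that dropped by more than `tolerance`."""
    regressions = {}
    for task, base_score in baseline.items():
        # A task missing from the branch results is treated as a score of zero.
        change = branch.get(task, 0.0) - base_score
        if change < -tolerance:
            regressions[task] = change
    return regressions
```

A CI job can then exit with a non-zero status when the returned dictionary is non-empty, which fails the build only on drops larger than the allowed variation.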
For Performance Testing
- Run on consistent hardware
- Avoid interference from other processes
- Warm up before timing
- Repeat runs for stability
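Warm-up and repeated timing can be sketched with the standard library alone. `time_runs` is a hypothetical helper, and taking the median is one reasonable choice for reducing the effect of outliers, not something the source prescribes.

```python
import statistics
import time
from typing import Callable


def time_runs(fn: Callable[[], object], warmup: int = 1, repeats: int = 3) -> float:
    """Return the median wall-clock time of `fn` in seconds, after warm-up calls."""
    for _ in range(warmup):
        fn()  # warm caches and lazy initialization; not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

For full evaluations the repeat count is limited by cost, so even two or three runs per branch give a useful sense of run-to-run noise.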
Error Handling
Branch Switching Failures
- Detect failed checkouts
- Return empty results for failed branches
- Don't abort entire run
- Report which branches failed
Evaluation Failures
- Catch subprocess errors
- Log failure details
- Continue with remaining branches
- Return zero for failed tasks
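The evaluation failure handling above can be sketched as a wrapper that reports a failure instead of raising, so the caller can continue with the next branch. `run_evaluation` is a hypothetical name.

```python
import subprocess
import sys
from typing import List


def run_evaluation(cmd: List[str]) -> bool:
    """Run one evaluation command; return False instead of raising on failure."""
    try:
        subprocess.run(cmd, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as exc:
        # Log the exit code and stderr so the failure can be diagnosed later.
        print(f"Evaluation failed (exit {exc.returncode}): {exc.stderr}", file=sys.stderr)
        return False
    except FileNotFoundError as exc:
        print(f"Command not found: {exc}", file=sys.stderr)
        return False
    return True
```

The boolean result lets the calling loop record the branch as failed and move on, rather than aborting the whole comparison.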
Result Parsing Failures
- Handle missing JSON files
- Gracefully handle malformed results
- Report parsing errors
- Use zero for missing metrics
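A defensive loader for result files might look like this sketch. `load_results` is a hypothetical name; it returns an empty dictionary for missing or malformed files so that later metric lookups fall back to zero.

```python
import json
import sys
from typing import Any, Dict


def load_results(path: str) -> Dict[str, Any]:
    """Load a results JSON file; return {} if it is missing or malformed."""
    try:
        with open(path, encoding="utf-8") as handle:
            data = json.load(handle)
    except (OSError, json.JSONDecodeError) as exc:
        # Report the parsing error but let the run continue.
        print(f"Could not read results from {path}: {exc}", file=sys.stderr)
        return {}
    # A valid JSON file whose top level is not an object is also unusable.
    return data if isinstance(data, dict) else {}
```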
Integration Points
Git Integration
- Query current branch
- Checkout comparison branches
- Restore original branch
- Handle dirty working trees
Accelerate Integration
- Launch with correct parameters
- Manage process lifecycle
- Handle port allocation
- Collect distributed results
Task Registry
- Support "all_tasks" keyword
- Pattern matching for task selection
- Validate task existence
- Map to evaluation commands
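Task selection with an "all_tasks" keyword and pattern matching can be sketched with the standard `fnmatch` module. The function name and the use of shell-style wildcards are assumptions; the actual script may match tasks differently.

```python
import fnmatch
from typing import List


def select_tasks(requested: str, available: List[str]) -> List[str]:
    """Expand a comma-separated task spec against the list of known tasks."""
    if requested == "all_tasks":
        return sorted(available)
    selected: List[str] = []
    for pattern in requested.split(","):
        matches = fnmatch.filter(available, pattern.strip())
        if not matches:
            # Validate early so a typo fails before any evaluation starts.
            raise ValueError(f"Unknown task or pattern: {pattern!r}")
        selected.extend(m for m in matches if m not in selected)
    return selected
```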
Limitations and Considerations
Model Alignment
- Assumes model_types align with model list
- Index-based mapping can be fragile
- Consider more robust configuration
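One more robust configuration, sketched under assumptions, keeps each model and its type in a single record instead of two parallel lists. `ModelSpec` and `pair_models` are hypothetical names, not part of lmms_eval.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical record that keeps a model and its type together."""
    name: str
    model_type: str


def pair_models(names: List[str], model_types: List[str]) -> List[ModelSpec]:
    """Pair names with types, failing loudly if the lists are misaligned."""
    if len(names) != len(model_types):
        raise ValueError(
            f"{len(names)} models but {len(model_types)} model types"
        )
    return [ModelSpec(n, t) for n, t in zip(names, model_types)]
```

The length check catches a missing entry, although it cannot detect two entries listed in the wrong order; defining `ModelSpec` records directly avoids that case as well.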
Single Metric per Task
- Only primary metric is compared
- Secondary metrics ignored in summary
- Full results available in JSON
Sequential Evaluation
- Branches tested sequentially
- Long runtime for many branches
- Consider parallel branch testing
Working Directory State
- Modifies git state
- May conflict with local changes
- Best run on clean working tree
Related Principles
- Environment_Setup: Setting up evaluation environment
- Model_Inference: Running model inference
- Results_Output: Result file formats
- Metric_Aggregation: Computing metrics
Implementations
- Regression_Testing: Main regression testing script
- Implementation:EvolvingLMMs_Lab_Lmms_eval_Regression_Testing