Environment: mlfoundations/open_flamingo Evaluation Dependencies
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Evaluation, Computer_Vision |
| Last Updated | 2026-02-08 03:30 GMT |
Overview
This environment provides evaluation-specific dependencies, including `pycocoevalcap` for CIDEr captioning metrics, `nltk` for text processing, `scikit-learn` for ROC-AUC scoring, and `scipy` for mathematical operations.
Description
This environment extends the base OpenFlamingo dependencies with evaluation-specific packages: `pycocoevalcap` provides COCO captioning evaluation metrics (CIDEr); `pycocotools` provides the COCO API for dataset access; `nltk` handles text preprocessing for VQA evaluation; `scikit-learn` provides the ROC-AUC metric for Hateful Memes classification; and `inflection` normalizes VQA answers for scoring.
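The repository's exact VQA normalization rules are not reproduced here; as a hedged sketch of what answer normalization for scoring typically involves (the function name and rules below are illustrative assumptions, not the repo's implementation), a minimal normalizer might lowercase, strip punctuation, and collapse whitespace so answer variants compare equal:

```python
import re
import string

def normalize_vqa_answer(answer: str) -> str:
    """Illustrative VQA answer normalizer (hypothetical; not the repo's exact rules).

    Lowercases, removes punctuation, and collapses whitespace so that
    answers like "A  Dog!" and "a dog" compare equal during scoring.
    """
    answer = answer.lower().strip()
    # Drop punctuation characters entirely.
    answer = answer.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace to single spaces.
    return re.sub(r"\s+", " ", answer)

print(normalize_vqa_answer("A  Dog!"))  # -> "a dog"
```

The actual pipeline additionally uses `inflection` (e.g. for singular/plural handling), which this stdlib-only sketch omits.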
Usage
Use this environment for the Few-Shot Evaluation workflow. It is required for running captioning evaluation (CIDEr on COCO/Flickr30K), VQA evaluation (accuracy on VQAv2/OK-VQA/VizWiz/TextVQA), and classification evaluation (top-1 accuracy on ImageNet, ROC-AUC on Hateful Memes).
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Disk | Varies by benchmark | COCO images ~20GB, ImageNet ~150GB, others 1-10GB each |
| RAM | 32GB+ recommended | RICES feature caching loads all features into memory |
Dependencies
Python Packages
- `pycocoevalcap`
- `pycocotools`
- `scipy`
- `torchvision`
- `nltk`
- `inflection`
- `scikit-learn`
- `tqdm`
- `requests`
Development Packages
- `black`
- `mypy`
- `pylint`
- `pytest`
Credentials
No credentials are required. Benchmark datasets must be pre-downloaded to local disk.
Quick Install
# Install evaluation extras via setup.py
pip install -e ".[eval]"
# Or install manually
pip install pycocoevalcap pycocotools scipy torchvision nltk inflection scikit-learn tqdm requests
# Or from requirements file
pip install -r requirements-eval.txt
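After installing, it can be useful to confirm every evaluation dependency is importable. The sketch below (a generic checker, not a script shipped with the repo) maps pip package names to import names — note that `scikit-learn` installs as the `sklearn` module — and reports anything missing:

```python
import importlib.util

# Map pip package names to their import names where they differ.
EVAL_IMPORTS = {
    "pycocoevalcap": "pycocoevalcap",
    "pycocotools": "pycocotools",
    "scipy": "scipy",
    "torchvision": "torchvision",
    "nltk": "nltk",
    "inflection": "inflection",
    "scikit-learn": "sklearn",
    "tqdm": "tqdm",
    "requests": "requests",
}

def missing_packages(import_map: dict) -> list:
    """Return pip names whose import cannot be found in this environment."""
    return [pip for pip, mod in import_map.items()
            if importlib.util.find_spec(mod) is None]

if __name__ == "__main__":
    missing = missing_packages(EVAL_IMPORTS)
    if missing:
        print("Missing eval dependencies:", ", ".join(missing))
    else:
        print("All eval dependencies found.")
```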
Code Evidence
Evaluation extras from `setup.py:19-27`:
EVAL = [
"scipy",
"torchvision",
"nltk",
"inflection",
"pycocoevalcap",
"pycocotools",
"tqdm",
]
CIDEr metric usage from `open_flamingo/eval/coco_metric.py`:
from pycocoevalcap.eval import COCOEvalCap
ROC-AUC scoring for Hateful Memes from `open_flamingo/eval/evaluate.py:11`:
from sklearn.metrics import roc_auc_score
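`roc_auc_score` takes binary ground-truth labels and predicted scores. A toy illustration (invented numbers, not Hateful Memes data):

```python
from sklearn.metrics import roc_auc_score

# Toy binary labels and model confidence scores (illustrative only).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC = fraction of (positive, negative) pairs ranked correctly: 3 of 4 here.
print(roc_auc_score(y_true, y_score))  # -> 0.75
```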
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: pycocoevalcap` | pycocoevalcap not installed | `pip install pycocoevalcap pycocotools` |
| `FileNotFoundError` on dataset paths | Benchmark data not downloaded | Download COCO, ImageNet, etc. to paths specified in CLI args |
| `Only 0 shot eval is supported for non-open_flamingo models` | Trying few-shot eval with BLIP | Use `--shots 0` for non-OpenFlamingo models |
| `Number of trial seeds must be == number of trials` | Mismatch between `--num_trials` and `--trial_seeds` | Provide matching counts |
Compatibility Notes
- BLIP-2 support: Evaluation framework supports BLIP-2 via separate model wrapper, but only zero-shot evaluation is supported for non-OpenFlamingo models.
- RICES features: Can be pre-cached to disk as `.pkl` files via `cache_rices_features.py` to avoid recomputation.
- VQA test-dev: For VQAv2 and VizWiz, when no test annotations are available, results are formatted for EvalAI submission.
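The exact cache layout written by `cache_rices_features.py` is not documented here; as a generic sketch of the load-or-compute pattern behind `.pkl` feature caching (file names and the `compute_fn` hook are assumptions for illustration):

```python
import os
import pickle

def save_features(features, path: str) -> None:
    """Serialize a feature object (e.g. an array of embeddings) to a .pkl file."""
    with open(path, "wb") as f:
        pickle.dump(features, f)

def load_or_compute(path: str, compute_fn):
    """Load cached features if the file exists; otherwise compute and cache them."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    features = compute_fn()  # expensive step, run only on cache miss
    save_features(features, path)
    return features

# Usage sketch (hypothetical names):
# features = load_or_compute("coco_train_features.pkl", extract_features)
```

Note the RAM requirement above: whatever is cached to disk is still loaded fully into memory at evaluation time.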