Implementation:EvolvingLMMs Lab Lmms eval VATEX Utils
Source File: `lmms_eval/tasks/vatex/utils.py`
Principle: [[../principles/EvolvingLMMs_Lab_Lmms_eval_Task_Utility_Functions|Task_Utility_Functions]]
Overview
The VATEX Utils module provides evaluation functions for VATEX, a multilingual (English and Chinese) video captioning benchmark. It builds few-shot prompts for caption generation in both languages, scores validation-set predictions with standard captioning metrics, and writes submission files for the validation and test sets.
Key Functions
Document Processing
vatex_ZH_doc_to_visual(doc) - Prepares video path for Chinese validation set
- Reads YAML configuration to get cache directory
- Constructs video path from video ID and cache location
- Checks multiple file extensions: .mp4, .MP4, .mkv
- Exits with error if video not found
- Returns list containing video file path
vatex_test_doc_to_visual(doc) - Prepares video path for test set
- Similar logic to Chinese validation function
- Reads test set YAML configuration
- Handles multiple video formats
- Returns list containing video file path
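The lookup described above can be sketched as follows. This is a minimal illustration rather than the module's code: the helper name `resolve_video_path` and its parameters are hypothetical, and it raises an exception where the module is described as exiting, so that the sketch stays easy to test.

```python
import os

# Extensions tried in order, as listed above.
VIDEO_EXTENSIONS = (".mp4", ".MP4", ".mkv")


def resolve_video_path(cache_dir, video_id):
    """Return a one-element list with the first existing video file.

    Hypothetical helper: the real functions read the cache directory
    from a YAML file and exit the process when no file is found.
    """
    for ext in VIDEO_EXTENSIONS:
        candidate = os.path.join(cache_dir, video_id + ext)
        if os.path.exists(candidate):
            return [candidate]
    raise FileNotFoundError(
        f"no video for {video_id!r} in {cache_dir!r} "
        f"(tried {', '.join(VIDEO_EXTENSIONS)})"
    )
```

Returning a list, even for a single file, matches the described return value of the `doc_to_visual` functions.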
Prompt Generation
vatex_ZH_doc_to_text(doc, lmms_eval_specific_kwargs=None) - Generates prompt for Chinese caption generation
- Includes 4-shot examples in Chinese:
  - Video 1: Mountain climbing scene
  - Video 2: Simulated drumming
  - Video 3: Hand gestures at desk
  - Video 4: Applying face cream
- Appends configured prompt from kwargs
- Returns formatted prompt with examples
vatex_test_doc_to_text(doc, lmms_eval_specific_kwargs=None) - Generates prompt for English test set
- Includes 4-shot examples in English:
  - Video 1: Shoe care items
  - Video 2: Cooking with frying pan
  - Video 3: Cross stitch demonstration
  - Video 4: Girl doing flips
- Appends configured prompt from kwargs
- Returns formatted prompt with examples
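A few-shot prompt of this kind could be assembled roughly as shown below. The helper name `build_few_shot_prompt`, the `[Video N] Output:` layout, and the `"prompt"` key are assumptions made for illustration, and the example strings are the topic summaries listed above, not the actual captions embedded in the module.

```python
# Topic summaries from the list above, standing in for the real captions.
EXAMPLE_CAPTIONS = [
    "Shoe care items",
    "Cooking with frying pan",
    "Cross stitch demonstration",
    "Girl doing flips",
]


def build_few_shot_prompt(examples, lmms_eval_specific_kwargs=None):
    """Join numbered example captions, then append the configured prompt.

    The "prompt" key is an assumed name for the configured prompt text.
    """
    kwargs = lmms_eval_specific_kwargs or {}
    lines = [
        f"[Video {i}] Output: {caption}"
        for i, caption in enumerate(examples, start=1)
    ]
    lines.append(kwargs.get("prompt", ""))
    return "\n".join(lines)
```

Calling `build_few_shot_prompt(EXAMPLE_CAPTIONS, {"prompt": "Describe the video."})` yields four example lines followed by the instruction.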
Results Processing
vatex_process_result(doc, result) - Processes English caption predictions
- Extracts prediction from result list
- Creates data dictionary with:
  - English ground truth captions (enCap)
  - Model prediction
  - Video ID
- Returns dictionary mapping each metric to the data dictionary
vatex_process_CN_result(doc, result) - Processes Chinese caption predictions
- Similar to English processing
- Uses Chinese ground truth captions (chCap)
- Returns metric-mapped data dictionary
vatex_test_process_result(doc, result) - Processes test set predictions
- Creates passthrough structure (no metrics computed)
- Returns dictionary with image ID and prediction
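The per-document processing step can be pictured with the sketch below. Only the `enCap` field and the metric names come from this page; the function name, the `videoID` field, the dictionary keys `answer`, `pred`, and `video_id`, and the `vatex_` key prefix are assumptions for illustration.

```python
VATEX_METRICS = ["Bleu_4", "Bleu_3", "Bleu_2", "Bleu_1",
                 "METEOR", "ROUGE_L", "CIDEr"]


def process_result_sketch(doc, result, caption_field="enCap"):
    """Map every metric name to the same per-document data dictionary.

    Pass caption_field="chCap" for the Chinese variant described above.
    """
    pred = result[0] if len(result) > 0 else ""
    data = {
        "answer": doc[caption_field],  # list of reference captions
        "pred": pred,                  # model prediction
        "video_id": doc["videoID"],    # assumed field name
    }
    return {f"vatex_{metric}": data for metric in VATEX_METRICS}
```

Every metric key points at the same dictionary, so each metric's aggregation function later receives the full set of predictions and references.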
Metrics Aggregation
vatex_aggregation_result(results, metric, args=None) - Aggregates predictions and computes validation metrics
- Creates COCO-format dataset structure:
  - Uses video IDs as image IDs
  - Multiple reference captions per video
- Initializes COCO evaluation pipeline
- Tokenizes using PTBTokenizer
- Computes requested metric
- Handles Bleu score list extraction
- Generates submission file "vatex_captions_val_results.json"
- Saves predictions to JSON
- Returns scalar metric score
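The data preparation for this step might look like the sketch below, which deliberately stops before the pycocotools and pycocoevalcap calls so that it runs without those packages. The helper names and the input keys (`video_id`, `pred`, `answer`) are assumptions. The Bleu handling reflects that pycocoevalcap's Bleu scorer returns a list with one score per n-gram order, from which a single entry is selected.

```python
def build_coco_structures(results):
    """Build a COCO-style reference dataset and a prediction list."""
    dataset = {"annotations": [], "images": []}
    predictions = []
    ann_id = 0
    for r in results:
        # One prediction per video; the video ID plays the image ID role.
        predictions.append({"image_id": r["video_id"], "caption": r["pred"]})
        # Each reference caption becomes its own annotation.
        for ref in r["answer"]:
            dataset["annotations"].append(
                {"image_id": r["video_id"], "caption": ref, "id": ann_id}
            )
            ann_id += 1
        dataset["images"].append({"id": r["video_id"]})
    return dataset, predictions


def pick_score(metric, score):
    """Reduce a scorer output to a scalar; Bleu yields one score per order."""
    if metric.startswith("Bleu_"):
        order = int(metric.split("_")[-1])
        return score[order - 1]
    return score
```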
vatex_test_aggregation_result(results, args) - Aggregates test set predictions for submission
- Collects predictions with image IDs
- Generates submission file "vatex_captions_test2014_alg_results.json"
- Provides submission instructions
- No metrics computed
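Writing the test-set submission amounts to dumping the collected predictions as JSON. The sketch below uses plain `json` rather than the module's `generate_submission_file` helper, whose signature is not shown here; the function name `write_submission`, the `out_dir` parameter, and the record keys are assumptions.

```python
import json
import os


def write_submission(results, out_dir,
                     filename="vatex_captions_test2014_alg_results.json"):
    """Write predictions as a JSON list and return the file path."""
    predictions = [
        {"image_id": r["image_id"], "caption": r["pred"]} for r in results
    ]
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, filename)
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps any non-ASCII captions human-readable.
        json.dump(predictions, f, ensure_ascii=False, indent=2)
    return path
```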
Metric-Specific Functions
vatex_bleu1(results, args=None) through vatex_bleu4(results, args=None) - Compute BLEU scores for n-gram orders 1 to 4
vatex_meteor(results, args=None) - Computes METEOR score
vatex_rougel(results, args=None) - Computes ROUGE-L score
vatex_cider(results, args=None) - Computes CIDEr score
vatex_spice(results, args=None) - Computes SPICE score
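Functions of this shape are typically thin wrappers that fix the metric name and delegate to a shared aggregation routine. The sketch below illustrates that pattern with a stand-in `aggregate` function that only echoes its inputs; it is a hypothetical placeholder for the real aggregation step and computes no scores.

```python
def aggregate(results, metric, args=None):
    """Stand-in for the shared aggregation step (computes nothing)."""
    return {"metric": metric, "n_predictions": len(results)}


def vatex_bleu4(results, args=None):
    # Fix the metric name and delegate.
    return aggregate(results, "Bleu_4", args)


def vatex_cider(results, args=None):
    return aggregate(results, "CIDEr", args)
```

One function per metric lets a task configuration reference each metric's aggregation by name without passing extra arguments.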
Configuration
Active Metrics
VATEX_METRICS = ["Bleu_4", "Bleu_3", "Bleu_2", "Bleu_1",
"METEOR", "ROUGE_L", "CIDEr"]
Cache Directory
Base cache directory is determined from:
- Environment variable HF_HOME
- Default: ~/.cache/huggingface/
- Task-specific cache path read from YAML configuration
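The resolution order above can be sketched as follows. The function name `resolve_cache_dir` and its `env` parameter are hypothetical, and the task-specific cache name is passed in directly instead of being read from YAML.

```python
import os


def resolve_cache_dir(cache_name, env=None):
    """Join the base cache directory with a task-specific cache name.

    HF_HOME takes precedence; otherwise ~/.cache/huggingface/ is used.
    """
    env = os.environ if env is None else env
    hf_home = env.get("HF_HOME", "~/.cache/huggingface/")
    base_cache_dir = os.path.expanduser(hf_home)
    return os.path.join(base_cache_dir, cache_name)
```

Accepting `env` as a parameter is a choice made for this sketch so the behavior can be checked without modifying the process environment.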
Few-Shot Examples
Both English and Chinese prompts include 4 example videos with reference captions to demonstrate the task format.
Design Characteristics
- Multilingual Support: Separate processing for English and Chinese captions
- Few-Shot Learning: Includes example videos in prompts
- Video Format Handling: Checks multiple video file extensions
- Configuration-Driven: Reads cache paths from YAML files
- Multi-Reference Evaluation: Supports multiple ground truth captions per video
- Submission Generation: Creates files for server evaluation
- COCO Evaluation Framework: Uses standard captioning metrics
Dependencies
- json - JSON operations for submission files
- os - File system operations
- sys - System exit on errors
- pathlib.Path - Path manipulation
- yaml - YAML configuration parsing
- loguru.logger - Logging
- pycocoevalcap.eval - Captioning metrics (Bleu, Cider, Meteor, Rouge)
- pycocoevalcap.tokenizer.ptbtokenizer.PTBTokenizer - Tokenization
- pycocotools.coco.COCO - COCO dataset handling
- lmms_eval.tasks._task_utils.file_utils.generate_submission_file - File generation
Usage Context
This module supports the VATEX video captioning benchmark, which tests models' ability to describe video content in natural language. It handles both English and Chinese evaluation, uses few-shot prompting to guide models, and generates submission files for official benchmark evaluation.