Overview
Utility functions for evaluating models on the VDC (Video Detailed Captioning) benchmark, using an SGLang-based LLM judge to assess caption quality through QA-based verification.
Description
This module implements VDC evaluation as a two-stage pipeline: (1) the model under test generates a video caption of one of five types (short, detailed, main_object, camera, or background); (2) an SGLang-powered LLM judge answers each question in the document's QA list from that caption (gener_pred_response) and scores the answer against the reference answer (gener_pred_score). The prompt for each caption type is drawn at random from a set of templates to test model robustness. Two metrics are reported: accuracy (the proportion of questions the judge marks as correctly answered, yes/no) and a quality score on a 0-5 scale. LLM-based evaluation requires an SGLang server running locally on port 30000.
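To make the two metrics concrete, here is a minimal sketch of how per-question judge verdicts could be collapsed into one accuracy and one mean score. The function name `summarize_judgements` and the verdict layout (`pred`, `score`) are assumptions for illustration, not the module's actual internals.

```python
from typing import Dict, List


def summarize_judgements(judgements: List[Dict]) -> Dict[str, float]:
    """Collapse per-question judge verdicts into one accuracy and one mean score.

    Each verdict is assumed (hypothetically) to look like
    {"pred": "yes" | "no", "score": <number between 0 and 5>}.
    """
    if not judgements:
        # No QA pairs: report zeros instead of dividing by zero.
        return {"acc": 0.0, "score": 0.0}

    # A question counts as correct when the judge's verdict is "yes".
    correct = sum(1 for j in judgements if str(j["pred"]).strip().lower() == "yes")
    total_score = sum(float(j["score"]) for j in judgements)
    n = len(judgements)
    return {"acc": correct / n, "score": total_score / n}


example = [
    {"pred": "yes", "score": 5},
    {"pred": "no", "score": 2},
    {"pred": "yes", "score": 4},
    {"pred": "Yes", "score": 5},
]
summary = summarize_judgements(example)
print(summary)  # {'acc': 0.75, 'score': 4.0}
```

Three of the four verdicts are "yes" (case-insensitively), giving an accuracy of 0.75, and the scores sum to 16, giving a mean of 4.0.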
Usage
Use this module when evaluating video understanding models on detailed captioning tasks. Set the HF_HOME environment variable to the video cache location. Five captioning modes are supported: short (one-sentence summary), detailed (more than three sentences), main_object (subject actions and movements), camera (viewpoint and camera movements), and background (environment and setting). Scoring with llmms_eval requires a running SGLang server.
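Because scoring depends on a local server, a quick preflight check can save a failed run. The sketch below only tests whether something accepts TCP connections on the port; it does not confirm that the listener is actually SGLang or that a model is loaded. The helper name `is_port_open` is hypothetical.

```python
import socket


def is_port_open(host: str = "localhost", port: int = 30000, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False


if is_port_open():
    print("Something is listening on localhost:30000.")
else:
    print("Nothing is listening on localhost:30000; start the SGLang server before scoring.")
```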
Code Reference
Source Location
Signature
def vdc_doc_to_visual(doc: Dict) -> List[str]
def vdc_doc_to_text_short(doc: Dict, lmms_eval_specific_kwargs: Optional[Dict] = None) -> str
def vdc_doc_to_text_detailed(doc: Dict, lmms_eval_specific_kwargs: Optional[Dict] = None) -> str
def vdc_doc_to_text_main_object(doc: Dict, lmms_eval_specific_kwargs: Optional[Dict] = None) -> str
def vdc_doc_to_text_camera(doc: Dict, lmms_eval_specific_kwargs: Optional[Dict] = None) -> str
def vdc_doc_to_text_background(doc: Dict, lmms_eval_specific_kwargs: Optional[Dict] = None) -> str
def vdc_doc_to_answer(doc: Dict) -> str
@function
def gener_pred_response(s, pred_cap: str, q: str)
@function
def gener_pred_score(s, qa: Dict)
def llmms_eval(data_dict: Dict) -> Dict
def vdc_process_results_generic(doc: Dict, result: List[str]) -> Dict
def vdc_aggregate_score(results: List[Dict], args: Any) -> float
def vdc_aggregate_acc(results: List[Dict], args: Any) -> float
Import
from lmms_eval.tasks.vdc.utils import (
    vdc_doc_to_visual,
    vdc_doc_to_text_detailed,
    vdc_process_results_generic,
    vdc_aggregate_score,
    vdc_aggregate_acc,
)
I/O Contract
vdc_doc_to_visual Input
| Field | Type | Description |
|---|---|---|
| doc["video_name"] | str | Video filename without extension (e.g., "video_001") |
vdc_doc_to_visual Output
| Field | Type | Description |
|---|---|---|
| video_path | List[str] | Single-element list with absolute path to video file (.mp4, .MP4, or .mkv) |
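The lookup implied by this contract can be sketched as follows: try each supported extension in turn and return the first file that exists, wrapped in a single-element list. This is an illustration under assumptions, not the module's code; `resolve_video_path`, `VIDEO_EXTENSIONS`, and the `cache_dir` argument are hypothetical names, and the real function derives its directory from HF_HOME.

```python
import os
import tempfile
from typing import List

# Extensions listed in the output contract, tried in this order.
VIDEO_EXTENSIONS = (".mp4", ".MP4", ".mkv")


def resolve_video_path(cache_dir: str, video_name: str) -> List[str]:
    """Return [absolute_path] for the first existing file named video_name + extension."""
    for ext in VIDEO_EXTENSIONS:
        candidate = os.path.abspath(os.path.join(cache_dir, video_name + ext))
        if os.path.exists(candidate):
            return [candidate]
    raise FileNotFoundError(f"no video found for {video_name!r} in {cache_dir!r}")


# Demonstrate with a throwaway directory containing an empty .mkv file.
with tempfile.TemporaryDirectory() as cache_dir:
    open(os.path.join(cache_dir, "video_001.mkv"), "wb").close()
    paths = resolve_video_path(cache_dir, "video_001")
    print(os.path.basename(paths[0]))  # video_001.mkv
```

Raising an error when no candidate exists is a design choice of this sketch; it surfaces a misconfigured cache location early rather than passing a nonexistent path to the model.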
llmms_eval Input
| Field | Type | Description |
|---|---|---|
| data_dict["video_name"] | str | Video identifier |
| data_dict["pred"] | str | Model-generated caption |
| data_dict["qa_list"] | List[Dict] | QA pairs with keys: question, answer |
llmms_eval Output
| Field | Type | Description |
|---|---|---|
| video_name | str | Video identifier |
| score | float | Average quality score (0.0 to 5.0) |
| acc | float | Proportion of correct answers (0.0 to 1.0) |
Prompt Templates
| Type | Description | Example Prompt |
|---|---|---|
| SHORT_CAPTION | One-sentence summary | "Write a one-sentence summary of the video." |
| DETAILED_CAPTION | 3+ sentence description | "Provide a faithfully detailed description of this video in more than three sentences." |
| MAIN_OBJECT_CAPTION | Subject actions/movements | "Description of the main subject actions or status sequence..." |
| CAMERA_CAPTION | Camera movements/angles | "Summary of the view shot, camera movement and changes in shooting angles..." |
| BACKGROUND_CAPTION | Environment/setting | "Summary of the background including objects, location, weather, and time." |
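Randomized prompt selection can be sketched as a seeded random choice from a per-type pool. The dictionary and function below are hypothetical; only the first template in each pool comes from the table, and the second is an invented paraphrase added so that there is something to choose between.

```python
import random
from typing import Dict, List

# Hypothetical template pools keyed by caption type.
PROMPT_TEMPLATES: Dict[str, List[str]] = {
    "short": [
        "Write a one-sentence summary of the video.",  # from the table
        "Summarize the video in a single sentence.",  # invented paraphrase
    ],
    "background": [
        "Summary of the background including objects, location, weather, and time.",  # from the table
        "Describe the setting: objects, location, weather, and time of day.",  # invented paraphrase
    ],
}


def pick_prompt(caption_type: str, rng: random.Random) -> str:
    """Pick one template for the given caption type using the supplied generator."""
    if caption_type not in PROMPT_TEMPLATES:
        raise KeyError(f"unknown caption type: {caption_type!r}")
    return rng.choice(PROMPT_TEMPLATES[caption_type])


# A seeded generator makes the choice reproducible across runs.
rng = random.Random(0)
prompt = pick_prompt("short", rng)
print(prompt)
```

Passing an explicit `random.Random` instance, rather than relying on the global generator, keeps evaluation runs comparable when the seed is fixed.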
Usage Examples
import os

from lmms_eval.tasks.vdc.utils import (
    vdc_doc_to_visual,
    vdc_doc_to_text_detailed,
    vdc_process_results_generic,
    vdc_aggregate_score,
    vdc_aggregate_acc,
    llmms_eval,
)

# Set up environment
os.environ["HF_HOME"] = "/path/to/cache"

# Load video path
doc = {
    "video_name": "sample_video",
    "caption": "A person walks through a park on a sunny day",
    "qa_list": [
        {"question": "What is the person doing?", "answer": "walking"},
        {"question": "What is the weather?", "answer": "sunny"},
    ],
}
video_paths = vdc_doc_to_visual(doc)
print(video_paths[0])  # /path/to/cache/Test_Videos/sample_video.mp4

# Generate detailed caption prompt (randomized)
prompt = vdc_doc_to_text_detailed(doc)
print(prompt)
# "Please imagine the video based on the sequence of frames, and provide
# a faithfully detailed description of this video in more than three sentences."

# Process model results with LLM judge
model_caption = [
    "A person is walking through a green park with trees. "
    "The weather appears sunny with clear skies. "
    "The person walks at a steady pace along a paved path."
]
results = vdc_process_results_generic(doc, model_caption)
print(results["llmms_eval_score"]["score"])  # e.g., 4.5 (out of 5)
print(results["llmms_eval_acc"]["acc"])  # e.g., 1.0 (both QAs correct)

# Aggregate scores across dataset
all_scores = [
    {"video_name": "vid1", "score": 4.5, "acc": 1.0},
    {"video_name": "vid2", "score": 3.8, "acc": 0.8},
    {"video_name": "vid3", "score": 4.2, "acc": 0.9},
]
avg_score = vdc_aggregate_score(all_scores, args=None)
avg_acc = vdc_aggregate_acc(all_scores, args=None)
print(f"Average Score: {avg_score:.2f}")  # 4.17
print(f"Average Accuracy: {avg_acc:.2f}")  # 0.90

# Direct LLM evaluation (requires SGLang server)
from sglang import set_default_backend, RuntimeEndpoint

set_default_backend(RuntimeEndpoint("http://localhost:30000"))
eval_data = {
    "video_name": "sample_video",
    "pred": model_caption[0],
    "qa_list": doc["qa_list"],
}
eval_result = llmms_eval(eval_data)
print(f"Case Score: {eval_result['score']:.2f}")
print(f"Case Accuracy: {eval_result['acc']:.2f}")
Related Pages