Implementation:EvolvingLMMs Lab Lmms eval VDC Video Captioning Utils

Knowledge Sources	EvolvingLMMs_Lab_Lmms_eval
Domains	Vision, Evaluation, Video_Understanding, Captioning
Last Updated	2026-02-14 00:00 GMT

Overview

Utility functions for evaluating models on the VDC (Video Detailed Captioning) benchmark. An SGLang-based LLM judge assesses caption quality through QA-based verification.

Description

This module implements VDC evaluation as a two-stage pipeline: (1) the model under test generates a video caption of one of five types (short, detailed, main_object, camera, or background); (2) an SGLang-powered LLM judge answers each question in the document's qa_list using only that caption, then scores each answer against the reference answer. The prompt for each caption type is drawn at random from a set of templates to test the model's robustness to phrasing. Two metrics are reported: accuracy (the proportion of answers judged correct, yes/no) and a quality score on a 0-5 scale. LLM-based evaluation requires an SGLang server running locally on port 30000.
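The second stage can be sketched without an SGLang server by substituting stub functions for the judge. In this minimal sketch, `evaluate_caption`, `stub_answer`, and `stub_judge` are hypothetical names, and the `pred`/`score` fields in each judgment are an assumed format rather than the module's confirmed output.

```python
from typing import Callable, Dict, List


def evaluate_caption(
    pred_caption: str,
    qa_list: List[Dict[str, str]],
    answer_from_caption: Callable[[str, str], str],
    judge_answer: Callable[[str, str, str], Dict],
) -> Dict[str, float]:
    """Score one caption against reference QA pairs (illustrative only)."""
    judgments = []
    for qa in qa_list:
        # Answer the question using only the predicted caption.
        predicted = answer_from_caption(pred_caption, qa["question"])
        # Compare the predicted answer with the reference answer.
        judgments.append(judge_answer(qa["question"], qa["answer"], predicted))

    if not judgments:
        return {"score": 0.0, "acc": 0.0}  # assumed convention for empty input
    n = len(judgments)
    return {
        "score": sum(j["score"] for j in judgments) / n,
        "acc": sum(j["pred"] == "yes" for j in judgments) / n,
    }


# Hypothetical stubs standing in for the LLM judge.
def stub_answer(caption: str, question: str) -> str:
    return caption  # a real judge would answer the question from the caption


def stub_judge(question: str, reference: str, predicted: str) -> Dict:
    correct = reference.lower() in predicted.lower()
    return {"pred": "yes" if correct else "no", "score": 5 if correct else 0}


result = evaluate_caption(
    "A person is walking through a park on a sunny day.",
    [
        {"question": "What is the person doing?", "answer": "walking"},
        {"question": "What is the weather?", "answer": "rainy"},
    ],
    stub_answer,
    stub_judge,
)
print(result)
```

Passing the judge in as callables keeps the scoring arithmetic testable offline; the real module calls the SGLang functions directly.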

Usage

Use this module when evaluating video understanding models on video captioning tasks. Set the HF_HOME environment variable to the cache location that holds the videos. The module supports five captioning modes: short (one-sentence summary), detailed (more than three sentences), main_object (subject actions and movements), camera (viewpoint and camera movements), and background (environment and setting). The llmms_eval scoring step requires a running SGLang server.

Code Reference

Source Location

Repository: EvolvingLMMs_Lab_Lmms_eval
File: lmms_eval/tasks/vdc/utils.py

Signature

def vdc_doc_to_visual(doc: Dict) -> List[str]
def vdc_doc_to_text_short(doc: Dict, lmms_eval_specific_kwargs: Optional[Dict] = None) -> str
def vdc_doc_to_text_detailed(doc: Dict, lmms_eval_specific_kwargs: Optional[Dict] = None) -> str
def vdc_doc_to_text_main_object(doc: Dict, lmms_eval_specific_kwargs: Optional[Dict] = None) -> str
def vdc_doc_to_text_camera(doc: Dict, lmms_eval_specific_kwargs: Optional[Dict] = None) -> str
def vdc_doc_to_text_background(doc: Dict, lmms_eval_specific_kwargs: Optional[Dict] = None) -> str
def vdc_doc_to_answer(doc: Dict) -> str

@function
def gener_pred_response(s, pred_cap: str, q: str)

@function
def gener_pred_score(s, qa: Dict)

def llmms_eval(data_dict: Dict) -> Dict
def vdc_process_results_generic(doc: Dict, result: List[str]) -> Dict
def vdc_aggregate_score(results: List[Dict], args: Any) -> float
def vdc_aggregate_acc(results: List[Dict], args: Any) -> float

Import

from lmms_eval.tasks.vdc.utils import (
    vdc_doc_to_visual,
    vdc_doc_to_text_detailed,
    vdc_process_results_generic,
    vdc_aggregate_score,
    vdc_aggregate_acc
)

I/O Contract

vdc_doc_to_visual Input

Field	Type	Description
doc["video_name"]	str	Video filename without extension (e.g., "video_001")

vdc_doc_to_visual Output

Field	Type	Description
video_path	List[str]	Single-element list with absolute path to video file (.mp4, .MP4, or .mkv)
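One way such a lookup could work is to try each supported extension in turn. The sketch below is an assumption about the approach, not the module's code: `resolve_video_path` is a hypothetical helper, the cache directory is passed in explicitly instead of being derived from HF_HOME, and raising `FileNotFoundError` on a miss is an assumed behavior.

```python
import os
import tempfile
from typing import List


def resolve_video_path(cache_dir: str, video_name: str) -> List[str]:
    """Return a single-element list with the first matching video file."""
    for ext in (".mp4", ".MP4", ".mkv"):
        candidate = os.path.join(cache_dir, video_name + ext)
        if os.path.exists(candidate):
            return [candidate]
    raise FileNotFoundError(f"No video found for {video_name!r} in {cache_dir}")


# Demonstration with a temporary directory standing in for the video cache.
with tempfile.TemporaryDirectory() as cache_dir:
    open(os.path.join(cache_dir, "video_001.mkv"), "wb").close()
    paths = resolve_video_path(cache_dir, "video_001")
    found_name = os.path.basename(paths[0])
    print(found_name)
```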

llmms_eval Input

Field	Type	Description
data_dict["video_name"]	str	Video identifier
data_dict["pred"]	str	Model-generated caption
data_dict["qa_list"]	List[Dict]	QA pairs with keys: question, answer

llmms_eval Output

Field	Type	Description
video_name	str	Video identifier
score	float	Average quality score (0.0 to 5.0)
acc	float	Proportion of correct answers (0.0 to 1.0)
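Per-video results of this shape are what the aggregation functions consume. As a minimal sketch, assuming dataset-level aggregation is a plain arithmetic mean, the hypothetical helper `mean_of` below averages one field across results; returning 0.0 for an empty list is an assumed convention.

```python
from typing import Dict, List


def mean_of(results: List[Dict], key: str) -> float:
    """Average one numeric field across per-video results."""
    if not results:
        return 0.0  # assumed convention for an empty result set
    return sum(r[key] for r in results) / len(results)


per_video = [
    {"video_name": "vid1", "score": 4.0, "acc": 1.0},
    {"video_name": "vid2", "score": 3.0, "acc": 0.5},
]
avg_score = mean_of(per_video, "score")
avg_acc = mean_of(per_video, "acc")
print(avg_score, avg_acc)
```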

Prompt Templates

Type	Description	Example Prompt
SHORT_CAPTION	One-sentence summary	"Write a one-sentence summary of the video."
DETAILED_CAPTION	3+ sentence description	"Provide a faithfully detailed description of this video in more than three sentences."
MAIN_OBJECT_CAPTION	Subject actions/movements	"Description of the main subject actions or status sequence..."
CAMERA_CAPTION	Camera movements/angles	"Summary of the view shot, camera movement and changes in shooting angles..."
BACKGROUND_CAPTION	Environment/setting	"Summary of the background including objects, location, weather, and time."
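Randomized prompt selection can be sketched as a uniform draw from a per-type template pool. The pool below contains only the two detailed-caption phrasings quoted on this page; the real pool may hold more. `DETAILED_PROMPTS` and `pick_prompt` are hypothetical names, and the optional seeded generator is an addition for reproducibility, not a documented feature of the module.

```python
import random
from typing import List, Optional

# Illustrative pool for the detailed-caption type.
DETAILED_PROMPTS: List[str] = [
    "Provide a faithfully detailed description of this video "
    "in more than three sentences.",
    "Please imagine the video based on the sequence of frames, and provide "
    "a faithfully detailed description of this video in more than three sentences.",
]


def pick_prompt(templates: List[str], rng: Optional[random.Random] = None) -> str:
    """Draw one template uniformly at random."""
    chooser = rng if rng is not None else random
    return chooser.choice(templates)


# Seeding a dedicated generator makes the draw repeatable across runs.
prompt = pick_prompt(DETAILED_PROMPTS, random.Random(0))
print(prompt)
```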

Usage Examples

import os
from lmms_eval.tasks.vdc.utils import (
    vdc_doc_to_visual,
    vdc_doc_to_text_detailed,
    vdc_process_results_generic,
    llmms_eval
)

# Set up environment
os.environ["HF_HOME"] = "/path/to/cache"

# Load video path
doc = {
    "video_name": "sample_video",
    "caption": "A person walks through a park on a sunny day",
    "qa_list": [
        {"question": "What is the person doing?", "answer": "walking"},
        {"question": "What is the weather?", "answer": "sunny"}
    ]
}
video_paths = vdc_doc_to_visual(doc)
print(video_paths[0])  # /path/to/cache/Test_Videos/sample_video.mp4

# Generate detailed caption prompt (randomized)
prompt = vdc_doc_to_text_detailed(doc)
print(prompt)
# "Please imagine the video based on the sequence of frames, and provide
# a faithfully detailed description of this video in more than three sentences."

# Process model results with the LLM judge (requires the SGLang server on port 30000)
model_caption = [
    "A person is walking through a green park with trees. "
    "The weather appears sunny with clear skies. "
    "The person walks at a steady pace along a paved path."
]
results = vdc_process_results_generic(doc, model_caption)
print(results["llmms_eval_score"]["score"])  # e.g., 4.5 (out of 5)
print(results["llmms_eval_acc"]["acc"])  # e.g., 1.0 (both QAs correct)

# Aggregate scores across dataset
all_scores = [
    {"video_name": "vid1", "score": 4.5, "acc": 1.0},
    {"video_name": "vid2", "score": 3.8, "acc": 0.8},
    {"video_name": "vid3", "score": 4.2, "acc": 0.9}
]
avg_score = vdc_aggregate_score(all_scores, args=None)
avg_acc = vdc_aggregate_acc(all_scores, args=None)
print(f"Average Score: {avg_score:.2f}")  # 4.17
print(f"Average Accuracy: {avg_acc:.2f}")  # 0.90

# Direct LLM evaluation (requires SGLang server)
from sglang import set_default_backend, RuntimeEndpoint
set_default_backend(RuntimeEndpoint("http://localhost:30000"))

eval_data = {
    "video_name": "sample_video",
    "pred": model_caption[0],
    "qa_list": doc["qa_list"]
}
eval_result = llmms_eval(eval_data)
print(f"Case Score: {eval_result['score']:.2f}")
print(f"Case Accuracy: {eval_result['acc']:.2f}")
