Implementation:EvolvingLMMs Lab Lmms eval VDC Video Captioning Utils

From Leeroopedia
Knowledge Sources
Domains: Vision, Evaluation, Video_Understanding, Captioning
Last Updated: 2026-02-14 00:00 GMT

Overview

Utility functions for evaluating the VDC (Video Dense Captioning) benchmark. An SGLang-based LLM judge assesses caption quality through QA-based verification.

Description

This module implements VDC evaluation as a two-stage pipeline: (1) the model under test generates a video caption of one of five types (short, detailed, main_object, camera, or background); (2) an SGLang-powered LLM judge produces answers to the reference questions from that caption and scores their correctness. Each caption type draws its prompt at random from a set of templates, which tests model robustness to prompt wording. The evaluation reports both accuracy (the proportion of yes/no judgments that are correct) and a quality score on a 0-5 scale. LLM-based evaluation requires an SGLang server running locally on port 30000.

Usage

Use this module when evaluating video understanding models on dense captioning tasks. Set the HF_HOME environment variable to the video cache location. The module supports five captioning modes: short (one-sentence summary), detailed (more than three sentences), main_object (subject actions and movements), camera (viewpoint and camera movements), and background (environment and setting). Scoring with llmms_eval requires a running SGLang server.
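Because scoring depends on a local SGLang server, it can help to confirm that the port is reachable before starting a long evaluation run. The following is a minimal sketch using only the Python standard library; the helper name `sglang_server_reachable` is hypothetical and not part of the module, and a successful TCP connection only shows that something is listening on the port, not that it is an SGLang server.

```python
import socket


def sglang_server_reachable(host: str = "127.0.0.1",
                            port: int = 30000,
                            timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`.

    Hypothetical helper (not part of lmms_eval). Port 30000 is the port
    this page names for the local SGLang server.
    """
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError,
        # timeout) when nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if not sglang_server_reachable():
        print("No server reachable on localhost:30000; start SGLang first.")
```

Running the check once at startup gives a clear early message instead of a failure partway through the run.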

Code Reference

Source Location

Signature

def vdc_doc_to_visual(doc: Dict) -> List[str]
def vdc_doc_to_text_short(doc: Dict, lmms_eval_specific_kwargs: Optional[Dict] = None) -> str
def vdc_doc_to_text_detailed(doc: Dict, lmms_eval_specific_kwargs: Optional[Dict] = None) -> str
def vdc_doc_to_text_main_object(doc: Dict, lmms_eval_specific_kwargs: Optional[Dict] = None) -> str
def vdc_doc_to_text_camera(doc: Dict, lmms_eval_specific_kwargs: Optional[Dict] = None) -> str
def vdc_doc_to_text_background(doc: Dict, lmms_eval_specific_kwargs: Optional[Dict] = None) -> str
def vdc_doc_to_answer(doc: Dict) -> str

@function
def gener_pred_response(s, pred_cap: str, q: str)

@function
def gener_pred_score(s, qa: Dict)

def llmms_eval(data_dict: Dict) -> Dict
def vdc_process_results_generic(doc: Dict, result: List[str]) -> Dict
def vdc_aggregate_score(results: List[Dict], args: Any) -> float
def vdc_aggregate_acc(results: List[Dict], args: Any) -> float

Import

from lmms_eval.tasks.vdc.utils import (
    vdc_doc_to_visual,
    vdc_doc_to_text_detailed,
    vdc_process_results_generic,
    vdc_aggregate_score,
    vdc_aggregate_acc
)

I/O Contract

vdc_doc_to_visual Input

Field | Type | Description
doc["video_name"] | str | Video filename without extension (e.g., "video_001")

vdc_doc_to_visual Output

Field | Type | Description
video_path | List[str] | Single-element list with the absolute path to the video file (.mp4, .MP4, or .mkv)
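The contract above (a filename without extension in, a single-element list with an absolute path out) can be illustrated with a small lookup over the three listed extensions. This is a sketch under stated assumptions, not the module's actual code: the function name `resolve_video_path` is hypothetical, the cache directory is passed in explicitly rather than derived from HF_HOME, and raising `FileNotFoundError` for a missing file is an illustrative choice.

```python
from pathlib import Path
from typing import List

# Extensions listed in the vdc_doc_to_visual output table.
VIDEO_EXTENSIONS = (".mp4", ".MP4", ".mkv")


def resolve_video_path(cache_dir: str, video_name: str) -> List[str]:
    """Return [absolute_path] for the first existing file matching video_name.

    Hypothetical helper: tries each known extension in order and raises
    FileNotFoundError if no candidate exists.
    """
    base = Path(cache_dir)
    for ext in VIDEO_EXTENSIONS:
        candidate = base / f"{video_name}{ext}"
        if candidate.exists():
            return [str(candidate.resolve())]
    raise FileNotFoundError(
        f"No video named {video_name!r} with extensions {VIDEO_EXTENSIONS} in {base}"
    )
```

For a doc such as `{"video_name": "video_001"}`, the call would be `resolve_video_path(cache_dir, doc["video_name"])`, with `cache_dir` pointing at the folder that holds the benchmark videos.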

llmms_eval Input

Field | Type | Description
data_dict["video_name"] | str | Video identifier
data_dict["pred"] | str | Model-generated caption
data_dict["qa_list"] | List[Dict] | QA pairs with keys: question, answer

llmms_eval Output

Field | Type | Description
video_name | str | Video identifier
score | float | Average quality score (0.0 to 5.0)
acc | float | Proportion of correct answers (0.0 to 1.0)
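To make the two output fields concrete, the sketch below reduces a list of per-question judgments to the `score` and `acc` values described in the table. It assumes, hypothetically, that each judgment is a dict with a yes/no `pred` and a numeric `score` from 0 to 5; the function name `summarize_judgements` and that dict shape are illustrative and are not taken from the module.

```python
from typing import Dict, List


def summarize_judgements(video_name: str, judgements: List[Dict]) -> Dict:
    """Reduce per-question judgments to one result dict for a video.

    Assumed (hypothetical) judgment shape: {"pred": "yes" | "no", "score": 0..5}.
    """
    if not judgements:
        raise ValueError("at least one judgment is required")
    n = len(judgements)
    # acc: share of questions the judge marked as correctly answered.
    correct = sum(1 for j in judgements if str(j["pred"]).strip().lower() == "yes")
    # score: mean of the 0-5 quality scores.
    total_score = sum(float(j["score"]) for j in judgements)
    return {"video_name": video_name, "score": total_score / n, "acc": correct / n}


example = summarize_judgements(
    "sample_video",
    [{"pred": "yes", "score": 5}, {"pred": "no", "score": 2}],
)
print(example)  # {'video_name': 'sample_video', 'score': 3.5, 'acc': 0.5}
```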

Prompt Templates

Type | Description | Example Prompt
SHORT_CAPTION | One-sentence summary | "Write a one-sentence summary of the video."
DETAILED_CAPTION | 3+ sentence description | "Provide a faithfully detailed description of this video in more than three sentences."
MAIN_OBJECT_CAPTION | Subject actions/movements | "Description of the main subject actions or status sequence..."
CAMERA_CAPTION | Camera movements/angles | "Summary of the view shot, camera movement and changes in shooting angles..."
BACKGROUND_CAPTION | Environment/setting | "Summary of the background including objects, location, weather, and time."
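Randomized prompt selection can be sketched as a random draw from a per-type list of templates. The example below is illustrative only: the `PROMPT_TEMPLATES` dict and `pick_prompt` function are hypothetical names, and the lists contain only prompts quoted in full on this page, not the module's complete template sets.

```python
import random
from typing import Dict, List, Optional

# Hypothetical template table; entries are limited to prompts quoted on this page.
PROMPT_TEMPLATES: Dict[str, List[str]] = {
    "short": [
        "Write a one-sentence summary of the video.",
    ],
    "detailed": [
        "Provide a faithfully detailed description of this video in more than three sentences.",
        "Please imagine the video based on the sequence of frames, and provide "
        "a faithfully detailed description of this video in more than three sentences.",
    ],
    "background": [
        "Summary of the background including objects, location, weather, and time.",
    ],
}


def pick_prompt(caption_type: str, rng: Optional[random.Random] = None) -> str:
    """Draw one template for the given caption type.

    Pass a seeded random.Random for reproducible draws; otherwise the
    module-level generator is used.
    """
    chooser = rng if rng is not None else random
    return chooser.choice(PROMPT_TEMPLATES[caption_type])


print(pick_prompt("detailed", random.Random(0)))
```

Passing a seeded generator keeps runs reproducible when comparing models, while the default draw varies the wording across documents.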

Usage Examples

import os
from lmms_eval.tasks.vdc.utils import (
    vdc_doc_to_visual,
    vdc_doc_to_text_detailed,
    vdc_process_results_generic,
    llmms_eval
)

# Set up environment
os.environ["HF_HOME"] = "/path/to/cache"

# Load video path
doc = {
    "video_name": "sample_video",
    "caption": "A person walks through a park on a sunny day",
    "qa_list": [
        {"question": "What is the person doing?", "answer": "walking"},
        {"question": "What is the weather?", "answer": "sunny"}
    ]
}
video_paths = vdc_doc_to_visual(doc)
print(video_paths[0])  # /path/to/cache/Test_Videos/sample_video.mp4

# Generate detailed caption prompt (randomized)
prompt = vdc_doc_to_text_detailed(doc)
print(prompt)
# "Please imagine the video based on the sequence of frames, and provide
# a faithfully detailed description of this video in more than three sentences."

# Process model results with the LLM judge (requires the SGLang server on port 30000)
model_caption = [
    "A person is walking through a green park with trees. "
    "The weather appears sunny with clear skies. "
    "The person walks at a steady pace along a paved path."
]
results = vdc_process_results_generic(doc, model_caption)
print(results["llmms_eval_score"]["score"])  # e.g., 4.5 (out of 5)
print(results["llmms_eval_acc"]["acc"])  # e.g., 1.0 (both QAs correct)

# Aggregate scores across dataset
all_scores = [
    {"video_name": "vid1", "score": 4.5, "acc": 1.0},
    {"video_name": "vid2", "score": 3.8, "acc": 0.8},
    {"video_name": "vid3", "score": 4.2, "acc": 0.9}
]
avg_score = vdc_aggregate_score(all_scores, args=None)
avg_acc = vdc_aggregate_acc(all_scores, args=None)
print(f"Average Score: {avg_score:.2f}")  # 4.17
print(f"Average Accuracy: {avg_acc:.2f}")  # 0.90

# Direct LLM evaluation (requires SGLang server)
from sglang import set_default_backend, RuntimeEndpoint
set_default_backend(RuntimeEndpoint("http://localhost:30000"))

eval_data = {
    "video_name": "sample_video",
    "pred": model_caption[0],
    "qa_list": doc["qa_list"]
}
eval_result = llmms_eval(eval_data)
print(f"Case Score: {eval_result['score']:.2f}")
print(f"Case Accuracy: {eval_result['acc']:.2f}")
