
Principle:EvolvingLMMs Lab Lmms eval Model Inference

From Leeroopedia
Knowledge Sources
Domains: Evaluation, Model_Inference
Last Updated: 2026-02-14 00:00 GMT

Overview

Model inference is the process of dispatching constructed evaluation requests to a model and collecting its outputs, supporting multiple request types including open-ended generation, log-likelihood scoring, and multi-round dialog.

Description

After request construction produces a list of Instance objects for each task, the evaluation framework executes them against the model. This is the step where the model processes inputs and produces outputs. The framework groups requests by type and dispatches each group to the method of the same name on the model object, using dynamic attribute lookup.

The lmms abstract base class defines three abstract methods that every model implementation must provide:

  • generate_until -- Open-ended text generation given a context and stopping criteria. Used for tasks like VQA, captioning, and instruction following where the model must produce free-form text.
  • loglikelihood -- Compute the log probability of a continuation given a context. Used for multiple-choice tasks and perplexity evaluation where the model scores candidate answers.
  • generate_until_multi_round -- Multi-round dialog generation where subsequent rounds can condition on previous model outputs. Used for interactive evaluation protocols.
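
The interface contract above can be sketched as a small abstract class. This is a minimal illustration, not the library's code: `SketchLMM` and `EchoModel` are hypothetical names, and the return shapes (a string per generation request, a `(log-probability, is_greedy)` pair per scoring request) are assumptions made for the example.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class SketchLMM(ABC):
    """Hypothetical stand-in for the abstract base class described above."""

    @abstractmethod
    def generate_until(self, requests: list) -> List[str]:
        """Return one generated string per request."""

    @abstractmethod
    def loglikelihood(self, requests: list) -> List[Tuple[float, bool]]:
        """Return an assumed (log-probability, is_greedy) pair per request."""

    @abstractmethod
    def generate_until_multi_round(self, requests: list) -> List[str]:
        """Return one final string per multi-round request."""


class EchoModel(SketchLMM):
    """Toy model with no weights, used only to show the contract."""

    def generate_until(self, requests):
        return [f"echo: {r}" for r in requests]

    def loglikelihood(self, requests):
        # Fake score: longer inputs get a lower log-probability.
        return [(-float(len(str(r))), False) for r in requests]

    def generate_until_multi_round(self, requests):
        return [f"final: {r}" for r in requests]


model = EchoModel()
outputs = model.generate_until(["What is in the image?"])
```

Because all three methods are abstract, a subclass that omits any of them cannot be instantiated, which surfaces an incomplete implementation before any request is dispatched.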

The dispatch mechanism in the evaluator is:

resps = getattr(lm, reqtype)(cloned_reqs)

The request type string (e.g., "generate_until") is therefore the name of the method looked up on the model object at run time, so a model that does not define that method fails with an AttributeError at dispatch. For multiple-choice tasks, the request type is normalized to "loglikelihood" even though the task's output type is "multiple_choice".
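
A minimal sketch of name-based dispatch with the multiple-choice normalization, assuming a toy model. `ToyModel`, `normalize_reqtype`, and `dispatch` are hypothetical helpers written for this example; only the `getattr(lm, reqtype)(...)` call mirrors the line shown above.

```python
class ToyModel:
    """Toy model that supports two of the three request types."""

    def generate_until(self, requests):
        return ["text" for _ in requests]

    def loglikelihood(self, requests):
        return [(-1.0, False) for _ in requests]


def normalize_reqtype(output_type: str) -> str:
    # Assumption for this sketch: only "multiple_choice" needs remapping.
    return "loglikelihood" if output_type == "multiple_choice" else output_type


def dispatch(lm, output_type: str, reqs: list) -> list:
    reqtype = normalize_reqtype(output_type)
    # Look the method up by name; report unsupported types explicitly.
    method = getattr(lm, reqtype, None)
    if method is None:
        raise NotImplementedError(
            f"{type(lm).__name__} does not implement {reqtype!r}"
        )
    return method(reqs)


lm = ToyModel()
scores = dispatch(lm, "multiple_choice", ["choice A", "choice B"])
texts = dispatch(lm, "generate_until", ["prompt"])
```

Passing a default to `getattr` is a choice made for this sketch so that an unsupported request type produces a descriptive error; the bare `getattr(lm, reqtype)` form raises AttributeError instead.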

Usage

Use model inference whenever:

  • You are running an evaluation pipeline and need to execute model forward passes.
  • You are implementing a new model and need to understand the interface contract.
  • You need to add caching around inference to avoid re-computing results.
  • You are debugging model outputs by inspecting the responses attached to Instance objects.
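
As one illustration of caching around inference, the sketch below wraps a model so that repeated requests are answered from memory. It assumes requests are hashable (plain strings here) and that decoding is deterministic; `CachedGenerator` and `UpperModel` are hypothetical names, not part of the framework.

```python
class CachedGenerator:
    """Hypothetical wrapper that memoizes generate_until results."""

    def __init__(self, lm):
        self.lm = lm
        self.cache = {}

    def generate_until(self, requests):
        # dict.fromkeys removes duplicates while keeping first-seen order.
        missing = [r for r in dict.fromkeys(requests) if r not in self.cache]
        if missing:
            # Only requests not seen before reach the wrapped model.
            for req, resp in zip(missing, self.lm.generate_until(missing)):
                self.cache[req] = resp
        return [self.cache[r] for r in requests]


class UpperModel:
    """Toy model that counts how many requests it actually processed."""

    def __init__(self):
        self.seen = 0

    def generate_until(self, requests):
        self.seen += len(requests)
        return [r.upper() for r in requests]


base = UpperModel()
cached = CachedGenerator(base)
first = cached.generate_until(["a", "b", "a"])
second = cached.generate_until(["b", "c"])
```

A cache of this kind returns the same response for identical requests, so it would collapse the distinct samples expected when a request is repeated for sampling-based evaluation.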

Theoretical Basis

The inference dispatch follows a request-response pattern with batching and padding for distributed execution:

Step 1 -- Request Grouping:

All instances across all tasks are grouped by request type into a dictionary:

requests = {
    "generate_until": [inst1, inst2, ...],
    "loglikelihood": [inst3, inst4, ...],
}
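
The grouping step can be sketched with a `defaultdict`. `ToyInstance` is a hypothetical stand-in for the framework's Instance object, reduced to the two fields this example needs.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ToyInstance:
    """Hypothetical stand-in for an Instance; only two fields are modeled."""
    request_type: str
    doc_id: int


def group_by_request_type(instances):
    requests = defaultdict(list)
    for inst in instances:
        # Instances of the same type stay in their original order.
        requests[inst.request_type].append(inst)
    return dict(requests)


instances = [
    ToyInstance("generate_until", 0),
    ToyInstance("loglikelihood", 1),
    ToyInstance("generate_until", 2),
]
requests = group_by_request_type(instances)
```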

Step 2 -- Repetition Expansion:

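Each request appears K = req.repeats times in the expanded list to support sampling-based evaluation. The list repetition used here repeats references to the same Instance object rather than creating copies: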

cloned_reqs = []
for req in reqs:
    cloned_reqs.extend([req] * req.repeats)
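
A runnable version of the expansion, using a hypothetical `ToyInstance` with a `repeats` field, shows the resulting order and that the repeated entries are the same object:

```python
from dataclasses import dataclass, field


@dataclass
class ToyInstance:
    """Hypothetical stand-in for an Instance."""
    doc_id: int
    repeats: int = 1
    resps: list = field(default_factory=list)


def expand_repeats(reqs):
    cloned_reqs = []
    for req in reqs:
        # [req] * n repeats the reference n times; no copy is made.
        cloned_reqs.extend([req] * req.repeats)
    return cloned_reqs


reqs = [ToyInstance(0), ToyInstance(1, repeats=3)]
cloned_reqs = expand_repeats(reqs)
doc_ids = [r.doc_id for r in cloned_reqs]
```

Here `doc_ids` is `[0, 1, 1, 1]`, and the last three entries all refer to the one instance with `doc_id` 1.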

Step 3 -- Padding for Distributed Execution:

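In multi-GPU settings using FSDP or DDP, all ranks must process the same number of batches; otherwise a rank that finishes early can leave the others waiting in a collective operation. The framework therefore pads the request list on ranks with fewer instances, up to the largest per-rank count: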

numpad = max(gathered_counts) - gathered_counts[rank]
cloned_reqs += [padding_req] * numpad
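
The padding arithmetic can be checked without any GPUs by simulating the per-rank counts. In this sketch `gathered_counts` is a plain list (in a real run it would come from a collective gather across ranks), `pad_requests` is a hypothetical helper, and reusing each rank's last request as the padding request is an assumption made for illustration.

```python
def pad_requests(cloned_reqs, gathered_counts, rank, padding_req):
    # Number of extra requests this rank needs to match the busiest rank.
    numpad = max(gathered_counts) - gathered_counts[rank]
    return cloned_reqs + [padding_req] * numpad


# Simulated request counts for three ranks.
gathered_counts = [5, 3, 4]
per_rank = [
    [f"rank{rank}-req{i}" for i in range(n)]
    for rank, n in enumerate(gathered_counts)
]

padded = [
    pad_requests(reqs, gathered_counts, rank, padding_req=reqs[-1])
    for rank, reqs in enumerate(per_rank)
]
lengths = [len(p) for p in padded]
```

After padding, every rank holds five requests; rank 0 is unchanged, while ranks 1 and 2 receive two and one padding entries respectively.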

Step 4 -- Model Execution:

Requests are dispatched to the model method:

resps = getattr(lm, reqtype)(cloned_reqs)

Step 5 -- Response Collection:

Each response is attached to its request by position, so the model must return responses in the same order as the requests it received:

for response, request in zip(resps, cloned_reqs):
    request.resps.append(response)
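
Combined with repetition expansion, this loop leaves each instance holding one response per repeat. The sketch below assumes a hypothetical `ToyInstance` and a `CountingModel` that labels each output with a running counter so the samples can be told apart.

```python
from dataclasses import dataclass, field


@dataclass
class ToyInstance:
    """Hypothetical stand-in for an Instance."""
    doc_id: int
    repeats: int = 1
    resps: list = field(default_factory=list)


class CountingModel:
    """Toy model whose outputs carry a running call counter."""

    def __init__(self):
        self.calls = 0

    def generate_until(self, requests):
        out = []
        for req in requests:
            self.calls += 1
            out.append(f"doc{req.doc_id}-sample{self.calls}")
        return out


reqs = [ToyInstance(0), ToyInstance(1, repeats=2)]
# Same references repeated, matching the expansion step.
cloned_reqs = [req for req in reqs for _ in range(req.repeats)]

resps = CountingModel().generate_until(cloned_reqs)
for response, request in zip(resps, cloned_reqs):
    request.resps.append(response)
```

Because the repeated entries refer to the same object, `reqs[1].resps` ends up with two responses and `reqs[0].resps` with one.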

After inference completes, lm.clean() releases the model's CUDA memory so that the GPU is available for any subsequent metric computation (e.g., LLM-as-judge).
