Principle: InternLM LMDeploy Batch Text Generation
| Knowledge Sources | |
|---|---|
| Domains | LLM_Inference, Text_Generation |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
A mechanism for processing multiple text generation requests simultaneously through continuous batching with support for both blocking and streaming output modes.
Description
Batch Text Generation is the core inference operation that takes one or more prompts and produces model completions. The key design decisions are:
- Continuous batching: New requests can join the batch without waiting for all current requests to finish
- Prompt sorting: Requests are sorted by length for efficient GPU utilization before being submitted to the engine
- Dual output modes: Blocking mode returns complete responses; streaming mode yields tokens as they are generated
- Per-request configuration: Each prompt in a batch can have its own GenerationConfig (temperature, top_p, etc.)
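The per-request configuration idea above can be sketched as follows. The `GenerationConfig` fields and the prompt/config pairing shown here are illustrative assumptions for the sketch, not LMDeploy's exact API surface.

```python
from dataclasses import dataclass

@dataclass
class GenerationConfig:
    # Illustrative per-request sampling knobs (assumed field names)
    temperature: float = 1.0
    top_p: float = 1.0
    top_k: int = 0
    max_new_tokens: int = 128

prompts = ["Hi", "Summarize this article", "Write a haiku"]
# Each prompt in the batch carries its own sampling configuration
configs = [
    GenerationConfig(temperature=0.0),              # deterministic answer
    GenerationConfig(temperature=0.7, top_p=0.9),   # mildly creative
    GenerationConfig(temperature=1.2, top_k=50),    # more diverse output
]
requests = list(zip(prompts, configs))
```

Pairing configs with prompts positionally keeps the batch interface simple while still allowing every request to sample differently.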
The pipeline maintains a synchronous interface by running the async engine in a background thread with its own event loop, bridging sync and async paradigms.
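The sync-over-async bridge described above can be sketched with the standard library alone; the engine call here is a stand-in coroutine, and the class layout is an assumption about the design rather than LMDeploy's actual implementation.

```python
import asyncio
import threading

class SyncPipeline:
    """Sketch: run the async engine's event loop in a background daemon
    thread so callers keep a plain synchronous interface."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    async def _generate(self, prompt):
        # Stand-in for the real async engine call
        await asyncio.sleep(0.01)
        return prompt + " -> completion"

    def generate(self, prompt):
        # Bridge sync and async: schedule the coroutine on the background
        # loop, then block the caller until the result is ready
        future = asyncio.run_coroutine_threadsafe(self._generate(prompt), self._loop)
        return future.result()

pipe = SyncPipeline()
result = pipe.generate("hello")
```

`asyncio.run_coroutine_threadsafe` is the standard way to submit work to an event loop owned by another thread; the returned `concurrent.futures.Future` gives the blocking `.result()` that makes the pipeline feel synchronous.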
Usage
Use this when performing offline batch inference (processing many prompts at once) or when building interactive applications. The blocking mode suits batch workloads; the streaming mode suits real-time chat interfaces.
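The two output modes can be illustrated with a toy generator; the token stream here is simulated with a fixed tuple rather than produced by a real model.

```python
TOKENS = ("Once", "upon", "a", "time")

def generate_blocking(prompt, tokens=TOKENS):
    # Blocking mode: return the complete response in one call
    return " ".join(tokens)

def generate_streaming(prompt, tokens=TOKENS):
    # Streaming mode: yield each token as soon as it is "generated"
    for tok in tokens:
        yield tok

full = generate_blocking("Tell me a story")
streamed = " ".join(generate_streaming("Tell me a story"))
```

Both modes deliver the same content; the difference is latency profile: blocking is convenient for offline batches, while streaming lets a chat UI render tokens as they arrive.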
Theoretical Basis
The generation process uses autoregressive decoding with configurable sampling. At each step, the next-token distribution is a temperature-scaled softmax over the model's logits $z$:

$$P(x_t = i \mid x_{<t}) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

where $T$ is the temperature. Sampling strategies include:
- Greedy: Always pick the highest probability token
- Top-k: Sample from the k most likely tokens
- Top-p (nucleus): Sample from the smallest set of tokens whose cumulative probability exceeds p
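The strategies above can be sketched directly over a probability vector; this is a minimal reference implementation, not LMDeploy's optimized sampler.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax (numerically stabilized by subtracting the max)
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(probs):
    # Greedy: index of the highest-probability token
    return max(range(len(probs)), key=lambda i: probs[i])

def top_k_candidates(probs, k):
    # Top-k: indices of the k most likely tokens
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

def top_p_candidates(probs, p):
    # Nucleus: smallest set of tokens whose cumulative probability reaches p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cum = [], 0.0
    for i in order:
        chosen.append(i)
        cum += probs[i]
        if cum >= p:
            break
    return chosen

probs = softmax([2.0, 1.0, 0.5, -1.0])
```

In practice a token is then drawn at random from the surviving candidate set (after renormalizing their probabilities); greedy decoding is the $T \to 0$ limit.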
Pseudo-code:
```python
# Abstract batch generation: sort by length, submit to the engine,
# then restore the original request order
async def batch_generate(prompts, config):
    order = sorted(range(len(prompts)), key=lambda i: len(prompts[i]))
    futures = [engine.submit(prompts[i], config) for i in order]
    results = await gather(futures)
    restored = [None] * len(prompts)
    for idx, res in zip(order, results):
        restored[idx] = res
    return restored
```
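A runnable version of the pseudo-code above, using a mock async engine; the `MockEngine` class and its `submit` method are stand-ins for illustration, not LMDeploy's real interface.

```python
import asyncio

class MockEngine:
    async def submit(self, prompt, config=None):
        # Pretend longer prompts take longer to decode
        await asyncio.sleep(len(prompt) / 1000)
        return f"{prompt}!"

async def batch_generate(engine, prompts, config=None):
    # Sort indices by prompt length, submit all requests concurrently,
    # then scatter results back into the original order
    order = sorted(range(len(prompts)), key=lambda i: len(prompts[i]))
    tasks = [engine.submit(prompts[i], config) for i in order]
    results = await asyncio.gather(*tasks)
    restored = [None] * len(prompts)
    for idx, res in zip(order, results):
        restored[idx] = res
    return restored

out = asyncio.run(batch_generate(MockEngine(), ["bb", "a", "ccc"]))
```

Tracking the permutation as a list of indices makes the "unsort" step a simple scatter, so callers always receive completions in the order they supplied prompts.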