
Principle:InternLM Lmdeploy Batch Text Generation

From Leeroopedia


Knowledge Sources
Domains LLM_Inference, Text_Generation
Last Updated 2026-02-07 15:00 GMT

Overview

A mechanism for processing multiple text generation requests simultaneously through continuous batching with support for both blocking and streaming output modes.

Description

Batch Text Generation is the core inference operation that takes one or more prompts and produces model completions. The key design decisions are:

  • Continuous batching: New requests can join the batch without waiting for all current requests to finish
  • Prompt sorting: Requests are sorted by length for efficient GPU utilization before being submitted to the engine
  • Dual output modes: Blocking mode returns complete responses; streaming mode yields tokens as they are generated
  • Per-request configuration: Each prompt in a batch can have its own GenerationConfig (temperature, top_p, etc.)
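The continuous-batching idea in the first bullet can be illustrated with a toy scheduler. This is a minimal sketch, not LMDeploy's actual engine: requests are modeled as (id, tokens-to-generate) pairs, and the point is that a finished request frees its slot immediately so a waiting request can join mid-flight.

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy continuous-batching loop over (id, n_tokens_to_generate) requests.

    Finished requests leave the batch and waiting requests join immediately,
    instead of the whole batch draining first (static batching).
    """
    waiting = deque(requests)
    active = {}   # id -> tokens remaining
    steps = []    # batch composition at each decode step, for illustration
    while waiting or active:
        # Admit new requests as soon as slots free up
        while waiting and len(active) < max_batch:
            rid, n = waiting.popleft()
            active[rid] = n
        steps.append(sorted(active))
        # One decode step: every active request emits one token
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]   # slot freed mid-flight
    return steps
```

For example, `continuous_batching([("a", 2), ("b", 1), ("c", 3)], max_batch=2)` shows "c" joining the batch the step after "b" finishes, while "a" is still running.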

The pipeline maintains a synchronous interface by running the async engine in a background thread with its own event loop, bridging sync and async paradigms.
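The sync-over-async bridge described above can be sketched in a few lines. This is an illustrative pattern, not LMDeploy's code: a dedicated thread runs the event loop, and `asyncio.run_coroutine_threadsafe` lets synchronous callers block on coroutine results. The `_generate` coroutine is a stand-in for the real async engine.

```python
import asyncio
import threading

class SyncPipeline:
    """Run an asyncio engine on a background thread with its own event loop,
    exposing a blocking call() to synchronous callers."""

    def __init__(self):
        self.loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self.loop.run_forever, daemon=True)
        self._thread.start()

    async def _generate(self, prompt):
        await asyncio.sleep(0)          # stand-in for the async engine's work
        return prompt + " -> completion"

    def call(self, prompt):
        # Submit the coroutine to the background loop; block until it resolves
        future = asyncio.run_coroutine_threadsafe(self._generate(prompt), self.loop)
        return future.result()

    def close(self):
        self.loop.call_soon_threadsafe(self.loop.stop)
        self._thread.join()
```

The caller never touches `await`: the future returned by `run_coroutine_threadsafe` is a `concurrent.futures.Future`, so `.result()` blocks the calling thread while the loop thread does the work.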

Usage

Use this when performing offline batch inference (processing many prompts at once) or when building interactive applications. The blocking mode suits batch workloads; the streaming mode suits real-time chat interfaces.
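The two output modes can be contrasted with a toy pair of functions (the hard-coded tokens stand in for real decode steps; this is not the LMDeploy API): streaming yields tokens as they arrive, while blocking consumes the same stream internally and returns the finished text.

```python
def stream_generate(prompt):
    """Streaming mode: yield tokens one at a time as they are 'generated'."""
    for token in ("Hello", ",", " world"):   # stand-in for real decode steps
        yield token

def generate(prompt):
    """Blocking mode: drain the stream internally, return the full response."""
    return "".join(stream_generate(prompt))
```

A chat UI would iterate `for tok in stream_generate(prompt): display(tok)` to show partial output, whereas a batch job would just collect `generate(prompt)` per prompt.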

Theoretical Basis

The generation process uses autoregressive decoding with configurable sampling:

P(x_t | x_{<t}) = softmax(logits_t / T)

where T is the temperature: lower T sharpens the distribution (T → 0 approaches greedy decoding) and higher T flattens it. Sampling strategies include:

  • Greedy: Always pick the highest probability token
  • Top-k: Sample from the k most likely tokens
  • Top-p (nucleus): Sample from the smallest set of tokens whose cumulative probability exceeds p
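The three strategies above can be combined in one sampler. A minimal pure-Python sketch (real engines do this on GPU tensors): apply temperature inside the softmax, keep the top-k tokens, then keep the smallest prefix whose cumulative probability reaches p, and sample from what remains.

```python
import math
import random

def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Sample a token index with temperature, top-k and top-p (nucleus)
    filtering; temperature=0 means greedy decoding, top_k=0 disables top-k."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature-scaled softmax: P(x_t) = softmax(logits_t / T)
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Rank tokens by probability, most likely first
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    if top_k > 0:
        order = order[:top_k]
    # Nucleus: smallest set whose cumulative probability reaches top_p
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Renormalize over the kept set and draw one token
    total = sum(probs[i] for i in kept)
    r = rng.random() * total
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

With `temperature=0`, `top_k=1`, or a tiny `top_p`, the sampler degenerates to picking the single most likely token, which is a convenient sanity check.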

Pseudo-code:

# Abstract batch generation (sort_by_length, unsort, gather and the
# engine handle are assumed helpers)
async def batch_generate(prompts, config):
    sorted_prompts = sort_by_length(prompts)
    futures = [engine.submit(p, config) for p in sorted_prompts]
    results = await gather(futures)
    return unsort(results)  # Restore original order
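The pseudo-code above can be made concrete with asyncio. In this self-contained sketch, `fake_engine_submit` stands in for the real engine call, sorting is done over indices so the original order can be restored by scattering results back.

```python
import asyncio

async def fake_engine_submit(prompt, config):
    """Stand-in for an async inference engine call."""
    await asyncio.sleep(0)
    return prompt.upper()

async def batch_generate(prompts, config=None):
    # Sort indices by prompt length so similar-length requests batch together
    order = sorted(range(len(prompts)), key=lambda i: len(prompts[i]))
    tasks = [fake_engine_submit(prompts[i], config) for i in order]
    results = await asyncio.gather(*tasks)
    # Scatter results back to the original request order
    restored = [None] * len(prompts)
    for pos, idx in enumerate(order):
        restored[idx] = results[pos]
    return restored
```

Running `asyncio.run(batch_generate(["bbb", "a", "cc"]))` returns completions in the caller's original order even though the engine processed the prompts sorted by length.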

Related Pages

Implemented By
