Principle: InternLM LMDeploy Batch Text Generation
| Knowledge Sources | |
|---|---|
| Domains | LLM_Inference, Text_Generation |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
A mechanism for processing multiple text generation requests simultaneously through continuous batching with support for both blocking and streaming output modes.
Description
Batch Text Generation is the core inference operation that takes one or more prompts and produces model completions. The key design decisions are:
- Continuous batching: New requests can join the batch without waiting for all current requests to finish
- Prompt sorting: Requests are sorted by length for efficient GPU utilization before being submitted to the engine
- Dual output modes: Blocking mode returns complete responses; streaming mode yields tokens as they are generated
- Per-request configuration: Each prompt in a batch can have its own GenerationConfig (temperature, top_p, etc.)
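The per-request configuration idea above can be sketched as follows. The `GenerationConfig` fields and the prompt/config pairing shown here are illustrative assumptions for the sketch, not LMDeploy's exact API surface.

```python
from dataclasses import dataclass

@dataclass
class GenerationConfig:
    # Illustrative per-request sampling knobs (assumed field names)
    temperature: float = 1.0
    top_p: float = 1.0
    top_k: int = 0
    max_new_tokens: int = 128

prompts = ["Hi", "Summarize this article", "Write a haiku"]
# Each prompt in the batch carries its own sampling configuration
configs = [
    GenerationConfig(temperature=0.0),              # deterministic answer
    GenerationConfig(temperature=0.7, top_p=0.9),   # mildly creative
    GenerationConfig(temperature=1.2, top_k=50),    # more diverse output
]
requests = list(zip(prompts, configs))
```

Pairing configs with prompts positionally keeps the batch interface simple while still allowing every request to sample differently.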
The pipeline maintains a synchronous interface by running the async engine in a background thread with its own event loop, bridging sync and async paradigms.
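The sync-over-async bridge described above can be sketched with the standard library alone; the engine call here is a stand-in coroutine, and the class layout is an assumption about the design rather than LMDeploy's actual implementation.

```python
import asyncio
import threading

class SyncPipeline:
    """Sketch: run the async engine's event loop in a background daemon
    thread so callers keep a plain synchronous interface."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    async def _generate(self, prompt):
        # Stand-in for the real async engine call
        await asyncio.sleep(0.01)
        return prompt + " -> completion"

    def generate(self, prompt):
        # Bridge sync and async: schedule the coroutine on the background
        # loop, then block the caller until the result is ready
        future = asyncio.run_coroutine_threadsafe(self._generate(prompt), self._loop)
        return future.result()

pipe = SyncPipeline()
result = pipe.generate("hello")
```

`asyncio.run_coroutine_threadsafe` is the standard way to submit work to an event loop owned by another thread; the returned `concurrent.futures.Future` gives the blocking `.result()` that makes the pipeline feel synchronous.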
Usage
Use this when performing offline batch inference (processing many prompts at once) or when building interactive applications. The blocking mode suits batch workloads; the streaming mode suits real-time chat interfaces.
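The two output modes can be illustrated with a toy generator; the token stream here is simulated with a fixed tuple rather than produced by a real model.

```python
TOKENS = ("Once", "upon", "a", "time")

def generate_blocking(prompt, tokens=TOKENS):
    # Blocking mode: return the complete response in one call
    return " ".join(tokens)

def generate_streaming(prompt, tokens=TOKENS):
    # Streaming mode: yield each token as soon as it is "generated"
    for tok in tokens:
        yield tok

full = generate_blocking("Tell me a story")
streamed = " ".join(generate_streaming("Tell me a story"))
```

Both modes deliver the same content; the difference is latency profile: blocking is convenient for offline batches, while streaming lets a chat UI render tokens as they arrive.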
Theoretical Basis
The generation process uses autoregressive decoding with configurable sampling. At each step, the next-token distribution is a temperature-scaled softmax over the model's logits $z$:

$$P(x_t = i \mid x_{<t}) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

where $T$ is the temperature. Sampling strategies include:
- Greedy: Always pick the highest probability token
- Top-k: Sample from the k most likely tokens
- Top-p (nucleus): Sample from the smallest set of tokens whose cumulative probability exceeds p
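The strategies above can be sketched directly over a probability vector; this is a minimal reference implementation, not LMDeploy's optimized sampler.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax (numerically stabilized by subtracting the max)
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(probs):
    # Greedy: index of the highest-probability token
    return max(range(len(probs)), key=lambda i: probs[i])

def top_k_candidates(probs, k):
    # Top-k: indices of the k most likely tokens
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

def top_p_candidates(probs, p):
    # Nucleus: smallest set of tokens whose cumulative probability reaches p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cum = [], 0.0
    for i in order:
        chosen.append(i)
        cum += probs[i]
        if cum >= p:
            break
    return chosen

probs = softmax([2.0, 1.0, 0.5, -1.0])
```

In practice a token is then drawn at random from the surviving candidate set (after renormalizing their probabilities); greedy decoding is the $T \to 0$ limit.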
Pseudo-code:
```python
# Abstract batch generation: sort by length, submit to the engine,
# then restore the original request order
async def batch_generate(prompts, config):
    order = sorted(range(len(prompts)), key=lambda i: len(prompts[i]))
    futures = [engine.submit(prompts[i], config) for i in order]
    results = await gather(futures)
    restored = [None] * len(prompts)
    for idx, res in zip(order, results):
        restored[idx] = res
    return restored
```
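A runnable version of the pseudo-code above, using a mock async engine; the `MockEngine` class and its `submit` method are stand-ins for illustration, not LMDeploy's real interface.

```python
import asyncio

class MockEngine:
    async def submit(self, prompt, config=None):
        # Pretend longer prompts take longer to decode
        await asyncio.sleep(len(prompt) / 1000)
        return f"{prompt}!"

async def batch_generate(engine, prompts, config=None):
    # Sort indices by prompt length, submit all requests concurrently,
    # then scatter results back into the original order
    order = sorted(range(len(prompts)), key=lambda i: len(prompts[i]))
    tasks = [engine.submit(prompts[i], config) for i in order]
    results = await asyncio.gather(*tasks)
    restored = [None] * len(prompts)
    for idx, res in zip(order, results):
        restored[idx] = res
    return restored

out = asyncio.run(batch_generate(MockEngine(), ["bb", "a", "ccc"]))
```

Tracking the permutation as a list of indices makes the "unsort" step a simple scatter, so callers always receive completions in the order they supplied prompts.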