Implementation: InternLM LMDeploy Pipeline Call
| Knowledge Sources | |
|---|---|
| Domains | LLM_Inference, Text_Generation |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
Concrete tool for executing batch text generation through the Pipeline callable interface provided by the LMDeploy library.
Description
The Pipeline.__call__() method (and its underlying infer() and stream_infer() methods) is the primary interface for generating text. It accepts single or batched prompts in multiple formats, submits them to the async engine, and returns Response objects. Internally, prompts are sorted by length for GPU batching efficiency, and results are reordered to match the original input order.
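The accepted prompt shapes follow directly from the type annotations in the signature below. A minimal sketch, using the same illustrative model path as the examples on this page:

```python
from lmdeploy import pipeline

pipe = pipeline('internlm/internlm2_5-7b-chat')  # illustrative model path

# 1) Single string prompt -> one Response
r1 = pipe('What is LMDeploy?')

# 2) Batch of string prompts -> List[Response], returned in input order
r2 = pipe(['Explain KV cache.', 'Explain paged attention.'])

# 3) One conversation in OpenAI message format (List[Dict]) -> one Response
r3 = pipe([{'role': 'user', 'content': 'What is LMDeploy?'}])

# 4) Batch of conversations (List[List[Dict]]) -> List[Response]
r4 = pipe([
    [{'role': 'user', 'content': 'Explain KV cache.'}],
    [{'role': 'user', 'content': 'Explain paged attention.'}],
])
```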
Usage
Call the Pipeline object directly with prompts after initialization. Use the blocking __call__() path for batch processing and stream_infer() for real-time streaming applications; both are demonstrated under Usage Examples below.
Code Reference
Source Location
- Repository: lmdeploy
- File: lmdeploy/pipeline.py
- Lines: L83-122 (infer), L128-162 (stream_infer), L305-309 (__call__)
Signature
```python
class Pipeline:

    def __call__(self,
                 prompts: List[str] | str | List[Dict] | List[List[Dict]],
                 gen_config: GenerationConfig | List[GenerationConfig] | None = None,
                 **kwargs) -> Response | List[Response]:
        return self.infer(prompts, gen_config=gen_config, **kwargs)

    def infer(self, prompts, gen_config=None, do_preprocess=None,
              adapter_name=None, **kwargs) -> List[Response]:
        ...

    def stream_infer(self, prompts, gen_config=None, do_preprocess=None,
                     adapter_name=None, stream_response=True,
                     **kwargs) -> Iterator[Iterator[Response]]:
        ...
```
Import
```python
from lmdeploy import pipeline, GenerationConfig
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| prompts | str, List[str], List[Dict], or List[List[Dict]] | Yes | Single or batch prompts in string or OpenAI message format |
| gen_config | GenerationConfig or List[GenerationConfig] | No | Sampling parameters (per-prompt or shared) |
| do_preprocess | bool | No | Whether to apply chat template (default: True) |
| adapter_name | str | No | LoRA adapter name to use for this request |
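A brief sketch combining these inputs. A gen_config list supplies per-prompt sampling parameters matched by position; the adapter name 'my-adapter' is hypothetical and must correspond to a LoRA adapter loaded when the pipeline was constructed:

```python
from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('internlm/internlm2_5-7b-chat')

prompts = ['Summarize LMDeploy.', 'Translate "hello" to French.']

# Per-prompt sampling: one GenerationConfig per prompt, matched by position.
configs = [
    GenerationConfig(max_new_tokens=128, temperature=0.2),
    GenerationConfig(max_new_tokens=32, temperature=0.9),
]
responses = pipe(prompts, gen_config=configs)

# do_preprocess=False skips the chat template and sends prompts verbatim;
# adapter_name selects a LoRA adapter ('my-adapter' is hypothetical).
raw = pipe.infer(prompts, do_preprocess=False, adapter_name='my-adapter')
```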
Outputs
| Mode | Type | Description |
|---|---|---|
| Blocking | Response or List[Response] | Generated text with metadata (text, token counts, finish_reason) |
| Streaming | Iterator[Iterator[Response]] | Nested iterators yielding partial responses |
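A hedged sketch of consuming the streaming contract for a batch. It assumes one inner iterator per prompt, yielded in input order, and treats each response.text as an incremental delta (matching the streaming example below); confirm both against stream_infer() in lmdeploy/pipeline.py before relying on them:

```python
from lmdeploy import pipeline

pipe = pipeline('internlm/internlm2_5-7b-chat')
prompts = ['Explain KV cache.', 'Explain paged attention.']

# Outer iterator: one inner iterator per prompt (assumed input order).
# Inner iterator: incremental Response chunks.
for i, stream_outputs in enumerate(pipe.stream_infer(prompts)):
    chunks = [response.text for response in stream_outputs]
    print(f'Prompt {i}: {"".join(chunks)}')
```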
Usage Examples
Batch Inference
```python
from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('internlm/internlm2_5-7b-chat')

# Batch of prompts
prompts = [
    'Explain neural networks briefly.',
    'Write a Python hello world.',
    'What is the capital of France?'
]
gen_config = GenerationConfig(max_new_tokens=256, temperature=0.7)
responses = pipe(prompts, gen_config=gen_config)

for i, resp in enumerate(responses):
    print(f"Prompt {i}: {resp.text[:100]}...")
    print(f"  Tokens: {resp.generate_token_len}, Reason: {resp.finish_reason}")
```
Streaming Output
```python
from lmdeploy import pipeline

pipe = pipeline('internlm/internlm2_5-7b-chat')

for stream_outputs in pipe.stream_infer(['Tell me a story']):
    for response in stream_outputs:
        print(response.text, end='', flush=True)
    print()
```