Implementation: InternLM LMDeploy Pipeline Call
| Knowledge Sources | |
|---|---|
| Domains | LLM_Inference, Text_Generation |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
Concrete tool for executing batch text generation through the Pipeline callable interface provided by the LMDeploy library.
Description
The Pipeline.__call__() method (and its underlying infer() and stream_infer() methods) is the primary interface for generating text. It accepts single or batched prompts in multiple formats, submits them to the async engine, and returns Response objects. Internally, prompts are sorted by length for GPU batching efficiency, and results are reordered to match the original input order.
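The accepted prompt shapes follow directly from the type annotations in the signature below. A minimal sketch, using the same illustrative model path as the examples on this page:

```python
from lmdeploy import pipeline

pipe = pipeline('internlm/internlm2_5-7b-chat')  # illustrative model path

# 1) Single string prompt -> one Response
r1 = pipe('What is LMDeploy?')

# 2) Batch of string prompts -> List[Response], returned in input order
r2 = pipe(['Explain KV cache.', 'Explain paged attention.'])

# 3) One conversation in OpenAI message format (List[Dict]) -> one Response
r3 = pipe([{'role': 'user', 'content': 'What is LMDeploy?'}])

# 4) Batch of conversations (List[List[Dict]]) -> List[Response]
r4 = pipe([
    [{'role': 'user', 'content': 'Explain KV cache.'}],
    [{'role': 'user', 'content': 'Explain paged attention.'}],
])
```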
Usage
Call the Pipeline object directly with prompts after initialization. Use the blocking __call__() path for batch processing and stream_infer() for real-time streaming applications; both are demonstrated under Usage Examples below.
Code Reference
Source Location
- Repository: lmdeploy
- File: lmdeploy/pipeline.py
- Lines: L83-122 (infer), L128-162 (stream_infer), L305-309 (__call__)
Signature
```python
class Pipeline:

    def __call__(self,
                 prompts: List[str] | str | List[Dict] | List[List[Dict]],
                 gen_config: GenerationConfig | List[GenerationConfig] | None = None,
                 **kwargs) -> Response | List[Response]:
        return self.infer(prompts, gen_config=gen_config, **kwargs)

    def infer(self, prompts, gen_config=None, do_preprocess=None,
              adapter_name=None, **kwargs) -> List[Response]:
        ...

    def stream_infer(self, prompts, gen_config=None, do_preprocess=None,
                     adapter_name=None, stream_response=True,
                     **kwargs) -> Iterator[Iterator[Response]]:
        ...
```
Import
```python
from lmdeploy import pipeline, GenerationConfig
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| prompts | str, List[str], List[Dict], or List[List[Dict]] | Yes | Single or batch prompts in string or OpenAI message format |
| gen_config | GenerationConfig or List[GenerationConfig] | No | Sampling parameters (per-prompt or shared) |
| do_preprocess | bool | No | Whether to apply chat template (default: True) |
| adapter_name | str | No | LoRA adapter name to use for this request |
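A brief sketch combining these inputs. A gen_config list supplies per-prompt sampling parameters matched by position; the adapter name 'my-adapter' is hypothetical and must correspond to a LoRA adapter loaded when the pipeline was constructed:

```python
from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('internlm/internlm2_5-7b-chat')

prompts = ['Summarize LMDeploy.', 'Translate "hello" to French.']

# Per-prompt sampling: one GenerationConfig per prompt, matched by position.
configs = [
    GenerationConfig(max_new_tokens=128, temperature=0.2),
    GenerationConfig(max_new_tokens=32, temperature=0.9),
]
responses = pipe(prompts, gen_config=configs)

# do_preprocess=False skips the chat template and sends prompts verbatim;
# adapter_name selects a LoRA adapter ('my-adapter' is hypothetical).
raw = pipe.infer(prompts, do_preprocess=False, adapter_name='my-adapter')
```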
Outputs
| Mode | Type | Description |
|---|---|---|
| Blocking | Response or List[Response] | Generated text with metadata (text, token counts, finish_reason) |
| Streaming | Iterator[Iterator[Response]] | Nested iterators yielding partial responses |
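A hedged sketch of consuming the streaming contract for a batch. It assumes one inner iterator per prompt, yielded in input order, and treats each response.text as an incremental delta (matching the streaming example below); confirm both against stream_infer() in lmdeploy/pipeline.py before relying on them:

```python
from lmdeploy import pipeline

pipe = pipeline('internlm/internlm2_5-7b-chat')
prompts = ['Explain KV cache.', 'Explain paged attention.']

# Outer iterator: one inner iterator per prompt (assumed input order).
# Inner iterator: incremental Response chunks.
for i, stream_outputs in enumerate(pipe.stream_infer(prompts)):
    chunks = [response.text for response in stream_outputs]
    print(f'Prompt {i}: {"".join(chunks)}')
```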
Usage Examples
Batch Inference
```python
from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('internlm/internlm2_5-7b-chat')

# Batch of prompts
prompts = [
    'Explain neural networks briefly.',
    'Write a Python hello world.',
    'What is the capital of France?'
]
gen_config = GenerationConfig(max_new_tokens=256, temperature=0.7)
responses = pipe(prompts, gen_config=gen_config)

for i, resp in enumerate(responses):
    print(f"Prompt {i}: {resp.text[:100]}...")
    print(f"  Tokens: {resp.generate_token_len}, Reason: {resp.finish_reason}")
```
Streaming Output
```python
from lmdeploy import pipeline

pipe = pipeline('internlm/internlm2_5-7b-chat')

for stream_outputs in pipe.stream_infer(['Tell me a story']):
    for response in stream_outputs:
        print(response.text, end='', flush=True)
    print()
```