Principle:InternLM Lmdeploy Response Processing

Knowledge Sources	LMDeploy
Domains	LLM_Inference, Data_Structures
Last Updated	2026-02-07 15:00 GMT

Overview

A structured data pattern that encapsulates inference outputs (generated text, token counts, finish reason, and optional logprobs) in a single response object.

Description

Response Processing defines how inference results are packaged and consumed. Each generation request produces a Response object containing:

Generated text: The decoded output string
Token counts: Both input (prompt) and output (generated) token lengths for usage tracking
Finish reason: Why generation stopped, either stop (natural end or stop word hit) or length (max tokens reached)
Token IDs: Raw output token IDs for downstream processing
Logprobs: Optional per-token log probabilities for confidence estimation
Streaming extension: Responses can be incrementally extended as tokens arrive during streaming
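The fields listed above can be pictured as a small dataclass. This is a minimal sketch for illustration, not LMDeploy's actual class: the class name SketchResponse, the helper total_tokens(), and every field name other than finish_reason and generate_token_len are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SketchResponse:
    """Simplified stand-in for a response object (hypothetical, not LMDeploy's class)."""
    text: str                                  # decoded output string
    input_token_len: int                       # number of prompt tokens (assumed name)
    generate_token_len: int                    # number of generated tokens
    finish_reason: Optional[str] = None        # 'stop', 'length', or None while unfinished
    token_ids: List[int] = field(default_factory=list)  # raw output token IDs
    # Assumed layout: one dict per generated position, mapping token id -> log probability.
    logprobs: Optional[List[Dict[int, float]]] = None

    def total_tokens(self) -> int:
        """Hypothetical helper: prompt plus generated tokens, e.g. for usage tracking."""
        return self.input_token_len + self.generate_token_len


resp = SketchResponse(
    text="Hello",
    input_token_len=4,
    generate_token_len=1,
    finish_reason="stop",
    token_ids=[9906],
)
print(resp.total_tokens(), resp.finish_reason)
```

Keeping logprobs optional with a None default reflects that log probabilities are only present when they were requested.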

Resource cleanup is handled by the Pipeline that produces the Response, not by the Response itself: the Pipeline supports the context manager protocol and exposes a close() method, which releases GPU memory and stops background threads.
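The cleanup behaviour can be illustrated with a stand-in class. The sketch below is an assumption-laden toy: SketchPipeline, its closed flag, and its echo behaviour are hypothetical and only show how close() and the context manager protocol fit together in general Python terms.

```python
class SketchPipeline:
    """Hypothetical stand-in for an inference pipeline (not LMDeploy's API)."""

    def __init__(self) -> None:
        self.closed = False

    def __call__(self, prompt: str) -> str:
        if self.closed:
            raise RuntimeError("pipeline is closed")
        return f"echo: {prompt}"

    def close(self) -> None:
        # A real engine would release GPU memory and stop background threads here.
        self.closed = True

    def __enter__(self) -> "SketchPipeline":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.close()
        return False  # never suppress exceptions raised inside the with-block


# close() runs automatically when the with-block exits, even on error.
with SketchPipeline() as pipe:
    result = pipe("hi")
```

Using the with-statement means cleanup does not depend on the caller remembering to call close() on every exit path.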

Usage

Use this when extracting results from the pipeline. Check finish_reason to determine if the output was truncated. Use token counts for billing/monitoring. Use logprobs for confidence-based filtering or best-of-n selection.
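As one way to combine these checks, the sketch below drops truncated candidates and picks the remaining one with the highest mean token log probability. The Candidate class, the helper names, and the assumed logprobs layout (one dict per position mapping token id to log probability) are illustrative assumptions, not the library's API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Candidate:
    """Hypothetical stand-in carrying only the fields this example needs."""
    text: str
    finish_reason: str
    token_ids: List[int]
    logprobs: List[Dict[int, float]]  # assumed: per position, token id -> log probability


def mean_logprob(c: Candidate) -> float:
    """Average log probability of the tokens that were actually generated."""
    chosen = [step[tid] for tid, step in zip(c.token_ids, c.logprobs)]
    return sum(chosen) / len(chosen)


def best_of_n(cands: List[Candidate]) -> Optional[Candidate]:
    """Return the most confident non-truncated candidate, or None if there is none."""
    complete = [c for c in cands if c.finish_reason != "length" and c.token_ids]
    return max(complete, key=mean_logprob) if complete else None


candidates = [
    Candidate("A", "stop", [1, 2], [{1: -0.1}, {2: -0.3}]),
    Candidate("B", "stop", [3, 4], [{3: -1.0}, {4: -2.0}]),
    Candidate("C", "length", [5], [{5: -0.01}]),  # truncated, so it is excluded
]
best = best_of_n(candidates)
```

Averaging rather than summing the log probabilities avoids favouring shorter outputs, which is a common but not universal design choice.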

Theoretical Basis

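The response pattern resembles the Value Object pattern: each response is a snapshot of generation state at a point in time. It is not strictly immutable, because an extend() method merges successive streaming chunks into an aggregate response. The sketch below shows typical consumption of a completed response: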

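# Abstract response processing (pipeline and prompt are assumed to be defined by the caller)
import time
import warnings

start = time.perf_counter()
response = pipeline(prompt)
elapsed_time = time.perf_counter() - start  # wall-clock seconds for this request
if response.finish_reason == 'length':
    warnings.warn("Output was truncated")
throughput = response.generate_token_len / elapsed_time  # generated tokens per second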
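Streaming aggregation with extend() can be sketched in the same spirit. StreamChunk and fake_stream are hypothetical stand-ins, and the merge rule here assumes each chunk carries per-chunk token counts; an implementation that reports cumulative counts would overwrite the count instead of adding to it.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class StreamChunk:
    """Hypothetical stand-in for a partial response emitted during streaming."""
    text: str = ""
    generate_token_len: int = 0
    finish_reason: Optional[str] = None
    token_ids: List[int] = field(default_factory=list)

    def extend(self, other: "StreamChunk") -> "StreamChunk":
        """Merge a later chunk into this one and return self."""
        self.text += other.text
        self.generate_token_len += other.generate_token_len  # assumes per-chunk counts
        self.token_ids.extend(other.token_ids)
        self.finish_reason = other.finish_reason  # only the final chunk sets a reason
        return self


def fake_stream() -> Iterator[StreamChunk]:
    """Simulated stream of three chunks; the last one carries the finish reason."""
    yield StreamChunk("Hel", 1, None, [10])
    yield StreamChunk("lo", 1, None, [11])
    yield StreamChunk("!", 1, "stop", [12])


full = StreamChunk()
for chunk in fake_stream():
    full.extend(chunk)
```

After the loop, full holds the concatenated text, the combined token IDs, and the finish reason of the last chunk, so it can be consumed like a non-streaming response.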

Related Pages

Implemented By

Implementation:InternLM_Lmdeploy_Response_Dataclass
