Principle:InternLM Lmdeploy Response Processing
| Knowledge Sources | |
|---|---|
| Domains | LLM_Inference, Data_Structures |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
A structured data pattern for encapsulating inference outputs including generated text, token counts, finish reasons, and optional logprobs into a unified response object.
Description
Response Processing defines how inference results are packaged and consumed. Each generation request produces a Response object containing:
- Generated text: The decoded output string
- Token counts: Both input (prompt) and output (generated) token lengths for usage tracking
- Finish reason: Why generation stopped, either stop (natural end of sequence or a stop word was hit) or length (the maximum token limit was reached)
- Token IDs: Raw output token IDs for downstream processing
- Logprobs: Optional per-token log probabilities for confidence estimation
- Streaming extension: Responses can be incrementally extended as tokens arrive during streaming
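The fields listed above can be sketched as a small dataclass. This is an illustrative stand-in, not the library's actual class: `SketchResponse` is a hypothetical name, and the field names `input_token_len`, `token_ids`, and `logprobs` are assumptions derived from the list (only `finish_reason` and `generate_token_len` appear elsewhere on this page). The `extend()` logic shown here, which counts generated tokens from the chunk's token IDs, is likewise an assumption about how streaming aggregation might work.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SketchResponse:
    """Hypothetical stand-in for the response object described above."""
    text: str = ""
    input_token_len: int = 0
    generate_token_len: int = 0
    finish_reason: Optional[str] = None  # 'stop', 'length', or None while streaming
    token_ids: List[int] = field(default_factory=list)
    logprobs: Optional[List[float]] = None

    def extend(self, chunk: "SketchResponse") -> "SketchResponse":
        """Merge a streamed chunk into this response (assumed semantics)."""
        self.text += chunk.text
        self.token_ids.extend(chunk.token_ids)
        self.generate_token_len += len(chunk.token_ids)
        if chunk.logprobs is not None:
            self.logprobs = (self.logprobs or []) + chunk.logprobs
        # The latest chunk decides whether generation has finished.
        self.finish_reason = chunk.finish_reason
        return self


# Two streamed chunks aggregated into one response.
chunks = [
    SketchResponse(text="Hel", token_ids=[11], logprobs=[-0.1]),
    SketchResponse(text="lo", token_ids=[12], logprobs=[-0.3], finish_reason="stop"),
]
merged = SketchResponse(input_token_len=4)
for chunk in chunks:
    merged.extend(chunk)
```

After the loop, `merged` holds the concatenated text and token IDs, and its `finish_reason` reflects the final chunk.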
Resource cleanup is handled by the Pipeline that produces the responses rather than by the Response itself: the Pipeline implements the context manager protocol and a close() method, which releases GPU memory and stops background threads.
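The cleanup behaviour can be illustrated with a generic context manager. `SketchPipeline` below is a hypothetical toy class, not the real Pipeline; it only shows how a `with` block guarantees that `close()` runs, and the uppercase "inference" is a placeholder.

```python
class SketchPipeline:
    """Hypothetical pipeline holding resources that must be released."""

    def __init__(self):
        self.closed = False

    def __call__(self, prompt: str) -> str:
        if self.closed:
            raise RuntimeError("pipeline is closed")
        return prompt.upper()  # placeholder for real inference

    def close(self) -> None:
        # A real pipeline would free GPU memory and stop worker threads here.
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not swallow exceptions raised inside the block


# close() is called automatically when the block exits.
with SketchPipeline() as pipe:
    result = pipe("hello")
```

Calling `close()` from `__exit__` means resources are released even if the body of the `with` block raises.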
Usage
Use this pattern when extracting results from the pipeline. Check finish_reason to determine whether the output was truncated (a value of length means the token limit was reached). Use the token counts for billing and monitoring, and the logprobs for confidence-based filtering or best-of-n selection.
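A minimal sketch of confidence-based filtering and best-of-n selection follows. It assumes each candidate exposes a `finish_reason` and a flat list of per-token log probabilities; `Candidate`, `mean_logprob`, `best_of_n`, and the `-1.0` threshold are hypothetical names and values chosen for illustration, and mean log probability is only one possible scoring rule.

```python
from typing import List, NamedTuple, Optional


class Candidate(NamedTuple):
    """Hypothetical view of the response fields needed for selection."""
    text: str
    finish_reason: str
    logprobs: List[float]


def mean_logprob(c: Candidate) -> float:
    # Average per-token log probability; empty outputs score lowest.
    return sum(c.logprobs) / len(c.logprobs) if c.logprobs else float("-inf")


def best_of_n(candidates: List[Candidate],
              min_mean_logprob: float = -1.0) -> Optional[Candidate]:
    # Drop truncated outputs, then drop low-confidence ones.
    complete = [c for c in candidates if c.finish_reason != "length"]
    confident = [c for c in complete if mean_logprob(c) >= min_mean_logprob]
    # Pick the highest-scoring survivor, or None if nothing qualifies.
    return max(confident, key=mean_logprob, default=None)


candidates = [
    Candidate("A", "stop", [-0.2, -0.4]),    # complete and confident
    Candidate("B", "length", [-0.1, -0.1]),  # truncated, so excluded
    Candidate("C", "stop", [-1.5, -2.5]),    # complete but low confidence
]
best = best_of_n(candidates)
```

Here the truncated candidate is excluded even though its log probabilities are the highest, which is why finish_reason is checked before scoring.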
Theoretical Basis
The response pattern resembles the Value Object pattern: each response is a snapshot of generation state at the moment it is returned, and an extend() method aggregates successive snapshots during streaming:
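# Abstract response processing (pipeline and prompt are assumed to be defined)
import time
import warnings

start = time.perf_counter()
response = pipeline(prompt)
elapsed_time = time.perf_counter() - start
if response.finish_reason == 'length':
    warnings.warn("Output was truncated")
throughput = response.generate_token_len / elapsed_time  # generated tokens per second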