Implementation:Mit han lab Llm awq StreamGenerator
Overview
StreamGenerator is a generator function in the llm-awq library (tinychat package) that streams generated text token by token from AWQ-quantized models.
Source
tinychat/stream_generators/stream_gen.py, Lines 35-213
Signature
@torch.inference_mode()
def StreamGenerator(
    model,
    tokenizer,
    input: str,
    start_pos: int,
    gen_params: dict,
    device: str = "cuda:0",
    stream_interval: int = 2,
    echo: bool = False,
    stop_token_ids=[],
    chunk_prefilling=False,
    quant_llm=False,
):
Import
from tinychat.stream_generators.stream_gen import StreamGenerator
I/O
Inputs:
- model - the loaded (quantized) language model
- tokenizer - the tokenizer corresponding to the model
- input (str) - the formatted prompt string
- start_pos (int) - current KV cache position for multi-turn conversations
- gen_params - generation parameters; annotated as dict in the signature, but described here as an object with the following attributes:
  - temp - temperature for sampling
  - top_k - top-k filtering parameter
  - top_p - nucleus sampling threshold
  - repeat_penalty - repetition penalty factor
  - n_predict - maximum number of tokens to generate
  - n_vocab - vocabulary size
- device (str) - target device, default "cuda:0"
- stream_interval (int) - yield partial output every N tokens, default 2
- echo (bool) - whether to include the input prompt in the output
- stop_token_ids (list) - list of token IDs that signal generation should stop
- chunk_prefilling (bool) - whether to prefill the context in chunks
- quant_llm (bool) - whether the model uses QuantLLM quantization
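The gen_params fields listed above can be pictured as a small attribute container. The sketch below is only an illustration: the class name GenParams and every default value are assumptions made for this example, not the library's own definition or defaults.

```python
from dataclasses import dataclass


@dataclass
class GenParams:
    """Hypothetical container; field names follow the Inputs list above.

    All default values are illustrative placeholders, not llm-awq defaults.
    """
    temp: float = 0.7            # sampling temperature
    top_k: int = 40              # top-k filtering parameter
    top_p: float = 0.95          # nucleus sampling threshold
    repeat_penalty: float = 1.1  # repetition penalty factor
    n_predict: int = 256         # maximum number of tokens to generate
    n_vocab: int = 32000         # vocabulary size


# Override only the fields that differ from the placeholder defaults.
params = GenParams(n_predict=128)
```

An object of this shape supports attribute access such as params.n_predict, which matches how the fields are described on this page.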
Output:
- Generator yielding dicts with keys:
  - "text" - the generated text so far
  - "usage" - token usage information
  - "finish_reason" - reason for stopping (None while generating, "stop" or "length" at end)
  - "timing" - timing statistics (only in final yield)
The final yield includes a timing dict with:
- context_tokens - number of tokens in the input context
- context_time - time taken for context prefilling
- total_tokens - total number of generated tokens
- generation_time_list - list of per-token generation times
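Because "text" carries the text generated so far rather than a delta, a consumer typically emits only the new suffix on each yield. The sketch below shows one way to do that and to derive a decode rate from the final timing dict. fake_stream is a hypothetical stand-in that mimics the documented yield shape so the example runs without a model. The helper names, the sample values, and the assumption that times are in seconds are all illustrative.

```python
from typing import Any, Dict, Iterator, Optional, Tuple


def fake_stream() -> Iterator[Dict[str, Any]]:
    """Hypothetical stand-in mimicking the documented yield shape."""
    usage: Dict[str, Any] = {}  # contents of "usage" are not specified on this page
    yield {"text": "Hello", "usage": usage, "finish_reason": None}
    yield {"text": "Hello world", "usage": usage, "finish_reason": None}
    yield {
        "text": "Hello world!",
        "usage": usage,
        "finish_reason": "stop",
        "timing": {
            "context_tokens": 4,
            "context_time": 0.02,
            "total_tokens": 3,
            "generation_time_list": [0.01, 0.01, 0.02],
        },
    }


def consume(stream: Iterator[Dict[str, Any]]) -> Tuple[str, Optional[Dict[str, Any]]]:
    """Collect only the newly added suffix of each cumulative "text" value."""
    seen = 0
    pieces = []
    last = None
    for out in stream:
        text = out["text"]
        pieces.append(text[seen:])  # new characters since the previous yield
        seen = len(text)
        last = out
    return "".join(pieces), last


def decode_tokens_per_second(timing: Dict[str, Any]) -> Optional[float]:
    """Generated tokens divided by summed per-token times (assumed to be seconds)."""
    elapsed = sum(timing["generation_time_list"])
    return timing["total_tokens"] / elapsed if elapsed > 0 else None


full_text, final = consume(fake_stream())
rate = decode_tokens_per_second(final["timing"])
```

With a real model, the stream passed to consume would presumably be the generator returned by StreamGenerator(model, tokenizer, prompt, start_pos, gen_params), per the signature above.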
Related Pages
- Principle:Mit_han_lab_Llm_awq_Streaming_Text_Generation
- Environment:Mit_han_lab_Llm_awq_Flash_Attention_Environment
Knowledge Sources
- Repo: llm-awq (https://github.com/mit-han-lab/llm-awq)
Domains
- NLP
- Inference