Implementation:Mit han lab Llm awq StreamGenerator

Overview

Concrete tool for streaming text generation, token by token, from AWQ-quantized models. It is provided by the llm-awq library under its tinychat package.

Source

tinychat/stream_generators/stream_gen.py, Lines 35-213

Signature

@torch.inference_mode()
def StreamGenerator(
    model,
    tokenizer,
    input: str,
    start_pos: int,
    gen_params: dict,
    device: str = "cuda:0",
    stream_interval: int = 2,
    echo: bool = False,
    stop_token_ids=[],
    chunk_prefilling=False,
    quant_llm=False,
):

Import

from tinychat.stream_generators.stream_gen import StreamGenerator

I/O

Inputs:

model - the loaded (quantized) language model
tokenizer - the tokenizer corresponding to the model
input (str) - the formatted prompt string
start_pos (int) - current KV cache position for multi-turn conversations
gen_params (dict) - generation parameters; annotated as dict in the signature, but described here as an object with the following attributes:
- temp - temperature for sampling
- top_k - top-k filtering parameter
- top_p - nucleus sampling threshold
- repeat_penalty - repetition penalty factor
- n_predict - maximum number of tokens to generate
- n_vocab - vocabulary size
device (str) - target device, default "cuda:0"
stream_interval (int) - yield partial output every N tokens, default 2
echo (bool) - whether to include the input prompt in the output, default False
stop_token_ids (list) - token IDs that signal generation should stop, default []
chunk_prefilling (bool) - whether to prefill the context in chunks, default False
quant_llm (bool) - whether the model uses QuantLLM quantization, default False
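
The attribute list for gen_params suggests it is read by attribute (for example gen_params.temp) rather than by key. A minimal sketch of such a container follows, assuming a plain dataclass is acceptable to the caller. The class name GenParams and every default value are hypothetical illustrations, not taken from llm-awq:

```python
from dataclasses import dataclass


@dataclass
class GenParams:
    """Hypothetical container mirroring the attributes listed for gen_params."""

    temp: float = 0.7            # sampling temperature (illustrative value)
    top_k: int = 40              # top-k filtering parameter (illustrative value)
    top_p: float = 0.95          # nucleus sampling threshold (illustrative value)
    repeat_penalty: float = 1.1  # repetition penalty factor (illustrative value)
    n_predict: int = 512         # maximum number of tokens to generate
    n_vocab: int = 32000         # vocabulary size; depends on the model


# Override only what differs from the illustrative defaults.
gen_params = GenParams(n_predict=128)
print(gen_params.temp, gen_params.n_predict)
```

An object like this would be passed as the gen_params argument; whether the library ships its own parameter class is not covered on this page.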

Output:

Generator yielding dicts with keys:
- "text" - the generated text so far
- "usage" - token usage information
- "finish_reason" - reason for stopping (None while generating, "stop" or "length" at end)
- "timing" - timing statistics (only in final yield)
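
Because "text" holds the text generated so far rather than only the newest tokens, a consumer typically prints just the part it has not shown yet. The sketch below illustrates that pattern. To stay runnable without a GPU or model, it uses fake_stream, a hypothetical stand-in that yields dicts in the shape described here; in real use the iterable would be the result of calling StreamGenerator. The empty "usage" dict and the timing values are placeholders:

```python
from typing import Dict, Iterable, Iterator, Optional


def fake_stream() -> Iterator[Dict]:
    """Hypothetical stand-in for StreamGenerator(...), for illustration only."""
    yield {"text": "Hello", "usage": {}, "finish_reason": None}
    yield {"text": "Hello, world", "usage": {}, "finish_reason": None}
    yield {
        "text": "Hello, world!",
        "usage": {},
        "finish_reason": "stop",
        # Placeholder timing values, present only in the final yield.
        "timing": {
            "context_tokens": 4,
            "context_time": 0.1,
            "total_tokens": 3,
            "generation_time_list": [0.02, 0.02, 0.02],
        },
    }


def consume(stream: Iterable[Dict]) -> Optional[Dict]:
    """Print only newly generated text and return the final yielded dict."""
    printed = 0  # number of characters already written to stdout
    last = None
    for out in stream:
        text = out["text"]
        print(text[printed:], end="", flush=True)
        printed = len(text)
        last = out
    print()
    return last


final = consume(fake_stream())
```

After the loop, final["finish_reason"] tells the caller whether generation ended on a stop token ("stop") or hit the token limit ("length").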

The final yield includes a timing dict with:

context_tokens - number of tokens in the input context
context_time - time taken for context prefilling
total_tokens - total number of generated tokens
generation_time_list - list of per-token generation times
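
The timing dict can be reduced to throughput figures. The helper below is a minimal sketch: summarize_timing is a hypothetical name, and it assumes that context_time and the entries of generation_time_list are measured in seconds, which this page does not state. It counts decoded tokens from the length of generation_time_list so that the result does not depend on how total_tokens relates to that list:

```python
def summarize_timing(timing: dict) -> dict:
    """Derive throughput figures from a timing dict (times assumed in seconds)."""
    gen_times = timing["generation_time_list"]
    decode_time = sum(gen_times)
    context_time = timing["context_time"]
    return {
        # Prefill throughput: context tokens processed per second.
        "prefill_tokens_per_s": (
            timing["context_tokens"] / context_time if context_time > 0 else None
        ),
        # Decode throughput: generated tokens per second of decoding time.
        "decode_tokens_per_s": (
            len(gen_times) / decode_time if decode_time > 0 else None
        ),
        # Average latency of a single decoding step.
        "mean_token_latency_s": (
            decode_time / len(gen_times) if gen_times else None
        ),
    }


# Placeholder values chosen for easy arithmetic, not measured results.
example_timing = {
    "context_tokens": 20,
    "context_time": 0.5,
    "total_tokens": 4,
    "generation_time_list": [0.25, 0.25, 0.25, 0.25],
}
summary = summarize_timing(example_timing)
print(summary)
```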

Related Pages

Knowledge Sources

Repo: llm-awq (https://github.com/mit-han-lab/llm-awq)

Domains

NLP
Inference
