Implementation:Mit han lab Llm awq StreamGenerator

From Leeroopedia

Overview

Concrete tool, provided by the llm-awq library, for token-by-token streaming text generation from AWQ-quantized models.

Source

tinychat/stream_generators/stream_gen.py, Lines 35-213

Signature

@torch.inference_mode()
def StreamGenerator(
    model,
    tokenizer,
    input: str,
    start_pos: int,
    gen_params: dict,
    device: str = "cuda:0",
    stream_interval: int = 2,
    echo: bool = False,
    stop_token_ids=[],
    chunk_prefilling=False,
    quant_llm=False,
):

Import

from tinychat.stream_generators.stream_gen import StreamGenerator

I/O

Inputs:

  • model - the loaded (quantized) language model
  • tokenizer - the tokenizer corresponding to the model
  • input (str) - the formatted prompt string
  • start_pos (int) - current KV cache position for multi-turn conversations
  • gen_params - generation parameters; annotated as dict in the signature, but documented here as an object with the following attributes:
    • temp - temperature for sampling
    • top_k - top-k filtering parameter
    • top_p - nucleus sampling threshold
    • repeat_penalty - repetition penalty factor
    • n_predict - maximum number of tokens to generate
    • n_vocab - vocabulary size
  • device (str) - target device, default "cuda:0"
  • stream_interval (int) - yield partial output every N tokens, default 2
  • echo (bool) - whether to include the input prompt in the output, default False
  • stop_token_ids (list) - token IDs that signal generation should stop, default empty list
  • chunk_prefilling (bool) - whether to prefill the context in chunks, default False
  • quant_llm (bool) - whether the model uses QuantLLM quantization, default False
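
The gen_params fields listed above are described as attributes, so one way to build a compatible object is a simple namespace. The following is a minimal sketch under that assumption; every value shown is an illustrative placeholder, not a default taken from llm-awq.

```python
from types import SimpleNamespace

# Hypothetical values for illustration only; they are not library defaults.
gen_params = SimpleNamespace(
    temp=0.7,            # sampling temperature
    top_k=40,            # top-k filtering parameter
    top_p=0.95,          # nucleus sampling threshold
    repeat_penalty=1.1,  # repetition penalty factor
    n_predict=256,       # maximum number of tokens to generate
    n_vocab=32000,       # vocabulary size (must match the actual model)
)

# Fields are read with attribute access, e.g. gen_params.temp
print(gen_params.temp, gen_params.n_predict)
```

A namespace is used here only because it gives attribute access with no extra code; if the library ships its own parameter object, that object is the better choice.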

Output:

  • Generator yielding dicts with keys:
    • "text" - the generated text so far
    • "usage" - token usage information
    • "finish_reason" - reason for stopping (None while generating, "stop" or "length" at end)
    • "timing" - timing statistics (only in final yield)
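
Because "text" holds the generated text so far, a consumer usually prints only the part that is new since the previous yield. The sketch below shows that pattern against a stand-in generator, `fake_stream`, which is hypothetical and only mimics the documented dict shape so the example runs without a GPU or model; the contents of "usage" are not specified on this page and are left empty.

```python
def fake_stream(words, stream_interval=2):
    """Hypothetical stand-in for StreamGenerator.

    Yields the cumulative text every `stream_interval` tokens, plus a
    final yield whose finish_reason is "stop".
    """
    text = ""
    for i, word in enumerate(words, start=1):
        text = (text + " " + word).strip()
        is_last = i == len(words)
        if i % stream_interval == 0 or is_last:
            yield {
                "text": text,
                "usage": {},  # placeholder: real contents are not documented here
                "finish_reason": "stop" if is_last else None,
            }


def consume(stream):
    """Print only the newly generated suffix of each yield; return the last dict."""
    printed = 0
    final = None
    for out in stream:
        text = out["text"]
        print(text[printed:], end="", flush=True)
        printed = len(text)
        final = out
    print()
    return final


final = consume(fake_stream(["streaming", "works", "token", "by", "token"]))
print("finish_reason:", final["finish_reason"])
```

With the real function, the same `consume` loop would be handed the generator returned by `StreamGenerator(...)` instead of `fake_stream(...)`.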

The final yield includes a timing dict with:

  • context_tokens - number of tokens in the input context
  • context_time - time taken for context prefilling
  • total_tokens - total number of generated tokens
  • generation_time_list - list of per-token generation times
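
The timing fields above are enough to derive prefill and decode throughput. The helper below, `summarize_timing`, is a hypothetical name introduced for illustration, and it assumes the times are reported in seconds, which this page does not state. The sample dict is made-up data, not a measurement.

```python
def summarize_timing(timing):
    """Derive throughput figures from the final yield's timing dict.

    Assumes context_time and generation_time_list are in seconds.
    """
    gen_times = timing["generation_time_list"]
    total_gen_time = sum(gen_times)
    steps = len(gen_times)
    return {
        # Prefill speed: prompt tokens processed per second.
        "prefill_tokens_per_s": (
            timing["context_tokens"] / timing["context_time"]
            if timing["context_time"] > 0 else None
        ),
        # Decode speed: timed generation steps per second.
        "decode_tokens_per_s": steps / total_gen_time if total_gen_time > 0 else None,
        # Average latency of one generation step.
        "mean_token_latency_s": total_gen_time / steps if steps else None,
    }


# Made-up numbers for illustration only.
sample_timing = {
    "context_tokens": 128,
    "context_time": 0.5,
    "total_tokens": 4,
    "generation_time_list": [0.25, 0.25, 0.25, 0.25],
}
stats = summarize_timing(sample_timing)
print(stats)
```

The decode rate is computed from the length of generation_time_list rather than total_tokens, so it stays consistent even if the two counts differ.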

Related Pages

Knowledge Sources

Domains

  • NLP
  • Inference
