Implementation:Mit han lab Llm awq Make quant attn

From Leeroopedia

Revision as of 13:16, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Mit_han_lab_Llm_awq_Make_quant_attn.md)

Overview

A concrete tool, provided by the llm-awq library, for replacing the standard attention modules in TinyChat models with fused quantized attention modules.

Source

File: tinychat/modules/fused_attn.py, Lines 549-634

Signature

def make_quant_attn(model, dev, flash_attn=True):
    ...

Import

from tinychat.modules import make_quant_attn

I/O

Inputs

model (nn.Module) - the model to modify
dev (str) - target device
flash_attn (bool, default True) - whether to use FlashAttention for prefilling

Output

None (model is modified in-place)
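A minimal usage sketch, based only on the signature and import shown above. The wrapper name `apply_quant_attn` and the device string `"cuda:0"` are illustrative assumptions; loading and quantizing the model is outside the scope of this page and is not shown.

```python
def apply_quant_attn(model, dev="cuda:0", flash_attn=True):
    """Hypothetical convenience wrapper around make_quant_attn.

    The import is deferred so this sketch can be read (and imported)
    without the llm-awq / TinyChat package installed.
    """
    from tinychat.modules import make_quant_attn

    # make_quant_attn returns None and modifies `model` in place,
    # so the same object is handed back for convenience.
    make_quant_attn(model, dev, flash_attn=flash_attn)
    return model


# Example call (assumes `model` is an already-loaded TinyChat model):
# model = apply_quant_attn(model, dev="cuda:0", flash_attn=True)
```

Because the function returns None, do not write `model = make_quant_attn(model, dev)`; that would rebind `model` to None.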

Details

Replaces LlamaAttention, LlamaAttentionFused, and Qwen2AttentionFused modules with QuantLlamaAttentionFusedFlash (when flash_attn=True) or QuantLlamaAttentionFused (otherwise)
Fuses q_proj, k_proj, v_proj into a single WQLinear layer
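The idea behind fusing the three projections can be illustrated with a small floating-point analogue. This is a sketch under stated assumptions, not the library's code: it uses plain NumPy matrices instead of WQLinear's quantized weights, and the helper names `fuse_qkv` and `fused_forward` are hypothetical.

```python
import numpy as np


def fuse_qkv(w_q, w_k, w_v):
    """Stack three (out_features, in_features) weights along the output axis."""
    return np.concatenate([w_q, w_k, w_v], axis=0)


def fused_forward(x, w_qkv, sizes):
    """Run one matmul, then slice the result back into q, k, v."""
    out = x @ w_qkv.T
    q_end = sizes[0]
    k_end = q_end + sizes[1]
    return out[..., :q_end], out[..., q_end:k_end], out[..., k_end:]


# Toy dimensions chosen for illustration only.
rng = np.random.default_rng(0)
hidden = 8
w_q, w_k, w_v = (rng.standard_normal((hidden, hidden)) for _ in range(3))
x = rng.standard_normal((2, hidden))

w_qkv = fuse_qkv(w_q, w_k, w_v)
q, k, v = fused_forward(x, w_qkv, (hidden, hidden, hidden))

# The fused projection matches the three separate projections.
assert np.allclose(q, x @ w_q.T)
assert np.allclose(k, x @ w_k.T)
assert np.allclose(v, x @ w_v.T)
```

A single larger matrix multiply replaces three smaller ones, which reduces kernel launches and reads the input activations once instead of three times.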

Related Pages

Knowledge Sources

Repo: llm-awq (https://github.com/mit-han-lab/llm-awq)

Domains

Inference
Optimization
