Implementation:Mit han lab Llm awq Make quant attn
Overview
make_quant_attn is the concrete tool in the llm-awq library that replaces the standard attention modules of a TinyChat model with fused, quantized attention modules.
Source
File: tinychat/modules/fused_attn.py, Lines 549-634
Signature
def make_quant_attn(model, dev, flash_attn=True):
...
Import
from tinychat.modules import make_quant_attn
I/O
Inputs
- model (nn.Module) - the model to modify
- dev (str) - target device
- flash_attn (bool, default True) - whether to use FlashAttention for prefilling
Output
- None (model is modified in-place)
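The in-place contract above (nothing returned, the caller's model object mutated) can be illustrated with a minimal sketch. Every class and function name below (Module, Attention, FusedAttention, Block, Model, replace_attention) is a hypothetical stand-in written in plain Python for illustration; none of it is the llm-awq code, which operates on real nn.Module objects.

```python
class Module:
    """Tiny stand-in for nn.Module: child modules are plain attributes."""

    def named_children(self):
        # Materialize the list so attributes can be reassigned while looping.
        return [(k, v) for k, v in vars(self).items() if isinstance(v, Module)]


class Attention(Module):
    """Hypothetical unfused attention layer."""


class FusedAttention(Module):
    """Hypothetical fused replacement; records the device and flash setting."""

    def __init__(self, original, dev, flash):
        self.dev = dev
        self.flash = flash


class Block(Module):
    def __init__(self):
        self.self_attn = Attention()


class Model(Module):
    def __init__(self):
        self.layer0 = Block()
        self.layer1 = Block()


def replace_attention(module, dev, flash_attn=True):
    """Recursively swap each Attention child for FusedAttention. Returns None."""
    for name, child in module.named_children():
        if isinstance(child, Attention):
            setattr(module, name, FusedAttention(child, dev, flash_attn))
        else:
            replace_attention(child, dev, flash_attn)


model = Model()
result = replace_attention(model, "cuda:0")  # mutates `model`, returns None
```

Because the swap happens through setattr on the parent, any reference the caller already holds to the top-level model sees the new attention modules without reassignment.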
Details
- Replaces each LlamaAttention, LlamaAttentionFused, and Qwen2AttentionFused module with QuantLlamaAttentionFusedFlash when flash_attn=True, or with QuantLlamaAttentionFused otherwise
- Fuses q_proj, k_proj, v_proj into a single WQLinear layer
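The idea behind fusing q_proj, k_proj, and v_proj can be sketched as follows: stacking the three weight matrices along the output dimension lets a single matrix product yield q, k, and v at once. This is an illustration only. It uses small unquantized float matrices in plain Python, whereas WQLinear holds quantized weights; the helper matvec and all weight values are assumptions made up for the example.

```python
def matvec(weight, x):
    """Multiply a row-major matrix (a list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


# Hypothetical projection weights, each mapping a 2-dim input to 2 outputs.
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[2.0, 0.0], [0.0, 2.0]]
w_v = [[0.0, 1.0], [1.0, 0.0]]

# Fuse: stack the three matrices along the output dimension (list concatenation).
w_qkv = w_q + w_k + w_v

x = [3.0, 5.0]
fused_out = matvec(w_qkv, x)  # one matrix product instead of three

# Split the fused output back into the q, k, and v vectors.
n = len(w_q)
q, k, v = fused_out[:n], fused_out[n:2 * n], fused_out[2 * n:]
```

The fused product gives the same q, k, and v as three separate projections, but with one kernel launch and one pass over the input instead of three.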
Related Pages
- Principle:Mit_han_lab_Llm_awq_Fused_Attention_Optimization
- Environment:Mit_han_lab_Llm_awq_CUDA_Build_Environment
- Environment:Mit_han_lab_Llm_awq_Flash_Attention_Environment
- Heuristic:Mit_han_lab_Llm_awq_Kernel_Selection_Thresholds
Knowledge Sources
- Repo|llm-awq|https://github.com/mit-han-lab/llm-awq
Domains
- Inference
- Optimization