Implementation:Predibase Lorax Flash Qwen Modeling

Knowledge Sources	Predibase_Lorax
Domains	Model_Architecture, Inference
Last Updated	2026-02-08 00:00 GMT

Overview

Optimized Qwen (v1) transformer implementation for LoRAX inference serving, with flash attention, a fused c_attn projection, and LoRA adapter support.

Description

FlashQwenForCausalLM implements the original Qwen (v1) architecture from Alibaba Cloud with flash attention for efficient batched inference. The module features a distinctive fused c_attn weight format for attention projections and a SwiGLU-style MLP.
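To make the fused c_attn format concrete, here is a minimal sketch of splitting one token's fused projection output into Q, K and V. It is not the LoRAX code: it assumes the layout [Q | K | V] with three equal-sized blocks (plain multi-head attention), uses Python lists in place of tensors, and the helper name split_fused_qkv is hypothetical.

```python
from typing import List, Tuple

Heads = List[List[float]]


def split_fused_qkv(
    fused: List[float], num_heads: int, head_size: int
) -> Tuple[Heads, Heads, Heads]:
    """Split one token's fused c_attn output into per-head Q, K and V.

    Assumption: the fused vector is laid out as [Q | K | V], and each
    block is num_heads * head_size values wide.
    """
    block = num_heads * head_size
    if len(fused) != 3 * block:
        raise ValueError(f"expected {3 * block} values, got {len(fused)}")

    def to_heads(flat: List[float]) -> Heads:
        # Reshape a flat block into [num_heads][head_size].
        return [flat[h * head_size:(h + 1) * head_size] for h in range(num_heads)]

    # Offset-based slicing: block i starts at offset i * block.
    q, k, v = (to_heads(fused[i * block:(i + 1) * block]) for i in range(3))
    return q, k, v
```

With tensor parallelism the same offsets would apply to each rank's shard of the three blocks; that sharding step is omitted here.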

The file contains seven classes organized as a layered architecture:

QwenConfig -- Configuration class extending PretrainedConfig with standard parameters including rope_scaling and rope_theta.
QwenRMSNorm -- RMS normalization with a fused dropout_layer_norm kernel for hidden dimensions up to 8192, falling back to a manual implementation for larger dimensions. Returns both the normalized output and the residual connection.
FlashQwenAttention -- Multi-head attention that loads a single fused c_attn projection and splits it into Q/K/V using offset-based slicing. Applies rotary position embeddings, then uses flash attention for prefill and paged attention for decode. Supports adapter-aware projections via custom names (ATTN_C_ATTN, ATTN_C_PROJ).
QwenMLP -- SwiGLU-style MLP with separate w1/w2 projections (for gate and up) and a c_proj output projection. Uses adapter-aware layers with custom names (MLP_W1, MLP_W2, MLP_C_PROJ).
FlashQwenLayer -- Single transformer decoder layer combining attention and MLP with pre-norm RMS normalization.
FlashQwenModel -- Full transformer model (named transformer internally) stacking N decoder layers with token embeddings and final normalization.
FlashQwenForCausalLM -- Top-level causal language model that wraps the transformer model with a language model head supporting LoRA adapters.
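The SwiGLU-style MLP listed above can be sketched as follows. This is an illustrative, dependency-free version that uses Python lists instead of tensors, not the LoRAX implementation. The names gate_w, up_w and out_w are hypothetical stand-ins for the w1/w2 pair and c_proj; which of w1 and w2 plays the gate role is deliberately left unspecified, and adapter handling is omitted.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def silu(x: float) -> float:
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(weight: Matrix, vec: Vector) -> Vector:
    """Multiply a row-major weight matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def swiglu_mlp(x: Vector, gate_w: Matrix, up_w: Matrix, out_w: Matrix) -> Vector:
    """Compute out_w @ (silu(gate_w @ x) * (up_w @ x))."""
    gate = [silu(g) for g in matvec(gate_w, x)]
    up = matvec(up_w, x)
    # Element-wise gating of the "up" branch by the activated "gate" branch.
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(out_w, hidden)
```

Serving implementations often concatenate the gate and up weights into one matrix so both branches come from a single matrix multiply; the sketch keeps them separate for readability.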

The implementation supports FP8 KV cache quantization and tensor parallelism for multi-GPU serving.

Usage

Used internally by the LoRAX server when serving original Qwen (v1) models from Alibaba Cloud. The model registry selects this implementation when the model type declared in the model's config identifies it as Qwen (v1).
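The registry-based selection can be pictured with the sketch below. Everything in it is hypothetical: MODEL_REGISTRY, get_model and FlashQwenStub are illustrative names, the "qwen" key is an assumed model_type value, and the real dispatch logic in server/lorax_server/models/__init__.py may be structured differently.

```python
from typing import Callable, Dict


class FlashQwenStub:
    """Placeholder standing in for the real model wrapper (hypothetical)."""

    def __init__(self, model_id: str):
        self.model_id = model_id


# Hypothetical registry keyed by the config's model_type string.
MODEL_REGISTRY: Dict[str, Callable[[str], object]] = {
    "qwen": FlashQwenStub,  # assumed type string for Qwen (v1)
}


def get_model(model_id: str, config: dict) -> object:
    """Pick a model implementation based on the config's model_type."""
    model_type = config.get("model_type")
    try:
        factory = MODEL_REGISTRY[model_type]
    except KeyError:
        raise ValueError(f"unsupported model type: {model_type!r}") from None
    return factory(model_id)
```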

Code Reference

Source Location

Repository: Predibase_Lorax
File: server/lorax_server/models/custom_modeling/flash_qwen_modeling.py
Lines: 1-530

Signature

class FlashQwenForCausalLM(torch.nn.Module):
    def __init__(self, prefix: str, config, weights):
        ...

    def forward(
        self,
        input_ids: torch.Tensor,
        position_ids: torch.Tensor,
        cu_seqlen_prefill: Optional[torch.Tensor],
        kv_cache: List[Tuple[torch.Tensor, torch.Tensor]],
        block_tables: torch.Tensor,
        slots: torch.Tensor,
        seqlen: Seqlen,
        max_s: int,
        adapter_data: AdapterBatchData,
        prefill_cache_indices: Optional[torch.Tensor] = None,
        lm_head_indices: Optional[torch.Tensor] = None,
        skip_lm_head: bool = False,
    ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
        ...

Import

from lorax_server.models.custom_modeling.flash_qwen_modeling import FlashQwenForCausalLM

I/O Contract

Inputs

Name	Type	Required	Description
input_ids	torch.Tensor	Yes	Token IDs for all sequences in the batch, flattened into one dimension [num_tokens]
position_ids	torch.Tensor	Yes	Position indices for rotary embeddings
cu_seqlen_prefill	Optional[torch.Tensor]	Yes	Cumulative sequence lengths for flash attention prefill (None during decode)
kv_cache	List[Tuple[torch.Tensor, torch.Tensor]]	Yes	Key-value cache tensors per layer
block_tables	torch.Tensor	Yes	Block table indices for paged attention
slots	torch.Tensor	Yes	Slot indices for KV cache placement
seqlen	Seqlen	Yes	Sequence length metadata wrapper
max_s	int	Yes	Maximum sequence length in the batch
adapter_data	AdapterBatchData	Yes	LoRA adapter weights and indices for the batch
prefill_cache_indices	Optional[torch.Tensor]	No	Indices for selective KV cache population during prefill
lm_head_indices	Optional[torch.Tensor]	No	Indices to select specific positions for LM head output
skip_lm_head	bool	No	If True, return hidden states without applying the LM head
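As an illustration of how cu_seqlen_prefill and max_s relate to a packed batch, here is a minimal sketch. It assumes the common variable-length flash attention convention: sequences are concatenated along one token dimension and their boundaries are given as cumulative lengths starting at 0. The helper names are hypothetical and plain lists stand in for tensors.

```python
from itertools import accumulate
from typing import List, Sequence


def build_cu_seqlens(seq_lens: Sequence[int]) -> List[int]:
    """Cumulative sequence lengths with a leading 0, one entry per boundary."""
    return [0, *accumulate(seq_lens)]


def pack(sequences: Sequence[Sequence[int]]) -> List[int]:
    """Concatenate per-sequence token IDs into one flat list."""
    return [tok for seq in sequences for tok in seq]


def sequence_slice(packed: List[int], cu_seqlens: List[int], i: int) -> List[int]:
    """Recover sequence i from the packed tokens using its boundaries."""
    return packed[cu_seqlens[i]:cu_seqlens[i + 1]]


batch = [[11, 12, 13], [21], [31, 32]]
seq_lens = [len(seq) for seq in batch]
cu_seqlens = build_cu_seqlens(seq_lens)  # boundaries of each sequence
packed = pack(batch)                     # flattened token IDs
max_s = max(seq_lens)                    # longest sequence in the batch
```

During decode each sequence contributes a single new token, which is why no cumulative lengths are needed and the argument is None.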

Outputs

Name	Type	Description
logits	torch.Tensor	Next-token logits [batch_size, vocab_size] (or hidden states if skip_lm_head is True)
speculative_logits	Optional[torch.Tensor]	Speculative decoding logits from the multi-adapter head, or None
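The effect of lm_head_indices and skip_lm_head on the first output can be sketched as below. This is an illustration under stated assumptions, not the actual forward pass: select_positions and project are hypothetical helpers, rows are plain lists, position selection is assumed to happen before the head is applied, and the speculative output is left out.

```python
from typing import Callable, List, Optional, Sequence

Row = List[float]


def select_positions(
    hidden_states: Sequence[Row], lm_head_indices: Optional[Sequence[int]] = None
) -> List[Row]:
    """Keep only the requested token positions (all of them if None)."""
    if lm_head_indices is None:
        return list(hidden_states)
    return [hidden_states[i] for i in lm_head_indices]


def project(
    hidden_states: Sequence[Row],
    lm_head: Callable[[Row], Row],
    lm_head_indices: Optional[Sequence[int]] = None,
    skip_lm_head: bool = False,
) -> List[Row]:
    """Return logits for the selected positions, or raw hidden states."""
    selected = select_positions(hidden_states, lm_head_indices)
    if skip_lm_head:
        # Caller wants hidden states; the head is never applied.
        return selected
    return [lm_head(row) for row in selected]
```

Selecting positions first means the vocabulary-sized projection runs only on the rows that are actually needed, for example the last token of each sequence during prefill.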

Usage Examples

# Internal usage within the LoRAX server
from lorax_server.models.custom_modeling.flash_qwen_modeling import FlashQwenForCausalLM

# Model is instantiated by the model registry, not directly by users
# See server/lorax_server/models/__init__.py for registration

Related Pages

Environment:Predibase_Lorax_CUDA_GPU_Runtime
