
Implementation:Vllm project Vllm DNNL Kernels

From Leeroopedia


Knowledge Sources
Domains CPU_Inference, DNNL, Quantization
Last Updated 2026-02-08 00:00 GMT

Overview

Implements OneDNN-based INT8 quantization kernels and matmul operations for CPU inference, providing both a static per-tensor quantization path and a dynamic per-token quantization path.

Description

This file provides vectorized INT8 quantization implementations: static_scaled_int8_quant_impl performs per-tensor quantization with optional asymmetric zero-point support, while dynamic_scaled_int8_quant_impl computes per-token scales on the fly. It also exposes onednn_mm for general floating-point matmul and onednn_qmatmul for quantized matmul, both built on prepacked OneDNN primitives managed by the DNNL helper layer.
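To make the static path concrete, here is a scalar reference sketch of per-tensor INT8 quantization with an optional zero point. The function name and loop are illustrative only; the real kernel vectorizes this computation with architecture-specific SIMD types and OpenMP.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Scalar reference for the static per-tensor path:
//   out[i] = clamp(round(input[i] / scale) + azp, -128, 127)
// Hypothetical helper name; the production kernel processes whole SIMD
// vectors per iteration instead of one element.
std::vector<int8_t> static_scaled_int8_quant_ref(const std::vector<float>& input,
                                                 float scale, int32_t azp = 0) {
  std::vector<int8_t> out(input.size());
  for (size_t i = 0; i < input.size(); ++i) {
    // nearbyint uses the current rounding mode (round-to-nearest-even by default)
    float q = std::nearbyint(input[i] / scale) + static_cast<float>(azp);
    out[i] = static_cast<int8_t>(std::clamp(q, -128.0f, 127.0f));
  }
  return out;
}
```

With a nonzero azp the representable range shifts, which is what enables the asymmetric quantization mentioned above.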

The kernels use architecture-specific vector types (AVX-512, ARM NEON, Power VSX) via the KernelVecType template specializations for float, BFloat16, and Half data types, with OpenMP parallelism for multi-core scaling.
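The dispatch described above can be sketched as a primary template with per-scalar-type specializations. The struct names below other than KernelVecType are illustrative stand-ins, not the real AVX-512/NEON/VSX wrappers from cpu_types.hpp.

```cpp
#include <cstdint>

// Illustrative SIMD wrapper stand-ins (assumed names, fixed lane counts).
struct FP32Vec16 { static constexpr int VEC_ELEM_NUM = 16; };
struct BF16Vec16 { static constexpr int VEC_ELEM_NUM = 16; };
struct bf16_t { uint16_t bits; };  // placeholder bfloat16 scalar

// Primary template is left undefined, so unsupported scalar types fail to
// compile at the point of use.
template <typename scalar_t>
struct KernelVecType;

template <>
struct KernelVecType<float> {
  using load_vec_type = FP32Vec16;
  using cvt_vec_type = FP32Vec16;
};

template <>
struct KernelVecType<bf16_t> {
  using load_vec_type = BF16Vec16;  // loads move bf16 lanes...
  using cvt_vec_type = FP32Vec16;   // ...but arithmetic runs in fp32
};
```

Kernels written against KernelVecType<scalar_t> then compile once per supported data type without per-architecture branches in the hot loop.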

Usage

These kernels are compiled as part of the vLLM CPU extension module and are called from Python through the torch C++ extension mechanism. They serve as the primary quantization and matmul backend when running vLLM on CPU with OneDNN acceleration.

Code Reference

Source Location

Signature

void static_scaled_int8_quant(
    torch::Tensor& out,
    const torch::Tensor& input,
    const torch::Tensor& scale,
    std::optional<torch::Tensor> const& azp);

void dynamic_scaled_int8_quant(
    torch::Tensor& out,
    const torch::Tensor& input,
    torch::Tensor& scale,
    std::optional<torch::Tensor> const& azp);

int64_t create_onednn_mm_handler(
    const torch::Tensor& b,
    int64_t primitive_cache_size);

void onednn_mm(
    torch::Tensor& c,
    const torch::Tensor& a,
    const std::optional<torch::Tensor>& bias,
    const torch::Tensor& handler_tensor);
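The handler returned by create_onednn_mm_handler is an opaque int64 that later calls pass back in. A minimal sketch of this opaque-handle pattern, with all names hypothetical (the real helper stores prepacked OneDNN primitive state rather than a registry of raw weights):

```cpp
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <utility>
#include <vector>

// Stand-in for the prepacked weight / primitive state held on the C++ side.
struct PackedWeights {
  std::vector<float> data;
};

// Registry keyed by an int64 handle; Python only ever sees the key.
static std::unordered_map<int64_t, std::unique_ptr<PackedWeights>> g_handlers;
static int64_t g_next_id = 1;

int64_t create_handler(std::vector<float> weights) {
  int64_t id = g_next_id++;
  g_handlers[id] =
      std::make_unique<PackedWeights>(PackedWeights{std::move(weights)});
  return id;  // caller wraps this in a tensor and passes it back later
}

const PackedWeights& lookup(int64_t id) {
  return *g_handlers.at(id);  // throws if the handle was never created
}
```

This keeps expensive weight prepacking out of the per-call path: packing happens once at handler creation, and each matmul only performs a cheap lookup.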

Import

#include "cpu_types.hpp"
#include "dnnl_helper.h"

I/O Contract

Inputs

Name Type Required Description
input torch::Tensor [batch, hidden_size] Yes Floating-point activation tensor to be quantized (float, BFloat16, or Half)
scale torch::Tensor [1] or [batch, 1] Yes Quantization scale factor; for static quant provided externally, for dynamic quant computed in-kernel
azp torch::Tensor [1] or [batch, 1] No Asymmetric zero point for quantization; enables asymmetric quantization range when provided
a torch::Tensor [M, IC] Yes Left-hand activation matrix for matmul (row-major)
b / handler_tensor torch::Tensor Yes Prepacked weight handler (opaque int64 pointer) or raw weight tensor for matmul
bias torch::Tensor [OC] No Optional bias vector added after matmul

Outputs

Name Type Description
out torch::Tensor [batch, hidden_size] (int8) Quantized INT8 output tensor for quantization kernels
c torch::Tensor [M, OC] Matmul result tensor for onednn_mm
handler_id int64_t Opaque handler identifier for create_onednn_mm_handler, used in subsequent onednn_mm calls
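For the dynamic path, the scale listed above is an output rather than an input: each token (row) gets scale = amax / 127 computed in-kernel. A scalar sketch under that assumption, with an illustrative function name:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Scalar reference for the dynamic per-token path: compute each row's
// absolute maximum, derive its scale, then quantize the row with it.
// A real implementation must also guard against all-zero rows (scale == 0).
void dynamic_scaled_int8_quant_ref(const std::vector<std::vector<float>>& input,
                                   std::vector<std::vector<int8_t>>& out,
                                   std::vector<float>& scales) {
  out.resize(input.size());
  scales.resize(input.size());
  for (size_t r = 0; r < input.size(); ++r) {
    float amax = 0.0f;
    for (float v : input[r]) amax = std::max(amax, std::fabs(v));
    float scale = amax / 127.0f;  // largest value maps to +/-127
    scales[r] = scale;
    out[r].resize(input[r].size());
    for (size_t c = 0; c < input[r].size(); ++c)
      out[r][c] = static_cast<int8_t>(std::nearbyint(input[r][c] / scale));
  }
}
```

Because the scale adapts to each row, outliers in one token do not degrade the quantization resolution of other tokens, at the cost of the extra amax pass.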

Usage Examples

// Static INT8 quantization with per-tensor scale
torch::Tensor input = /* [num_tokens, hidden_size] float */;
torch::Tensor quantized = torch::empty_like(input, torch::kInt8);
torch::Tensor scale = torch::tensor({0.05f});
static_scaled_int8_quant(quantized, input, scale, std::nullopt);

// OneDNN matmul with prepacked weights
int64_t handler = create_onednn_mm_handler(weight, /*primitive_cache_size=*/16);
torch::Tensor handler_t = torch::tensor(handler, torch::kLong);
torch::Tensor output = /* preallocated [M, OC] float tensor */;
onednn_mm(output, activation, bias, handler_t);
