
Implementation:Vllm project Vllm DNNL Kernels

From Leeroopedia


Knowledge Sources
Domains CPU_Inference, DNNL, Quantization
Last Updated 2026-02-08 00:00 GMT

Overview

Implements OneDNN-based INT8 quantization kernels and matmul operations for CPU inference, providing both a static per-tensor quantization path and a dynamic per-token quantization path.

Description

This file provides vectorized INT8 quantization implementations: static_scaled_int8_quant_impl performs per-tensor quantization with optional asymmetric zero-point support, while dynamic_scaled_int8_quant_impl computes per-token scales on the fly. It also exposes onednn_mm for general floating-point matmul and onednn_qmatmul for quantized matmul, both built on prepacked OneDNN primitives managed by the DNNL helper layer.
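To make the static path concrete, here is a scalar reference sketch of per-tensor INT8 quantization with an optional zero point. The function name and loop are illustrative only; the real kernel vectorizes this computation with architecture-specific SIMD types and OpenMP.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Scalar reference for the static per-tensor path:
//   out[i] = clamp(round(input[i] / scale) + azp, -128, 127)
// Hypothetical helper name; the production kernel processes whole SIMD
// vectors per iteration instead of one element.
std::vector<int8_t> static_scaled_int8_quant_ref(const std::vector<float>& input,
                                                 float scale, int32_t azp = 0) {
  std::vector<int8_t> out(input.size());
  for (size_t i = 0; i < input.size(); ++i) {
    // nearbyint uses the current rounding mode (round-to-nearest-even by default)
    float q = std::nearbyint(input[i] / scale) + static_cast<float>(azp);
    out[i] = static_cast<int8_t>(std::clamp(q, -128.0f, 127.0f));
  }
  return out;
}
```

With a nonzero azp the representable range shifts, which is what enables the asymmetric quantization mentioned above.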

The kernels use architecture-specific vector types (AVX-512, ARM NEON, Power VSX) via the KernelVecType template specializations for float, BFloat16, and Half data types, with OpenMP parallelism for multi-core scaling.
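The dispatch described above can be sketched as a primary template with per-scalar-type specializations. The struct names below other than KernelVecType are illustrative stand-ins, not the real AVX-512/NEON/VSX wrappers from cpu_types.hpp.

```cpp
#include <cstdint>

// Illustrative SIMD wrapper stand-ins (assumed names, fixed lane counts).
struct FP32Vec16 { static constexpr int VEC_ELEM_NUM = 16; };
struct BF16Vec16 { static constexpr int VEC_ELEM_NUM = 16; };
struct bf16_t { uint16_t bits; };  // placeholder bfloat16 scalar

// Primary template is left undefined, so unsupported scalar types fail to
// compile at the point of use.
template <typename scalar_t>
struct KernelVecType;

template <>
struct KernelVecType<float> {
  using load_vec_type = FP32Vec16;
  using cvt_vec_type = FP32Vec16;
};

template <>
struct KernelVecType<bf16_t> {
  using load_vec_type = BF16Vec16;  // loads move bf16 lanes...
  using cvt_vec_type = FP32Vec16;   // ...but arithmetic runs in fp32
};
```

Kernels written against KernelVecType<scalar_t> then compile once per supported data type without per-architecture branches in the hot loop.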

Usage

These kernels are compiled as part of the vLLM CPU extension module and are called from Python through the torch C++ extension mechanism. They serve as the primary quantization and matmul backend when running vLLM on CPU with OneDNN acceleration.

Code Reference

Source Location

Signature

void static_scaled_int8_quant(
    torch::Tensor& out,
    const torch::Tensor& input,
    const torch::Tensor& scale,
    std::optional<torch::Tensor> const& azp);

void dynamic_scaled_int8_quant(
    torch::Tensor& out,
    const torch::Tensor& input,
    torch::Tensor& scale,
    std::optional<torch::Tensor> const& azp);

int64_t create_onednn_mm_handler(
    const torch::Tensor& b,
    int64_t primitive_cache_size);

void onednn_mm(
    torch::Tensor& c,
    const torch::Tensor& a,
    const std::optional<torch::Tensor>& bias,
    const torch::Tensor& handler_tensor);
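The handler returned by create_onednn_mm_handler is an opaque int64 that later calls pass back in. A minimal sketch of this opaque-handle pattern, with all names hypothetical (the real helper stores prepacked OneDNN primitive state rather than a registry of raw weights):

```cpp
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <utility>
#include <vector>

// Stand-in for the prepacked weight / primitive state held on the C++ side.
struct PackedWeights {
  std::vector<float> data;
};

// Registry keyed by an int64 handle; Python only ever sees the key.
static std::unordered_map<int64_t, std::unique_ptr<PackedWeights>> g_handlers;
static int64_t g_next_id = 1;

int64_t create_handler(std::vector<float> weights) {
  int64_t id = g_next_id++;
  g_handlers[id] =
      std::make_unique<PackedWeights>(PackedWeights{std::move(weights)});
  return id;  // caller wraps this in a tensor and passes it back later
}

const PackedWeights& lookup(int64_t id) {
  return *g_handlers.at(id);  // throws if the handle was never created
}
```

This keeps expensive weight prepacking out of the per-call path: packing happens once at handler creation, and each matmul only performs a cheap lookup.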

Import

#include "cpu_types.hpp"
#include "dnnl_helper.h"

I/O Contract

Inputs

Name Type Required Description
input torch::Tensor [batch, hidden_size] Yes Floating-point activation tensor to be quantized (float, BFloat16, or Half)
scale torch::Tensor [1] or [batch, 1] Yes Quantization scale factor; for static quant provided externally, for dynamic quant computed in-kernel
azp torch::Tensor [1] or [batch, 1] No Asymmetric zero point for quantization; enables asymmetric quantization range when provided
a torch::Tensor [M, IC] Yes Left-hand activation matrix for matmul (row-major)
b / handler_tensor torch::Tensor Yes Prepacked weight handler (opaque int64 pointer) or raw weight tensor for matmul
bias torch::Tensor [OC] No Optional bias vector added after matmul

Outputs

Name Type Description
out torch::Tensor [batch, hidden_size] (int8) Quantized INT8 output tensor for quantization kernels
c torch::Tensor [M, OC] Matmul result tensor for onednn_mm
handler_id int64_t Opaque handler identifier for create_onednn_mm_handler, used in subsequent onednn_mm calls
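For the dynamic path, the scale listed above is an output rather than an input: each token (row) gets scale = amax / 127 computed in-kernel. A scalar sketch under that assumption, with an illustrative function name:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Scalar reference for the dynamic per-token path: compute each row's
// absolute maximum, derive its scale, then quantize the row with it.
// A real implementation must also guard against all-zero rows (scale == 0).
void dynamic_scaled_int8_quant_ref(const std::vector<std::vector<float>>& input,
                                   std::vector<std::vector<int8_t>>& out,
                                   std::vector<float>& scales) {
  out.resize(input.size());
  scales.resize(input.size());
  for (size_t r = 0; r < input.size(); ++r) {
    float amax = 0.0f;
    for (float v : input[r]) amax = std::max(amax, std::fabs(v));
    float scale = amax / 127.0f;  // largest value maps to +/-127
    scales[r] = scale;
    out[r].resize(input[r].size());
    for (size_t c = 0; c < input[r].size(); ++c)
      out[r][c] = static_cast<int8_t>(std::nearbyint(input[r][c] / scale));
  }
}
```

Because the scale adapts to each row, outliers in one token do not degrade the quantization resolution of other tokens, at the cost of the extra amax pass.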

Usage Examples

// Static INT8 quantization with per-tensor scale
torch::Tensor input = /* [num_tokens, hidden_size] float */;
torch::Tensor quantized = torch::empty_like(input, torch::kInt8);
torch::Tensor scale = torch::tensor({0.05f});
static_scaled_int8_quant(quantized, input, scale, std::nullopt);

// OneDNN matmul with prepacked weights
int64_t handler = create_onednn_mm_handler(weight, /*primitive_cache_size=*/16);
torch::Tensor handler_t = torch::tensor(handler, torch::kLong);
torch::Tensor output = /* preallocated [M, OC] float tensor */;
onednn_mm(output, activation, bias, handler_t);
