Implementation: vLLM DNNL Kernels (vllm project)
| Knowledge Sources | |
|---|---|
| Domains | CPU_Inference, DNNL, Quantization |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Implements OneDNN-based INT8 quantization kernels and matmul operations for CPU inference, providing both static per-tensor and dynamic per-token quantization paths.
Description
This file provides vectorized INT8 quantization implementations: static_scaled_int8_quant_impl performs per-tensor quantization with optional asymmetric zero-point support, and dynamic_scaled_int8_quant_impl computes per-token scales on the fly. It also exposes onednn_mm for general floating-point matmul and onednn_qmatmul for quantized matmul, both built on prepacked OneDNN primitives managed by the DNNL helper layer.
The kernels use architecture-specific vector types (AVX-512, ARM NEON, Power VSX) via the KernelVecType template specializations for float, BFloat16, and Half data types, with OpenMP parallelism for multi-core scaling.
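The exact vector widths and member typedefs are ISA- and build-specific; the following is a minimal sketch of the specialization pattern only, with hypothetical vector wrapper types and member names rather than the actual definitions in cpu_types.hpp:
#include <cstdint>
// Illustrative sketch of the KernelVecType specialization pattern; the real
// vector types and member names come from cpu_types.hpp and differ per ISA.
struct FP32Vec16 { float v[16]; };      // stand-in for an AVX-512-width float vector
struct BF16Vec16 { uint16_t v[16]; };   // stand-in for a packed bfloat16 vector
template <typename scalar_t>
struct KernelVecType;  // primary template intentionally left undefined
template <>
struct KernelVecType<float> {
  using load_vec_type = FP32Vec16;  // type used to load activations
  using cvt_vec_type = FP32Vec16;   // type used for float math before storing int8
};
// BFloat16 and Half specializations follow the same pattern with their own load types.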
Usage
These kernels are compiled as part of the vLLM CPU extension module and are called from Python through the torch C++ extension mechanism. They serve as the primary quantization and matmul backend when running vLLM on CPU with OneDNN acceleration.
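As a rough illustration of that binding path, a registration might look like the sketch below; the library name and schema string here are assumptions for illustration, not vLLM's actual bindings (which live in the CPU torch bindings translation unit):
#include <optional>
#include <torch/torch.h>
#include <torch/library.h>
// Declaration matching the documented signature.
void static_scaled_int8_quant(torch::Tensor& out, const torch::Tensor& input,
                              const torch::Tensor& scale,
                              std::optional<torch::Tensor> const& azp);
// Hypothetical registration sketch: exposes the kernel to Python as
// torch.ops.example_cpu_ops.static_scaled_int8_quant(...).
TORCH_LIBRARY(example_cpu_ops, m) {
  m.def(
      "static_scaled_int8_quant(Tensor! out, Tensor input, Tensor scale, "
      "Tensor? azp) -> ()");
  m.impl("static_scaled_int8_quant", torch::kCPU, &static_scaled_int8_quant);
}
After registration, Python code would reach the kernel as torch.ops.example_cpu_ops.static_scaled_int8_quant(out, input, scale, None).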
Code Reference
Source Location
- Repository: vllm
- File: csrc/cpu/dnnl_kernels.cpp
- Lines: 1-570
Signature
void static_scaled_int8_quant(
    torch::Tensor& out,
    const torch::Tensor& input,
    const torch::Tensor& scale,
    std::optional<torch::Tensor> const& azp);

void dynamic_scaled_int8_quant(
    torch::Tensor& out,
    const torch::Tensor& input,
    torch::Tensor& scale,
    std::optional<torch::Tensor> const& azp);

int64_t create_onednn_mm_handler(
    const torch::Tensor& b,
    int64_t primitive_cache_size);

void onednn_mm(
    torch::Tensor& c,
    const torch::Tensor& a,
    const std::optional<torch::Tensor>& bias,
    const torch::Tensor& handler_tensor);
Import
#include "cpu_types.hpp"
#include "dnnl_helper.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input | torch::Tensor [batch, hidden_size] | Yes | Floating-point activation tensor to be quantized (float, BFloat16, or Half) |
| scale | torch::Tensor [1] or [batch, 1] | Yes | Quantization scale; supplied by the caller for static quantization, computed and written back by the kernel for dynamic per-token quantization |
| azp | torch::Tensor [1] or [batch, 1] | No | Asymmetric zero point; enables an asymmetric quantization range when provided (see the scalar sketch after this table) |
| a | torch::Tensor [M, IC] | Yes | Left-hand activation matrix for matmul (row-major) |
| b / handler_tensor | torch::Tensor | Yes | Raw weight tensor b passed to create_onednn_mm_handler, or the opaque int64 handler wrapped in a tensor (handler_tensor) passed to onednn_mm |
| bias | torch::Tensor [OC] | No | Optional bias vector added after matmul |
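For reference, the per-element contract implied by scale and azp can be written as a scalar sketch; this is illustrative only, and the real kernels apply the same math to whole rows using vectorized loads and OpenMP:
#include <algorithm>
#include <cmath>
#include <cstdint>
// Scalar sketch of asymmetric static INT8 quantization: divide by the scale,
// round, shift by the zero point, and clamp to the int8 range. The vectorized
// kernel performs the equivalent operation with ISA-specific intrinsics.
inline int8_t quantize_one(float x, float scale, int32_t azp) {
  float q = std::nearbyint(x / scale) + static_cast<float>(azp);
  return static_cast<int8_t>(std::clamp(q, -128.0f, 127.0f));
}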
Outputs
| Name | Type | Description |
|---|---|---|
| out | torch::Tensor [batch, hidden_size] (int8) | Quantized INT8 output tensor for quantization kernels |
| c | torch::Tensor [M, OC] | Matmul result tensor for onednn_mm |
| handler_id | int64_t | Opaque handler identifier returned by create_onednn_mm_handler; wrapped in a tensor and passed to subsequent onednn_mm calls |
Usage Examples
// Static INT8 quantization with per-tensor scale
torch::Tensor input = /* [num_tokens, hidden_size] float */;
torch::Tensor output = torch::empty_like(input, torch::kInt8);
torch::Tensor scale = torch::tensor({0.05f});
static_scaled_int8_quant(output, input, scale, std::nullopt);
// OneDNN matmul with prepacked weights
torch::Tensor weight = /* weight matrix to prepack */;
int64_t handler = create_onednn_mm_handler(weight, /*primitive_cache_size=*/16);
torch::Tensor handler_t = torch::tensor(handler, torch::kLong);
torch::Tensor activation = /* [M, IC] float */;
torch::Tensor bias = /* [OC] bias, or pass std::nullopt */;
torch::Tensor result = /* [M, OC] output buffer */;
onednn_mm(result, activation, bias, handler_t);
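The dynamic per-token path follows the same calling pattern; here scale is an output buffer sized one entry per token and filled by the kernel (a sketch consistent with the I/O contract above):
// Dynamic per-token INT8 quantization: scales are computed by the kernel
torch::Tensor acts = /* [num_tokens, hidden_size] float */;
torch::Tensor quantized = torch::empty_like(acts, torch::kInt8);
torch::Tensor scales = torch::empty({acts.size(0), 1}, acts.options());
dynamic_scaled_int8_quant(quantized, acts, scales, std::nullopt);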