Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Bitsandbytes foundation Bitsandbytes Matmul Perf Model

From Leeroopedia
Revision as of 14:34, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Bitsandbytes_foundation_Bitsandbytes_Matmul_Perf_Model.md)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)


Knowledge Sources
Domains Performance_Modeling, Triton, Autotuning
Last Updated 2026-02-07 13:31 GMT

Overview

Analytical performance model for Triton matmul kernels that estimates execution time and prunes autotuning configurations based on hardware constraints and compute-memory overlap.

Description

This module provides an analytical performance model adapted from the Triton kernels repository. estimate_matmul_time calculates the expected execution time of a tiled matrix multiplication kernel by modeling three components: compute time (based on tensor core TFLOPS), memory loading time (modeling DRAM and L2 cache bandwidth with CTA occupancy), and store time. The total time is max(compute, load) + store. early_config_prune uses this model plus hardware constraints (shared memory limits, supported dtypes, optimal pipeline stages for Ampere+ GPUs) to reduce the Triton autotuning search space.

Usage

Used internally by the Triton INT8 matmul kernels to speed up autotuning by pruning unpromising configurations before benchmarking. It avoids testing configurations that would exceed shared memory or that have suboptimal pipeline depth.

Code Reference

Source Location

Signature

def estimate_matmul_time(
    num_warps: int, num_stages: int,
    A: torch.Tensor, B: torch.Tensor, C: torch.Tensor,
    M: int, N: int, K: int,
    BLOCK_M: int, BLOCK_N: int, BLOCK_K: int, SPLIT_K: int,
    debug: bool = False, **kwargs
) -> float:
    """Return estimated kernel running time in milliseconds."""

def early_config_prune(configs, named_args, **kwargs) -> list:
    """Prune Triton autotuning configs based on hardware constraints."""

def get_tflops(device, num_ctas, num_warps, dtype) -> float:
    """Return compute throughput in TFLOPS for given occupancy."""

Import

from bitsandbytes.triton.matmul_perf_model import (
    estimate_matmul_time,
    early_config_prune,
)

I/O Contract

Inputs

Name Type Required Description
M, N, K int Yes Matrix dimensions
BLOCK_M, BLOCK_N, BLOCK_K int Yes Tiling block sizes
SPLIT_K int Yes K-dimension split factor
num_warps int Yes Warps per CTA
num_stages int Yes Pipeline stages
A, B torch.Tensor Yes Input tensors (for dtype and element_size)

Outputs

Name Type Description
time_ms float Estimated kernel execution time in milliseconds
pruned_configs list Filtered autotuning configurations

Usage Examples

Estimating Matmul Time

from bitsandbytes.triton.matmul_perf_model import estimate_matmul_time

A = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
B = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

time_ms = estimate_matmul_time(
    num_warps=4, num_stages=3,
    A=A, B=B, C=None,
    M=4096, N=4096, K=4096,
    BLOCK_M=128, BLOCK_N=128, BLOCK_K=32, SPLIT_K=1,
    debug=True,
)

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment