Implementation:NVIDIA TransformerEngine Swizzle C API

Field	Value
Sources	TransformerEngine
Domains	Deep_Learning, Optimization
Last Updated	2026-02-07 14:00 GMT

Overview

Declares the C API for swizzling FP8 scaling factors into the interleaved memory layout required by cuBLASLt GEMM kernels.

Description

swizzle.h exposes three extern "C" functions:

nvte_swizzle_scaling_factors: Converts a single tensor's row-major scale_inv into the interleaved format. Requirements: scale_inv is stored row-major, its size is padded to a multiple of 128x4 (row-scale) or 4x128 (col-scale), and the data is quantized along the K dimension.
nvte_multi_tensor_swizzle_scaling_factors: Performs the same operation on num_tensors tensors in a single kernel launch, reducing launch overhead.
nvte_swizzle_block_scaling_to_mxfp8_scaling_factors: Converts FP8 block-scaling factors into the MXFP8 interleaved layout. It is used to emulate FP8 block scaling on Blackwell and newer architectures, where cuBLASLt does not support block scaling natively.
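The padding requirement above can be sketched as a small host-side helper. This is only an illustration, not part of swizzle.h: the names round_up, ScaleShape, padded_row_scale_shape and padded_col_scale_shape are hypothetical, and the block size of 32 elements per scale is an assumption taken from the MXFP8 format.

```cpp
#include <cassert>
#include <cstddef>

// Round `value` up to the next multiple of `multiple` (multiple must be > 0).
inline std::size_t round_up(std::size_t value, std::size_t multiple) {
  assert(multiple > 0);
  return (value + multiple - 1) / multiple * multiple;
}

// Hypothetical helper type: shape of a padded scale_inv matrix.
struct ScaleShape {
  std::size_t rows;
  std::size_t cols;
};

// Row-scale case: one scale per `block` consecutive elements of each row.
// Assumption: block = 32; padding granularity is 128 rows x 4 columns.
inline ScaleShape padded_row_scale_shape(std::size_t rows, std::size_t cols,
                                         std::size_t block = 32) {
  assert(block > 0);
  const std::size_t scales_per_row = (cols + block - 1) / block;
  return {round_up(rows, 128), round_up(scales_per_row, 4)};
}

// Col-scale case: one scale per `block` consecutive elements of each column.
// Padding granularity is the transposed tile, 4 rows x 128 columns.
inline ScaleShape padded_col_scale_shape(std::size_t rows, std::size_t cols,
                                         std::size_t block = 32) {
  assert(block > 0);
  const std::size_t scales_per_col = (rows + block - 1) / block;
  return {round_up(scales_per_col, 4), round_up(cols, 128)};
}
```

Under these assumptions, a 100x200 tensor needs a 128x8 scale_inv buffer for row scales and a 4x256 buffer for column scales.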

Without swizzling, FP8 GEMM results are numerically incorrect, because the tensor core kernels read scale factors in a specific interleaved memory pattern and would otherwise apply the wrong scale to each block.
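To make the interleaved pattern concrete, here is a CPU reference sketch for the row-scale case. It is not the TransformerEngine kernel. The index formula is an assumption based on the 128x4 tile layout that cuBLAS documents for 1D block scaling factors, and the tile ordering (row-major over the tile grid) is also assumed. The names tile_offset and swizzle_reference are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Offset of scale (row, col) inside one 128x4 tile after interleaving.
// Assumed layout: 32 groups of 16 bytes; group g holds rows g, g+32, g+64
// and g+96, each contributing its 4 consecutive scales.
inline std::size_t tile_offset(std::size_t row, std::size_t col) {
  assert(row < 128 && col < 4);
  return (row % 32) * 16 + (row / 32) * 4 + col;
}

// Reorder a row-major, already padded (rows x cols) matrix of 1-byte scales.
// Each 128x4 tile becomes 512 contiguous bytes; tiles are assumed to follow
// one another in row-major order over the tile grid.
inline std::vector<std::uint8_t> swizzle_reference(
    const std::vector<std::uint8_t>& src, std::size_t rows, std::size_t cols) {
  assert(rows % 128 == 0 && cols % 4 == 0);
  assert(src.size() == rows * cols);
  std::vector<std::uint8_t> dst(src.size());
  const std::size_t tiles_per_row = cols / 4;
  for (std::size_t r = 0; r < rows; ++r) {
    for (std::size_t c = 0; c < cols; ++c) {
      const std::size_t tile = (r / 128) * tiles_per_row + (c / 4);
      dst[tile * 512 + tile_offset(r % 128, c % 4)] = src[r * cols + c];
    }
  }
  return dst;
}
```

A reference like this is mainly useful in unit tests, where its output can be compared against the device result copied back to the host.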

Usage

Call these functions after quantization and before GEMM execution to transform the scaling factors into the layout expected by cuBLASLt.

Code Reference

Source Location

Repository: NVIDIA/TransformerEngine
File: transformer_engine/common/include/transformer_engine/swizzle.h
Lines: 1-71

Signature

void nvte_swizzle_scaling_factors(const NVTETensor input, NVTETensor output,
                                  cudaStream_t stream);

void nvte_multi_tensor_swizzle_scaling_factors(const NVTETensor* inputs,
                                               NVTETensor* outputs,
                                               const size_t num_tensors,
                                               cudaStream_t stream);

void nvte_swizzle_block_scaling_to_mxfp8_scaling_factors(
    const NVTETensor input, NVTETensor output, cudaStream_t stream);

Import

#include "transformer_engine/swizzle.h"

I/O Contract

Inputs

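Name	Type	Required	Description
`input`	`const NVTETensor`	Yes (single-tensor functions)	Tensor with non-swizzled scale_inv
`inputs`	`const NVTETensor*`	Yes (multi-tensor function)	Array of num_tensors tensors with non-swizzled scale_inv
`num_tensors`	`const size_t`	Yes (multi-tensor function)	Number of tensors in inputs and outputs
`stream`	`cudaStream_t`	Yes	CUDA stream on which the operation runs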

Outputs

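Name	Type	Description
`output`	`NVTETensor`	Tensor that receives the swizzled scale_inv for GEMM (single-tensor functions)
`outputs`	`NVTETensor*`	Array of num_tensors tensors that receive the swizzled scale_inv (multi-tensor function)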

Usage Examples

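#include "transformer_engine/swizzle.h"

// Assumes quantized_tensor, gemm_ready_tensor, inputs, outputs, num_tensors
// and stream were created earlier by the caller; they are not defined here.

// Swizzle the scaling factors of one tensor before GEMM.
nvte_swizzle_scaling_factors(quantized_tensor, gemm_ready_tensor, stream);

// Multi-tensor version: inputs and outputs are arrays of num_tensors handles,
// processed in a single kernel launch.
nvte_multi_tensor_swizzle_scaling_factors(inputs, outputs, num_tensors, stream);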

Related Pages

Environment:NVIDIA_TransformerEngine_CUDA_Toolkit_Requirements
