
Implementation:NVIDIA TransformerEngine Swizzle C API

From Leeroopedia
Revision as of 16:00, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/NVIDIA_TransformerEngine_Swizzle_C_API.md)


Field Value
Sources TransformerEngine
Domains Deep_Learning, Optimization
Last Updated 2026-02-07 14:00 GMT

Overview

Declares the C API for swizzling FP8 scaling factors (scale_inv) from row-major storage into the interleaved memory layout required by cuBLASLt GEMM kernels.

Description

swizzle.h exposes three extern "C" functions:

  • nvte_swizzle_scaling_factors: Converts a single tensor's row-major scale_inv into the interleaved format. Requirements: scale_inv is stored row-major; its size is padded to 128x4 for row-scale or 4x128 for col-scale; and the data is quantized along the K dimension.
  • nvte_multi_tensor_swizzle_scaling_factors: Performs the same operation on multiple tensors in a single kernel launch, reducing launch overhead.
  • nvte_swizzle_block_scaling_to_mxfp8_scaling_factors: Converts FP8 block-scaling factors into the MXFP8 interleaved layout. It is used to emulate block scaling on Blackwell and newer architectures, where cuBLASLt does not support block scaling natively.
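The padding requirement on scale_inv can be sketched as a small shape calculation. This is a minimal illustration, not code from swizzle.h: `round_up`, `ScaleShape`, and `padded_scale_shape` are hypothetical names, and it assumes "padded to 128x4" means each dimension is rounded up to the next multiple of 128 and 4 respectively (swapped for col-scale).

```c
#include <assert.h>
#include <stddef.h>

/* Round x up to the next multiple of m. */
static size_t round_up(size_t x, size_t m) {
    assert(m > 0);
    return (x + m - 1) / m * m;
}

/* Hypothetical helper type: shape of a scale_inv matrix. */
typedef struct {
    size_t rows;
    size_t cols;
} ScaleShape;

/* Hypothetical helper: padded scale_inv shape for the swizzle functions.
 * rowwise != 0 -> row-scale, padded to multiples of 128 x 4
 * rowwise == 0 -> col-scale, padded to multiples of 4 x 128
 * (assumption: padding is applied independently per dimension). */
static ScaleShape padded_scale_shape(size_t rows, size_t cols, int rowwise) {
    ScaleShape s;
    if (rowwise) {
        s.rows = round_up(rows, 128);
        s.cols = round_up(cols, 4);
    } else {
        s.rows = round_up(rows, 4);
        s.cols = round_up(cols, 128);
    }
    return s;
}
```

Under these assumptions, a 200x6 row-scale matrix would need a 256x8 buffer, while a matrix that is already 128x4 is left unchanged.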

Without swizzling, FP8 GEMM results are numerically incorrect: the tensor core kernels read scale factors in a specific interleaved memory pattern, so scales left in row-major order would be applied to the wrong blocks.
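To make the interleaved pattern concrete, here is a CPU reference sketch of one widely documented 128x4 tile interleaving for 1D block scaling factors. It is an assumption that this matches the layout the TransformerEngine kernels produce; `tile_offset` and `swizzle_rowwise_ref` are hypothetical names, and one byte per scale factor is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { TILE_ROWS = 128, TILE_COLS = 4, TILE_ELEMS = TILE_ROWS * TILE_COLS };

/* Offset of element (r, c) inside one 128x4 tile after interleaving.
 * The 128 rows are treated as 4 groups of 32; rows with the same index
 * within their group end up adjacent (assumed layout, see lead-in). */
static size_t tile_offset(size_t r, size_t c) {
    assert(r < TILE_ROWS && c < TILE_COLS);
    return (r % 32) * 16 + (r / 32) * 4 + c;
}

/* CPU reference: copy a row-major [rows x cols] scale matrix into the
 * tiled, interleaved layout. rows and cols must already be padded to
 * multiples of 128 and 4. Tiles are stored in row-major tile order
 * (assumption). */
static void swizzle_rowwise_ref(const uint8_t *src, uint8_t *dst,
                                size_t rows, size_t cols) {
    assert(rows % TILE_ROWS == 0 && cols % TILE_COLS == 0);
    size_t tiles_per_row = cols / TILE_COLS;
    for (size_t r = 0; r < rows; ++r) {
        for (size_t c = 0; c < cols; ++c) {
            size_t tile = (r / TILE_ROWS) * tiles_per_row + (c / TILE_COLS);
            size_t off = tile_offset(r % TILE_ROWS, c % TILE_COLS);
            dst[tile * TILE_ELEMS + off] = src[r * cols + c];
        }
    }
}
```

In this sketch, the element at row 1, column 0 of a tile moves from row-major offset 4 to offset 16, which shows why a kernel expecting the interleaved layout would pick up the wrong scale from an unswizzled buffer.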

Usage

Call these functions after quantization and before GEMM execution to transform scaling factors into the layout expected by cuBLASLt.

Code Reference

Source Location

Repository
NVIDIA/TransformerEngine
File
transformer_engine/common/include/transformer_engine/swizzle.h
Lines
1-71

Signature

void nvte_swizzle_scaling_factors(const NVTETensor input, NVTETensor output,
                                  cudaStream_t stream);

void nvte_multi_tensor_swizzle_scaling_factors(const NVTETensor* inputs,
                                               NVTETensor* outputs,
                                               const size_t num_tensors,
                                               cudaStream_t stream);

void nvte_swizzle_block_scaling_to_mxfp8_scaling_factors(
    const NVTETensor input, NVTETensor output, cudaStream_t stream);

Import

#include "transformer_engine/swizzle.h"

I/O Contract

Inputs

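Name | Type | Required | Description
input | const NVTETensor | Yes | Tensor with non-swizzled scale_inv
inputs | const NVTETensor* | Multi-tensor variant only | Array of num_tensors tensors with non-swizzled scale_inv
num_tensors | const size_t | Multi-tensor variant only | Number of tensors in inputs and outputs
stream | cudaStream_t | Yes | CUDA stream used for the operation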

Outputs

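Name | Type | Description
output | NVTETensor | Tensor with swizzled scale_inv for GEMM
outputs | NVTETensor* | Multi-tensor variant only: array of num_tensors tensors with swizzled scale_inv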

Usage Examples

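#include "transformer_engine/swizzle.h"

// quantized_tensor and gemm_ready_tensor are NVTETensor handles created by the
// caller; stream is the cudaStream_t used for the operation.
// Swizzle scaling factors before GEMM
nvte_swizzle_scaling_factors(quantized_tensor, gemm_ready_tensor, stream);

// Multi-tensor version for batch processing: inputs and outputs are arrays of
// num_tensors NVTETensor handles, processed in a single kernel launch.
nvte_multi_tensor_swizzle_scaling_factors(inputs, outputs, num_tensors, stream);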
