Implementation:NVIDIA TransformerEngine Padding C API

From Leeroopedia


Field Value
Sources TransformerEngine
Domains Deep_Learning, Optimization
Last Updated 2026-02-07 14:00 GMT

Overview

Declares the C API for padding and unpadding a batch of 2D tensors along the row dimension, so that row counts can be aligned for efficient GEMM and other batch-oriented computation.

Description

padding.h exposes two extern "C" functions:

  • nvte_multi_padding: Pads each input tensor by appending zero-filled rows at the bottom until it reaches the specified padded row count. Operates on arrays of NVTETensor handles with a matching array of padded row counts.
  • nvte_multi_unpadding: Removes the bottom rows of each input tensor to restore its original unpadded row count. It is the inverse of nvte_multi_padding.

Both functions run on the single CUDA stream passed by the caller. Only bottom padding is supported: rows are added or removed at the end of the row dimension, never at the top.
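The bottom-padding semantics described above can be illustrated with a small CPU reference. This is only a sketch under stated assumptions: ref_pad_bottom and ref_unpad_bottom are hypothetical helpers, not TransformerEngine functions, and they assume row-major float data in host memory, whereas the library operates on NVTETensor handles on the GPU.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical CPU reference, NOT part of TransformerEngine.
 * Copies a row-major [rows x cols] matrix into a [padded_rows x cols]
 * buffer and zero-fills the extra rows at the bottom. */
void ref_pad_bottom(const float *in, float *out,
                    size_t rows, size_t cols, size_t padded_rows) {
    assert(padded_rows >= rows);
    /* The original rows keep their position at the top of the output. */
    memcpy(out, in, rows * cols * sizeof(float));
    /* Remaining rows are zero (all-zero bytes are 0.0f for IEEE 754 floats). */
    memset(out + rows * cols, 0, (padded_rows - rows) * cols * sizeof(float));
}

/* Hypothetical inverse: keep only the first `rows` rows of a padded matrix. */
void ref_unpad_bottom(const float *in, float *out, size_t rows, size_t cols) {
    memcpy(out, in, rows * cols * sizeof(float));
}
```

Padding followed by unpadding with the same original row count returns the input unchanged, which is the round trip the two API functions are meant to provide.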

Usage

Use when variable-length sequences need to be padded to uniform dimensions for GEMM or when results need to be unpadded back to original sizes.
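One way to choose the padded row counts is to round each original row count up to a multiple of some alignment. The sketch below is illustrative only: round_up_rows and fill_padded_rows are hypothetical helpers, and the alignment value is an assumption that depends on the GEMM kernel and data type in use. It is not taken from padding.h.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round `rows` up to the next multiple of `alignment`. */
int round_up_rows(int rows, int alignment) {
    assert(rows >= 0 && alignment > 0);
    return (rows + alignment - 1) / alignment * alignment;
}

/* Hypothetical helper: fill an int array of padded row counts, in the
 * const int* form that the row-count list parameter expects. */
void fill_padded_rows(const int *rows, int *padded_rows,
                      size_t num_tensors, int alignment) {
    for (size_t i = 0; i < num_tensors; ++i) {
        padded_rows[i] = round_up_rows(rows[i], alignment);
    }
}
```

With an assumed alignment of 128, original row counts of 100, 200, and 115 become 128, 256, and 128. The resulting array can be passed as padded_num_rows_list, and the original counts can be kept to pass as unpadded_num_rows_list later.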

Code Reference

Source Location

Repository
NVIDIA/TransformerEngine
File
transformer_engine/common/include/transformer_engine/padding.h
Lines
1-78

Signature

void nvte_multi_padding(size_t num_tensors, const NVTETensor* input_list,
                        NVTETensor* output_list,
                        const int* padded_num_rows_list,
                        cudaStream_t stream);

void nvte_multi_unpadding(size_t num_tensors, const NVTETensor* input_list,
                          NVTETensor* output_list,
                          const int* unpadded_num_rows_list,
                          cudaStream_t stream);

Import

#include "transformer_engine/padding.h"

I/O Contract

Inputs

Name Type Required Description
num_tensors size_t Yes Number of tensors to process
input_list const NVTETensor* Yes Array of 2D input tensors
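padded_num_rows_list const int* Yes (nvte_multi_padding) Target padded row count per tensor
unpadded_num_rows_list const int* Yes (nvte_multi_unpadding) Original unpadded row count per tensor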
stream cudaStream_t Yes CUDA stream

Outputs

Name Type Description
output_list NVTETensor* Array of padded (or unpadded) output tensors

Usage Examples

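#include "transformer_engine/padding.h"

// Assumes every list below is an array of 3 NVTETensor handles created
// elsewhere, and that stream is a valid cudaStream_t.

// Pad 3 tensors to aligned row counts
int padded_rows[] = {128, 256, 128};
nvte_multi_padding(3, input_list, output_list, padded_rows, stream);

// Unpad back to original sizes: padded_list holds the padded tensors,
// unpadded_list receives the results
int original_rows[] = {100, 200, 115};
nvte_multi_unpadding(3, padded_list, unpadded_list, original_rows, stream);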
