Implementation:NVIDIA TransformerEngine Dropout C API

Field	Value
Sources	TransformerEngine
Domains	Deep_Learning, Optimization
Last Updated	2026-02-07 14:00 GMT

Overview

Declares the C API for forward and backward dropout operations on GPU tensors, using a bit-packed mask representation for memory efficiency.

Description

dropout.h exposes two extern "C" functions:

nvte_dropout_fwd: Generates a random binary mask using the provided RNG state and dropout probability, applies the mask element-wise to the input, and writes both the scaled output and the compact bit-packed mask tensor. Each bit in the mask corresponds to one output element (1 = kept, 0 = dropped).
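nvte_dropout_bwd: Takes the incoming gradient and the stored mask and computes the input gradient by re-applying the same mask pattern with the same scaling as the forward pass. Under the standard inverted-dropout convention that scale is 1 / (1 - dropout_probability), so dropped positions receive a zero gradient and kept positions receive the scaled gradient.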
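The forward and backward semantics described above can be sketched as a CPU reference. This is an illustration only, not the library's kernel: the names ref_dropout_fwd and ref_dropout_bwd, the caller-supplied uniform samples u (standing in for the RNG state), the least-significant-bit-first packing order, and the 1 / (1 - p) scale are all assumptions made for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical CPU reference for the forward pass.
 * u[i] are uniform samples in [0, 1) supplied by the caller (assumption). */
static void ref_dropout_fwd(const float *in, const float *u, size_t n, float p,
                            float *out, uint8_t *mask) {
    assert(p >= 0.0f && p < 1.0f);
    const float scale = 1.0f / (1.0f - p);   /* inverted-dropout scale (assumed) */
    memset(mask, 0, (n + 7) / 8);            /* one bit per element, rounded up */
    for (size_t i = 0; i < n; ++i) {
        const int keep = u[i] >= p;          /* element is dropped with probability p */
        if (keep) {
            mask[i / 8] |= (uint8_t)(1u << (i % 8));  /* 1 = kept, 0 = dropped */
        }
        out[i] = keep ? in[i] * scale : 0.0f;
    }
}

/* Hypothetical CPU reference for the backward pass: same mask, same scale. */
static void ref_dropout_bwd(const float *grad_out, const uint8_t *mask, size_t n,
                            float p, float *grad_in) {
    assert(p >= 0.0f && p < 1.0f);
    const float scale = 1.0f / (1.0f - p);
    for (size_t i = 0; i < n; ++i) {
        const int keep = (mask[i / 8] >> (i % 8)) & 1u;
        grad_in[i] = keep ? grad_out[i] * scale : 0.0f;
    }
}
```

Because the backward pass reads the stored mask rather than regenerating random numbers, it needs no RNG state, which matches the nvte_dropout_bwd signature.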

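The bit-packed mask representation (one bit per element) cuts mask memory by a factor of 32 relative to a float32 mask, and by a factor of 8 relative to a one-byte-per-element mask. The saving matters most in long-sequence training, where activation-sized buffers dominate memory use.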
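The arithmetic behind that saving can be made concrete with a small sketch. The helper names packed_mask_bytes and float_mask_bytes are hypothetical, and the sketch assumes the mask is padded only to a whole byte; the library may round up to a larger unit.

```c
#include <stddef.h>

/* Bytes needed for a mask that stores one bit per element,
 * rounded up to a whole byte (assumption: no further padding). */
static size_t packed_mask_bytes(size_t num_elements) {
    return (num_elements + 7) / 8;
}

/* Bytes needed if the mask were stored as one float per element. */
static size_t float_mask_bytes(size_t num_elements) {
    return num_elements * sizeof(float);
}
```

For a tensor of 1,048,576 elements, packed_mask_bytes returns 131072, while float_mask_bytes returns 4194304 on platforms where float is 4 bytes.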

Usage

Use these functions for standalone dropout operations. Attention dropout is typically handled inside the fused attention kernel instead, so it does not go through this API.

Code Reference

Source Location

Repository: NVIDIA/TransformerEngine
File: transformer_engine/common/include/transformer_engine/dropout.h
Lines: 1-51

Signature

void nvte_dropout_fwd(const NVTETensor input, NVTETensor output,
                      NVTETensor mask, NVTETensor rng_state,
                      float dropout_probability, cudaStream_t stream);

void nvte_dropout_bwd(const NVTETensor grad_output, const NVTETensor mask,
                      NVTETensor grad_input, float dropout_probability,
                      cudaStream_t stream);

Import

#include "transformer_engine/dropout.h"

I/O Contract

Inputs

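Name	Type	Required	Description
`input`	`NVTETensor`	Yes	Input tensor (forward only)
`rng_state`	`NVTETensor`	Yes	RNG engine state for reproducible masking (forward only)
`grad_output`	`NVTETensor`	Yes	Incoming gradient with respect to the dropout output (backward only)
`mask`	`NVTETensor`	Yes	Bit-packed dropout mask written by the forward pass (input to the backward pass)
`dropout_probability`	`float`	Yes	Probability of dropping each element (forward and backward)
`stream`	`cudaStream_t`	Yes	CUDA stream on which the operation is launched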

Outputs

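Name	Type	Description
`output`	`NVTETensor`	Scaled output with dropout applied (forward)
`mask`	`NVTETensor`	Bit-packed dropout mask, one bit per element (forward)
`grad_input`	`NVTETensor`	Gradient with respect to the dropout input (backward)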

Usage Examples

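#include "transformer_engine/dropout.h"

// input, output, mask, rng_state, grad_output, and grad_input are assumed to be
// valid NVTETensor handles created elsewhere; stream is an existing cudaStream_t.
const float dropout_probability = 0.1f;

// Forward: apply dropout and record the bit-packed mask
nvte_dropout_fwd(input, output, mask, rng_state, dropout_probability, stream);

// Backward: apply the stored mask to the gradients, using the same probability
nvte_dropout_bwd(grad_output, mask, grad_input, dropout_probability, stream);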

Related Pages

Environment:NVIDIA_TransformerEngine_CUDA_Toolkit_Requirements
