Implementation:NVIDIA TransformerEngine Ops UB Forward Linear

Field	Value
Sources	TransformerEngine
Domains	Deep_Learning, PyTorch, Distributed, Optimization
Last Updated	2026-02-07 14:00 GMT

Overview

Fused forward linear implementation that uses NVIDIA Userbuffers to overlap tensor-parallel communication with GEMM computation during the forward pass.

Description

UserbuffersForwardLinear is a fused operation that composes a BasicLinear with an optional Bias and an optional ReduceScatter. Its _functional_forward uses Userbuffers communicators, keyed by layer type (for example "qkv", "proj", "fc1", "fc2"), to overlap either an all-gather or a reduce-scatter with the forward GEMM; the kind of overlap is selected via CommOverlapType. It also handles input quantization, weight quantization, FP8 compute, and bias addition. The fuse_forward_ops method scans the operation list for the pattern BasicLinear, optional Bias, optional ReduceScatter, and applies the fusion only where Userbuffers options are configured.
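The pattern scan described above can be illustrated with a minimal, self-contained sketch. It is not the TransformerEngine implementation: the classes below are hypothetical stand-ins that only borrow the names from the source, FusedGroup and fuse_forward_ops_sketch are invented for illustration, and the "comm_name" key used in the usage note is an assumed option name.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-ins for the real op classes; only the names mirror the source.
@dataclass
class BasicLinear:
    userbuffers_options: Optional[dict] = None


@dataclass
class Bias:
    pass


@dataclass
class ReduceScatter:
    pass


@dataclass
class Activation:
    """Any op that is not part of the fusion pattern."""


@dataclass
class FusedGroup:
    """Hypothetical container standing in for the fused operation."""
    linear: BasicLinear
    bias: Optional[Bias] = None
    reduce_scatter: Optional[ReduceScatter] = None


def fuse_forward_ops_sketch(ops: list) -> list:
    """Greedy left-to-right scan; unmatched ops pass through unchanged."""
    out = []
    i = 0
    while i < len(ops):
        op = ops[i]
        # Only a linear op with Userbuffers options configured starts a match.
        if not (isinstance(op, BasicLinear) and op.userbuffers_options):
            out.append(op)
            i += 1
            continue
        group = FusedGroup(linear=op)
        i += 1
        # Optionally absorb a Bias that directly follows the linear op.
        if i < len(ops) and isinstance(ops[i], Bias):
            group.bias = ops[i]
            i += 1
        # Optionally absorb a ReduceScatter that follows.
        if i < len(ops) and isinstance(ops[i], ReduceScatter):
            group.reduce_scatter = ops[i]
            i += 1
        out.append(group)
    return out
```

For example, passing [BasicLinear({"comm_name": "qkv"}), Bias(), Activation()] to this sketch yields one FusedGroup followed by the untouched Activation, while a BasicLinear without Userbuffers options is left unfused.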

Usage

Use this fused operation in distributed training to overlap communication with compute in the forward pass, hiding tensor-parallel all-gather or reduce-scatter latency behind GEMM execution.
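Why an all-gather can be hidden behind a GEMM at all can be seen from a small pure-Python sketch. It involves no real communication and does not use Userbuffers; matmul and chunked_allgather_gemm are hypothetical helpers, and the shard layout (each rank owning a block of input rows) is an assumption made for illustration. The point is only that multiplying each shard separately gives the same rows as gathering first and multiplying once, so work on one shard can proceed while the next is still in flight.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def chunked_allgather_gemm(local_shards, weight_t):
    """Multiply each rank's input shard as it "arrives".

    local_shards[r] holds the input rows owned by rank r. In a real overlap,
    the transfer of the next shard would proceed while the current shard is
    being multiplied; this loop only shows that the per-chunk results
    concatenate to the same output as a single GEMM on the gathered input.
    """
    output = []
    for shard in local_shards:  # one chunk per tensor-parallel rank
        output.extend(matmul(shard, weight_t))
    return output


shards = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]  # 2 ranks, 2 rows each
weight_t = [[1, 0, 1], [0, 1, 1]]  # transposed weight, shape (2, 3)

# Reference path: gather all rows first, then run one GEMM.
gathered = [row for shard in shards for row in shard]
assert chunked_allgather_gemm(shards, weight_t) == matmul(gathered, weight_t)
```

The reduce-scatter case is analogous in spirit: partial outputs can be communicated chunk by chunk while later chunks are still being computed.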

Code Reference

Source Location

Repository: NVIDIA/TransformerEngine
File: transformer_engine/pytorch/ops/fused/userbuffers_forward_linear.py
Lines: 1-447

Signature

class UserbuffersForwardLinear(FusedOperation):
    def __init__(self, *, linear, bias=None, reduce_scatter=None): ...

    @staticmethod
    def _functional_forward(
        input, weight, ...
    ): ...

    @staticmethod
    def fuse_forward_ops(ops: list[tuple[FusibleOperation, ...]]) -> list: ...

Import

from transformer_engine.pytorch.ops.fused import UserbuffersForwardLinear

I/O Contract

Inputs

Name	Type	Required	Description
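input	`torch.Tensor`	Yes	Input tensor for the linear transformation (argument of `_functional_forward`)
weight	`torch.Tensor`	Yes	Weight parameter (argument of `_functional_forward`)
linear	`BasicLinear`	Yes	The basic linear operation to fuse (constructor argument)
bias	`Bias`	No	Optional bias operation (constructor argument)
reduce_scatter	`ReduceScatter`	No	Optional reduce-scatter operation for tensor parallelism (constructor argument)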

Outputs

Name	Type	Description
output	`torch.Tensor`	Result of fused linear + optional bias + optional communication

Usage Examples

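# UserbuffersForwardLinear is discovered automatically by the OperationFuser
# when Userbuffers are configured.
import torch
from transformer_engine.pytorch.ops import Sequential
from transformer_engine.pytorch.ops.basic import BasicLinear, Bias

# Illustrative sizes (assumed). The tensor-parallel and Userbuffers settings
# that the fusion requires are omitted here.
linear_op = BasicLinear(1024, 1024)
bias_op = Bias(1024)
input_tensor = torch.randn(32, 1024, device="cuda")

pipeline = Sequential(linear_op, bias_op)
# With Userbuffers configured, the fuser detects the pattern and applies the UB forward fusion
output = pipeline(input_tensor)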
