Implementation:NVIDIA TransformerEngine Ops UB Forward Linear
| Field | Value |
|---|---|
| Sources | TransformerEngine |
| Domains | Deep_Learning, PyTorch, Distributed, Optimization |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Fused forward linear implementation that uses NVIDIA Userbuffers to overlap tensor-parallel communication with GEMM computation during the forward pass.
Description
UserbuffersForwardLinear composes a BasicLinear operation with an optional Bias and an optional ReduceScatter into a single fused operation. Its _functional_forward method uses Userbuffers communicators, keyed by layer type (for example "qkv", "proj", "fc1", "fc2"), to overlap the tensor-parallel all-gather or reduce-scatter with the forward GEMM via CommOverlapType. The same method handles input quantization, weight quantization, FP8 compute, and bias addition. The fuse_forward_ops method scans the operation list for BasicLinear + optional Bias + optional ReduceScatter patterns in which Userbuffers options are configured.
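The pattern scan described above can be illustrated with a minimal sketch. This is not the TransformerEngine implementation: the classes below are hypothetical stand-ins that share only their names with the real operations, the `userbuffers_options` attribute and its `"comm_name"` key are assumptions made for illustration, and the operation list is simplified to a flat list rather than the `list[tuple[FusibleOperation, ...]]` taken by the real method.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-ins for the real operation classes. They carry only
# what this sketch needs and are NOT the TransformerEngine classes.
@dataclass
class BasicLinear:
    userbuffers_options: Optional[dict] = None  # assumed attribute name


@dataclass
class Bias:
    pass


@dataclass
class ReduceScatter:
    pass


@dataclass
class Activation:
    pass


@dataclass
class FusedForwardLinear:
    """Stand-in for the fused operation produced by the scan."""
    linear: BasicLinear
    bias: Optional[Bias] = None
    reduce_scatter: Optional[ReduceScatter] = None


def fuse_forward_ops_sketch(ops: list) -> list:
    """Replace BasicLinear [+ Bias] [+ ReduceScatter] windows with a fused op.

    Only linear ops that have Userbuffers options configured are fused;
    every other op is passed through unchanged.
    """
    out = []
    i = 0
    while i < len(ops):
        op = ops[i]
        if not (isinstance(op, BasicLinear) and op.userbuffers_options is not None):
            out.append(op)
            i += 1
            continue
        linear = op
        i += 1
        # Optionally absorb a Bias that directly follows the linear op.
        bias = None
        if i < len(ops) and isinstance(ops[i], Bias):
            bias = ops[i]
            i += 1
        # Optionally absorb a ReduceScatter that follows.
        reduce_scatter = None
        if i < len(ops) and isinstance(ops[i], ReduceScatter):
            reduce_scatter = ops[i]
            i += 1
        out.append(FusedForwardLinear(linear, bias, reduce_scatter))
    return out
```

In this sketch a BasicLinear without Userbuffers options is left untouched, so a following Bias also stays a separate operation.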
Usage
Use this operation to overlap communication with computation in the forward pass of distributed training. The latency of the tensor-parallel all-gather or reduce-scatter is hidden behind GEMM execution.
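Why an all-gather can be hidden behind the GEMM at all can be seen in a small, communication-free sketch. It assumes the input is sharded along its row dimension and uses the simplified convention `y = x @ W` on nested Python lists; the function names are hypothetical, and no Userbuffers or distributed calls are involved. The point is only that multiplying each shard as it arrives gives the same rows as gathering everything first, which is what lets a real implementation run the GEMM on one chunk while the next chunk is still in flight.

```python
def matmul(a, b):
    """Multiply an m x k matrix by a k x n matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gather_then_gemm(shards, weight):
    """Baseline: concatenate all row shards (the all-gather), then one GEMM."""
    full_input = [row for shard in shards for row in shard]
    return matmul(full_input, weight)


def chunked_gemm(shards, weight):
    """Pipelined form: one GEMM per shard, results concatenated in order.

    In a real overlapped implementation the transfer of the next shard
    would proceed while this shard's GEMM runs; here the loop is sequential.
    """
    out = []
    for shard in shards:
        out.extend(matmul(shard, weight))
    return out
```

Both functions return identical results for any row sharding, so splitting the GEMM by shard changes only the schedule, not the output.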
Code Reference
Source Location
- Repository: NVIDIA/TransformerEngine
- File: transformer_engine/pytorch/ops/fused/userbuffers_forward_linear.py
- Lines: 1-447
Signature
class UserbuffersForwardLinear(FusedOperation):
    def __init__(self, *, linear, bias=None, reduce_scatter=None): ...

    @staticmethod
    def _functional_forward(
        input, weight, ...
    ): ...

    @staticmethod
    def fuse_forward_ops(ops: list[tuple[FusibleOperation, ...]]) -> list: ...
Import
from transformer_engine.pytorch.ops.fused import UserbuffersForwardLinear
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input | torch.Tensor | Yes | Input tensor for linear transformation |
| weight | torch.Tensor | Yes | Weight parameter |
| linear | BasicLinear | Yes | The basic linear operation to fuse |
| bias | Bias | No | Optional bias operation |
| reduce_scatter | ReduceScatter | No | Optional reduce-scatter for TP |
Outputs
| Name | Type | Description |
|---|---|---|
| output | torch.Tensor | Result of fused linear + optional bias + optional communication |
Usage Examples
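# UserbuffersForwardLinear is discovered automatically by the OperationFuser
# when Userbuffers are configured.
from transformer_engine.pytorch.ops import Sequential
from transformer_engine.pytorch.ops.basic import BasicLinear, Bias

# linear_op (a BasicLinear configured with Userbuffers options), bias_op (a Bias),
# and input_tensor are placeholders assumed to be constructed elsewhere.
pipeline = Sequential(linear_op, bias_op)

# The fuser detects the pattern and applies the Userbuffers forward fusion.
output = pipeline(input_tensor)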