
Heuristic:ContextualAI HALOs TF32 Matmul Acceleration

From Leeroopedia




Knowledge Sources
Domains Optimization, GPU_Performance
Last Updated 2026-02-08 03:00 GMT

Overview

Enabling TensorFloat-32 (TF32) for CUDA matrix multiplications provides significant speedup on Ampere+ GPUs with negligible precision loss.

Description

Both `launch.py` and `train/trainers.py` set `torch.backends.cuda.matmul.allow_tf32 = True` at module import time. TF32 is a math mode available on NVIDIA Ampere (A100) and newer architectures that rounds FP32 matmul inputs to a 19-bit format (1 sign bit, 8 exponent bits, 10 mantissa bits) instead of computing in full 32-bit FP32. This provides up to 8x throughput improvement for matmul operations on tensor cores while maintaining sufficient precision for deep learning training. The setting is applied globally and affects all subsequent CUDA matrix multiplications.
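The precision cost of that 10-bit mantissa can be estimated without a GPU. The sketch below (ours, not from HALOs) simulates TF32 by truncating float32 mantissas from 23 to 10 bits and measures the resulting matmul error; real hardware rounds to nearest rather than truncating, so this slightly overstates the loss:

```python
import numpy as np

def simulate_tf32(x: np.ndarray) -> np.ndarray:
    # TF32 keeps FP32's 8-bit exponent but only 10 mantissa bits.
    # Simulate by clearing the low 13 mantissa bits of each float32
    # (truncation toward zero; hardware rounds to nearest).
    bits = x.astype(np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFFE000)).view(np.float32)

a = np.random.default_rng(0).standard_normal((64, 64)).astype(np.float32)
b = np.random.default_rng(1).standard_normal((64, 64)).astype(np.float32)

exact = a @ b
approx = simulate_tf32(a) @ simulate_tf32(b)
rel_err = np.abs(approx - exact).max() / np.abs(exact).max()
# rel_err lands around 1e-3, consistent with ~2^-10 input rounding.
```

Errors of this magnitude are well below the noise floor of stochastic gradient descent, which is why the trade is considered safe for training.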

Usage

This optimization is enabled unconditionally in HALOs: the flag is set at the top of both main entry points (`launch.py` and `trainers.py`), so no configuration is needed. It only takes effect on Ampere (A100), Ada Lovelace (L40, RTX 4090), and Hopper (H100) GPUs; on older architectures (V100, etc.), the flag is silently ignored.
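Because the flag is inert on pre-Ampere hardware, it can be useful to confirm whether it will actually change matmul behavior. A minimal check (the helper name is ours, not part of HALOs) relies on the compute-capability requirement of 8.0+ for TF32 tensor cores:

```python
import torch

def tf32_is_effective() -> bool:
    # TF32 tensor cores require compute capability 8.0+ (Ampere or
    # newer); without CUDA, or on older GPUs, the flag is harmless
    # but has no effect.
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    return major >= 8 and torch.backends.cuda.matmul.allow_tf32
```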

The Insight (Rule of Thumb)

  • Action: Set `torch.backends.cuda.matmul.allow_tf32 = True` before any CUDA operations.
  • Value: Up to 8x throughput for FP32 matmuls on Ampere+ GPUs.
  • Trade-off: TF32 uses 10 bits of mantissa (vs 23 for FP32), which is sufficient for deep learning but may cause slight numerical differences in very precision-sensitive operations.
  • Note: This is complementary to, not a replacement for, using `bfloat16` or `float16` dtypes. The default `policy_dtype` and `reference_dtype` in HALOs are already `bfloat16`.
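As a side note on the flag itself: recent PyTorch versions expose the same control through `torch.set_float32_matmul_precision`, which HALOs does not use but which maps "high" onto TF32 (or a comparable reduced-precision mode) where the hardware supports it:

```python
import torch

# "highest" = true FP32 matmuls; "high" permits TF32 on supporting
# hardware; "medium" additionally permits bfloat16-based reduction.
# In current PyTorch, "high" also flips the legacy allow_tf32 flag.
torch.set_float32_matmul_precision("high")
```

Either spelling is fine; the newer API just makes the intent ("trade FP32 matmul precision for speed") explicit rather than backend-specific.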

Reasoning

Large language model training is dominated by matrix multiplications (attention, linear layers). TF32 mode allows the GPU tensor cores to operate at higher throughput by using a reduced-precision format that is still accurate enough for gradient-based optimization. Since HALOs already uses `bfloat16` for model weights by default, the TF32 flag primarily accelerates any remaining FP32 operations (e.g., loss computation, value head in PPO which explicitly uses FP32).

Code Evidence

Global TF32 enablement in `launch.py:22-23`:

import torch
torch.backends.cuda.matmul.allow_tf32 = True

Identical enablement in `train/trainers.py:20-21`:

import torch
torch.backends.cuda.matmul.allow_tf32 = True

Value head forced to FP32 (benefits from TF32 speedup) in `train/models.py:233-234`:

def forward(self, hidden_states):
    # detach so that loss isn't backproped through LM
    # upcast since fp32 is important for good value predictions
    hidden_states = hidden_states.detach().to(torch.float32)
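The evidence snippet above is truncated at the upcast. A complete value head along these lines (a hypothetical reconstruction for illustration; the class and attribute names are ours, not the exact HALOs code) would look like:

```python
import torch
import torch.nn as nn

class ValueHead(nn.Module):
    # Hypothetical sketch of an FP32 value head for PPO.
    def __init__(self, hidden_size: int):
        super().__init__()
        # Keep the head's weights in FP32 for stable value estimates;
        # with allow_tf32=True, this matmul still runs on tensor cores.
        self.summary = nn.Linear(hidden_size, 1, dtype=torch.float32)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # detach so that the value loss isn't backpropped through the
        # LM; upcast since FP32 is important for good value predictions
        hidden_states = hidden_states.detach().to(torch.float32)
        return self.summary(hidden_states).squeeze(-1)

head = ValueHead(16)
out = head(torch.randn(2, 5, 16, dtype=torch.bfloat16))  # bf16 in, fp32 out
```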
