Implementation:Microsoft DeepSpeedExamples BingBert Timer

From Leeroopedia


Knowledge Sources
Domains: Performance Profiling, Distributed Training
Last Updated: 2026-02-07 12:00 GMT

Overview

A timing utility module providing GPU-synchronized wall clock timers and throughput profiling with samples-per-second and steps-per-second metrics.

Description

timer.py provides two timer classes for profiling distributed BERT training performance. The SynchronizedWallClockTimer class (borrowed from NVIDIA Megatron) manages a collection of named timers that call torch.cuda.synchronize() before starting and stopping. Because CUDA kernels are launched asynchronously, reading the wall clock without synchronizing would measure only the time to enqueue the work, not the time to execute it. Each timer tracks cumulative elapsed time and supports start, stop, and reset operations, plus an elapsed method that resets the timer after reading unless reset=False is passed.
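The behavior described above can be illustrated with a minimal sketch. This is not the code in timer.py: the names SketchTimer, SketchTimers, and _no_sync are hypothetical, and the synchronization call is injected as a parameter so the example runs without a GPU. In GPU training one would pass torch.cuda.synchronize as the sync argument.

```python
import time
from typing import Callable, Dict


def _no_sync() -> None:
    """Hypothetical placeholder; on a GPU, pass torch.cuda.synchronize instead."""


class SketchTimer:
    """Illustrative stand-in for a single named, synchronized timer."""

    def __init__(self, name: str, sync: Callable[[], None] = _no_sync) -> None:
        self.name = name
        self._sync = sync
        self._elapsed = 0.0      # cumulative seconds across start/stop pairs
        self._started = False
        self._start_time = 0.0

    def start(self) -> None:
        assert not self._started, "timer has already been started"
        self._sync()             # drain queued device work before reading the clock
        self._start_time = time.perf_counter()
        self._started = True

    def stop(self) -> None:
        assert self._started, "timer is not started"
        self._sync()             # wait for the timed work to actually finish
        self._elapsed += time.perf_counter() - self._start_time
        self._started = False

    def reset(self) -> None:
        self._elapsed = 0.0
        self._started = False

    def elapsed(self, reset: bool = True) -> float:
        """Return cumulative seconds; optionally clear the total afterwards."""
        was_running = self._started
        if was_running:
            self.stop()
        total = self._elapsed
        if reset:
            self.reset()
        if was_running:
            self.start()
        return total


class SketchTimers:
    """Illustrative collection that creates named timers on first access."""

    def __init__(self, sync: Callable[[], None] = _no_sync) -> None:
        self._sync = sync
        self._timers: Dict[str, SketchTimer] = {}

    def __call__(self, name: str) -> SketchTimer:
        if name not in self._timers:
            self._timers[name] = SketchTimer(name, self._sync)
        return self._timers[name]
```

Injecting the synchronization function is a choice made for this sketch only; it keeps the example testable on a CPU-only machine while preserving the start/stop/elapsed semantics described in the text.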

The ThroughputTimer class measures training throughput by tracking step durations and computing average samples per second and steps per second. Its configurable start_step parameter (2 by default) excludes the first warmup steps from the throughput calculation, so that initialization overhead does not skew the averages. The print_elapsed_time method periodically logs forward pass execution time and, when given an operation count, also computes TFlops.
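The warmup-skipping arithmetic can be sketched as a pure function over recorded step durations. This is an illustration under stated assumptions, not the code in timer.py: the function names are hypothetical, samples per step is assumed to be batch_size * num_workers, and the TFlops helper assumes num_ops is the number of floating point operations performed in the measured interval.

```python
from typing import Sequence, Tuple

SENTINEL = -999.0  # value the page reports when too few steps have run


def average_throughput(
    step_seconds: Sequence[float],
    batch_size: int = 1,
    num_workers: int = 1,
    start_step: int = 2,
) -> Tuple[float, float]:
    """Return (samples_per_sec, steps_per_sec), ignoring the first start_step steps."""
    counted = list(step_seconds[start_step:])  # drop warmup steps
    total = sum(counted)
    if not counted or total <= 0.0:
        return SENTINEL, SENTINEL
    steps_per_sec = len(counted) / total
    # Assumption: every worker processes batch_size samples per step.
    samples_per_sec = steps_per_sec * batch_size * num_workers
    return samples_per_sec, steps_per_sec


def tflops(num_ops: float, seconds: float) -> float:
    """Convert an operation count and a duration into TFlops (1e12 ops per second)."""
    return num_ops / seconds / 1e12


# Two slow warmup steps followed by four steady-state steps of 0.5 s each.
durations = [5.0, 3.0, 0.5, 0.5, 0.5, 0.5]
samples, steps = average_throughput(durations, batch_size=32, num_workers=8)
print(f"{steps:.1f} steps/sec, {samples:.1f} samples/sec")
```

In this example the two warmup steps (5.0 s and 3.0 s) are discarded, so the averages reflect only the four 0.5 s steps. Including them would have lowered the reported rate considerably.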

A helper function print_rank_0 ensures that log messages are printed only by rank 0 in a distributed training setup, preventing duplicate output from multiple processes. The SynchronizedWallClockTimer.log method formats multiple timer readings into a single log line with millisecond precision.
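Both helpers can be sketched in a few lines. The names current_rank, print_rank_0_sketch, and format_timer_log are hypothetical, and the exact log layout in timer.py may differ. The rank lookup uses torch.distributed when it is installed and initialized and otherwise falls back to the RANK environment variable, which is an assumption made so the example runs without a distributed launch.

```python
import os
from typing import Mapping


def current_rank() -> int:
    """Best-effort rank lookup: torch.distributed if initialized, else $RANK, else 0."""
    try:
        import torch.distributed as dist
        if dist.is_available() and dist.is_initialized():
            return dist.get_rank()
    except ImportError:
        pass  # torch is not installed; fall through to the environment variable
    return int(os.environ.get("RANK", "0"))


def print_rank_0_sketch(message: str) -> bool:
    """Print only on rank 0; return whether the message was printed."""
    if current_rank() == 0:
        print(message, flush=True)
        return True
    return False


def format_timer_log(elapsed_seconds: Mapping[str, float], normalizer: float = 1.0) -> str:
    """Join timer readings into one line, converted to milliseconds and divided by normalizer."""
    assert normalizer > 0.0, "normalizer must be positive"
    parts = ["time (ms)"]
    for name, seconds in elapsed_seconds.items():
        parts.append(f"{name}: {seconds * 1000.0 / normalizer:.2f}")
    return " | ".join(parts)


print_rank_0_sketch(format_timer_log({"forward": 0.5, "backward": 1.25}))
```

Passing a normalizer equal to the number of accumulated steps turns the cumulative readings into per-step averages, which is the typical reason for exposing it as a divisor.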

Usage

Use SynchronizedWallClockTimer for precise measurement of individual training phases (forward pass, backward pass, optimizer step) in GPU-accelerated training. Use ThroughputTimer for aggregate throughput measurements across the training run. Both are used in the Bing BERT training loop for performance monitoring.

Code Reference

Source Location

Signature

def print_rank_0(message):

class SynchronizedWallClockTimer:
    class Timer:
        def __init__(self, name):
        def start(self):
        def stop(self):
        def reset(self):
        def elapsed(self, reset=True):

    def __init__(self):
    def __call__(self, name):
    def log(self, names, normalizer=1.0, reset=True):

class ThroughputTimer(object):
    def __init__(self, name=None, batch_size=1, num_workers=1, start_step=2):
    def start(self, cond=True):
    def stop(self, cond=True):
    def avg_samples_per_sec(self):
    def avg_steps_per_sec(self):
    def print_elapsed_time(self, num_ops=None):

Import

from timer import SynchronizedWallClockTimer, ThroughputTimer, print_rank_0

I/O Contract

Inputs

Name | Type | Required | Description
name | str | Yes (Timer), No (ThroughputTimer) | Timer name for SynchronizedWallClockTimer.Timer; optional display name for ThroughputTimer. Default for ThroughputTimer: None
batch_size | int | No | Number of samples per batch for throughput calculation. Default: 1
num_workers | int | No | Number of distributed workers for throughput calculation. Default: 1
start_step | int | No | Number of initial warmup steps to skip in throughput measurement. Default: 2
normalizer | float | No | Divisor applied to timer values in SynchronizedWallClockTimer.log, for example to average over steps. Default: 1.0
num_ops | int | No | Number of floating point operations for TFlops computation in print_elapsed_time. Default: None

Outputs

Name | Type | Description
elapsed | float | Elapsed time in seconds from SynchronizedWallClockTimer.Timer.elapsed()
avg_samples_per_sec | float | Average training samples per second (returns -999 if insufficient steps)
avg_steps_per_sec | float | Average training steps per second (returns -999 if insufficient steps)

Usage Examples

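from timer import SynchronizedWallClockTimer, ThroughputTimer

# Placeholders: `model`, `input_batch`, and `num_steps` are defined by the
# surrounding training script and are not part of timer.py.

# GPU-synchronized timing of individual training phases
timers = SynchronizedWallClockTimer()
timers('forward').start()
loss = model(input_batch)  # assumes the model returns a scalar loss
timers('forward').stop()
timers('backward').start()
loss.backward()
timers('backward').stop()
# Log both readings on one line, in milliseconds
timers.log(['forward', 'backward'], normalizer=1.0)

# Throughput profiling across the training run
throughput = ThroughputTimer(name="training", batch_size=32, num_workers=8)
for step in range(num_steps):
    throughput.start()
    # ... training step ...
    throughput.stop()
# Returns -999 if too few steps have run past the warmup window
print(f"Throughput: {throughput.avg_samples_per_sec():.1f} samples/sec")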
