Implementation:Microsoft DeepSpeedExamples BingBert Timer

Knowledge Sources	Microsoft_DeepSpeedExamples
Domains	Performance Profiling, Distributed Training
Last Updated	2026-02-07 12:00 GMT

Overview

A timing utility module providing GPU-synchronized wall clock timers and throughput profiling with samples-per-second and steps-per-second metrics.

Description

timer.py provides two timer classes for profiling distributed BERT training. SynchronizedWallClockTimer (borrowed from NVIDIA Megatron) manages a collection of named timers, each retrieved by calling the instance with the timer's name. Every timer calls torch.cuda.synchronize() before it starts and before it stops, so that asynchronously queued GPU work is included in the measurement. A timer accumulates elapsed time across start/stop cycles and supports reset; its elapsed method returns the accumulated time and, by default (reset=True), clears it after reading.
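The accumulate-and-reset behaviour described above can be sketched without a GPU. This is a minimal illustration, not the code in timer.py: the class name SketchTimer and the injectable sync and clock hooks are assumptions added so the example runs anywhere. In GPU training, sync would be torch.cuda.synchronize.

```python
import time


class SketchTimer:
    """Illustrative cumulative timer; the name and hooks are hypothetical."""

    def __init__(self, name, sync=lambda: None, clock=time.time):
        self.name = name
        self._sync = sync      # e.g. torch.cuda.synchronize in GPU training
        self._clock = clock    # injectable so the sketch can be tested
        self._elapsed = 0.0
        self._started = False
        self._start_time = 0.0

    def start(self):
        assert not self._started, "timer has already been started"
        self._sync()           # wait for queued device work before reading the clock
        self._start_time = self._clock()
        self._started = True

    def stop(self):
        assert self._started, "timer is not started"
        self._sync()           # make sure the timed device work has finished
        self._elapsed += self._clock() - self._start_time
        self._started = False

    def elapsed(self, reset=True):
        """Return cumulative seconds; clear the total when reset is True.

        The sketch assumes the timer is stopped when it is read.
        """
        value = self._elapsed
        if reset:
            self._elapsed = 0.0
        return value
```

Without the synchronization call, stop() could read the clock while GPU kernels launched inside the timed region are still running, which would under-report the phase.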

The ThroughputTimer class measures training throughput by timing each step and computing average samples per second and steps per second. Its start_step parameter (default 2) excludes initial warmup steps from the throughput calculation, so that initialization overhead does not skew the averages. The print_elapsed_time method periodically logs forward-pass execution time and, when given an operation count (num_ops), also reports TFlops.
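The throughput arithmetic can be sketched as follows. The function names are hypothetical, and two points are assumptions rather than facts taken from timer.py: that samples per step equal batch_size * num_workers, and that warmup is handled by dropping the first warmup_steps durations.

```python
def samples_per_sec(step_durations, batch_size, num_workers, warmup_steps=2):
    """Average samples/sec over the steps after warmup; None if none remain."""
    measured = step_durations[warmup_steps:]   # assumed warmup handling
    if not measured:
        return None
    avg_step_time = sum(measured) / len(measured)
    # Assumption: every worker processes batch_size samples per step.
    return batch_size * num_workers / avg_step_time


def tflops(num_ops, elapsed_seconds):
    """Convert an operation count and a duration in seconds into TFLOP/s."""
    return num_ops / (elapsed_seconds * 1e12)


# Two slow warmup steps followed by two steady-state steps of 0.5 s each.
durations = [5.0, 3.0, 0.5, 0.5]
print(samples_per_sec(durations, batch_size=32, num_workers=8))
```

Excluding the slow first steps matters because a single multi-second initialization step can dominate the average over a short run.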

A helper function print_rank_0 ensures that log messages are printed only by rank 0 in a distributed training setup, preventing duplicate output from multiple processes. The SynchronizedWallClockTimer.log method formats multiple timer readings into a single log line with millisecond precision.
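The two ideas in this paragraph, rank-gated printing and a single millisecond log line, can be sketched as below. This is not the timer.py implementation: the sketch reads the rank from the RANK environment variable (set by launchers such as torchrun) instead of querying torch.distributed, so that it runs without PyTorch. The function names and the exact line layout are assumptions.

```python
import os


def is_rank_0():
    """True when the RANK environment variable is unset or equal to 0."""
    return int(os.environ.get("RANK", "0")) == 0


def format_timer_line(readings, normalizer=1.0):
    """Join (name, seconds) pairs into one line of millisecond values."""
    assert normalizer > 0.0
    parts = ["time (ms)"]
    for name, seconds in readings:
        # Convert to milliseconds, then divide by the normalizer.
        parts.append("{}: {:.2f}".format(name, seconds * 1000.0 / normalizer))
    return " | ".join(parts)


def log_rank_0(readings, normalizer=1.0):
    """Print the formatted line on rank 0 only."""
    if is_rank_0():
        print(format_timer_line(readings, normalizer))
```

Passing the number of steps since the last log call as normalizer turns accumulated totals into per-step averages.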

Usage

Use SynchronizedWallClockTimer for precise measurement of individual training phases (forward pass, backward pass, optimizer step) in GPU-accelerated training. Use ThroughputTimer for aggregate throughput measurements across the training run. Both are used in the Bing BERT training loop for performance monitoring.

Code Reference

Source Location

Repository: Microsoft_DeepSpeedExamples
File: training/bing_bert/timer.py
Lines: 1-129

Signature

def print_rank_0(message):

class SynchronizedWallClockTimer:
    class Timer:
        def __init__(self, name):
        def start(self):
        def stop(self):
        def reset(self):
        def elapsed(self, reset=True):

    def __init__(self):
    def __call__(self, name):
    def log(self, names, normalizer=1.0, reset=True):

class ThroughputTimer(object):
    def __init__(self, name=None, batch_size=1, num_workers=1, start_step=2):
    def start(self, cond=True):
    def stop(self, cond=True):
    def avg_samples_per_sec(self):
    def avg_steps_per_sec(self):
    def print_elapsed_time(self, num_ops=None):

Import

from timer import SynchronizedWallClockTimer, ThroughputTimer, print_rank_0

I/O Contract

Inputs

Name	Type	Required	Description
name	str	Yes (Timer); No (ThroughputTimer)	Timer name for SynchronizedWallClockTimer.Timer; display name for ThroughputTimer, where it defaults to None
batch_size	int	No	Number of samples per batch for throughput calculation. Default: 1
num_workers	int	No	Number of distributed workers for throughput calculation. Default: 1
start_step	int	No	Number of initial warmup steps to skip in throughput measurement. Default: 2
normalizer	float	No	Divisor for averaging timer values in SynchronizedWallClockTimer.log. Default: 1.0
num_ops	int	No	Number of floating point operations for TFlops computation in print_elapsed_time. Default: None

Outputs

Name	Type	Description
elapsed	float	Elapsed time in seconds from SynchronizedWallClockTimer.Timer.elapsed()
avg_samples_per_sec	float	Average training samples per second (returns -999 if insufficient steps)
avg_steps_per_sec	float	Average training steps per second (returns -999 if insufficient steps)
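Because -999 is a sentinel rather than a measurement, callers may want to filter it out before logging or plotting. A minimal sketch, with a hypothetical helper name and output format:

```python
SENTINEL = -999  # value listed in the table above for "insufficient steps"


def describe_throughput(value, unit):
    """Render a throughput reading, treating the sentinel as 'not available'."""
    if value == SENTINEL:
        return "{}: n/a (not enough steps yet)".format(unit)
    return "{}: {:.1f}".format(unit, value)


print(describe_throughput(-999, "samples/sec"))
print(describe_throughput(512.04, "samples/sec"))
```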

Usage Examples

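from timer import SynchronizedWallClockTimer, ThroughputTimer

# Placeholders assumed to be defined by the caller: model, input_batch,
# loss_fn, targets, num_steps. A CUDA device is required because the
# timers call torch.cuda.synchronize().

# GPU-synchronized timing
timers = SynchronizedWallClockTimer()
timers('forward').start()
output = model(input_batch)
loss = loss_fn(output, targets)
timers('forward').stop()
timers('backward').start()
loss.backward()
timers('backward').stop()
# Logs both readings on a single line, in milliseconds
timers.log(['forward', 'backward'], normalizer=1.0)

# Throughput profiling
throughput = ThroughputTimer(name="training", batch_size=32, num_workers=8)
for step in range(num_steps):
    throughput.start()
    # ... training step ...
    throughput.stop()
# avg_samples_per_sec() returns -999 if too few steps have been recorded
print(f"Throughput: {throughput.avg_samples_per_sec():.1f} samples/sec")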
