Implementation:Microsoft DeepSpeedExamples BingBert Timer
| Knowledge Sources | |
|---|---|
| Domains | Performance Profiling, Distributed Training |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
A timing utility module providing GPU-synchronized wall clock timers and throughput profiling with samples-per-second and steps-per-second metrics.
Description
timer.py provides two timer classes for profiling distributed BERT training performance. The SynchronizedWallClockTimer class (borrowed from NVIDIA Megatron) manages a collection of named timers. Each timer calls torch.cuda.synchronize() before starting and before stopping, so that asynchronously queued GPU work is included in the measured interval. Each timer accumulates elapsed time across start/stop cycles and supports reset; its elapsed method returns the accumulated time and, by default (reset=True), resets the timer after reading.
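The synchronize-then-read-the-clock pattern described above can be illustrated with a minimal sketch. This is not the code in timer.py: the class name SketchTimer and the pluggable sync hook are assumptions made so the example runs without a GPU. On a CUDA machine the hook would typically be torch.cuda.synchronize.

```python
import time


class SketchTimer:
    """Hypothetical cumulative timer that calls a sync hook before reading the clock."""

    def __init__(self, name, sync=lambda: None):
        self.name = name
        self.sync = sync      # assumption: e.g. torch.cuda.synchronize on a GPU machine
        self.total = 0.0      # cumulative elapsed seconds across start/stop cycles
        self.running = False
        self.t0 = 0.0

    def start(self):
        assert not self.running, "timer has already been started"
        self.sync()           # wait for previously queued device work
        self.t0 = time.time()
        self.running = True

    def stop(self):
        assert self.running, "timer is not started"
        self.sync()           # make sure the timed work has actually finished
        self.total += time.time() - self.t0
        self.running = False

    def reset(self):
        self.total = 0.0
        self.running = False

    def elapsed(self, reset=True):
        # Return the accumulated time; clear it afterwards unless reset=False.
        value = self.total
        if reset:
            self.reset()
        return value
```

Without the synchronization calls, a wall clock reading taken right after launching GPU kernels would mostly measure launch overhead, because CUDA kernels execute asynchronously with respect to the host.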
The ThroughputTimer class measures training throughput by tracking step durations and computing average samples per second and steps per second. Its configurable start_step parameter (default: 2) excludes the initial warmup steps from throughput calculations, so that initialization overhead does not skew the averages. The print_elapsed_time method periodically logs forward pass execution time and, when given an operation count, also reports TFlops.
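The warmup skipping and the averages described above can be sketched as follows. This is a simplified illustration, not the class in timer.py: SketchThroughput, record, and tflops are hypothetical names, step durations are passed in explicitly to keep the example deterministic, and the sketch assumes that samples per step equals batch_size * num_workers and that exactly the first start_step steps are skipped.

```python
class SketchThroughput:
    """Hypothetical throughput tracker that ignores warmup steps."""

    def __init__(self, batch_size=1, num_workers=1, start_step=2):
        self.batch_size = batch_size
        self.num_workers = num_workers
        self.start_step = start_step
        self.count = 0          # all steps seen, including warmup
        self.timed_steps = 0    # steps that contribute to the averages
        self.total_time = 0.0   # seconds spent in the timed steps

    def record(self, step_seconds):
        self.count += 1
        if self.count > self.start_step:   # skip the first start_step steps
            self.timed_steps += 1
            self.total_time += step_seconds

    def avg_steps_per_sec(self):
        if self.timed_steps == 0 or self.total_time <= 0:
            return -999                    # sentinel: not enough data yet
        return self.timed_steps / self.total_time

    def avg_samples_per_sec(self):
        steps = self.avg_steps_per_sec()
        if steps == -999:
            return -999
        # Assumption: each step processes batch_size samples on every worker.
        return steps * self.batch_size * self.num_workers


def tflops(num_ops, seconds):
    """Convert an operation count and a duration into TFlops (1e12 ops/s)."""
    return num_ops / (seconds * 1e12)
```

With batch_size=32, num_workers=8, and two timed steps of 0.5 s each, this sketch reports 2 steps per second and 512 samples per second; the two slow warmup steps do not affect the result.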
A helper function print_rank_0 ensures that log messages are printed only by rank 0 in a distributed training setup, preventing duplicate output from multiple processes. The SynchronizedWallClockTimer.log method formats multiple timer readings into a single log line with millisecond precision.
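A rough sketch of rank-0-only logging and of formatting several timer readings into one line is shown below. All names here are hypothetical, and the exact layout of the log line is an assumption rather than the format produced by timer.py. To stay runnable without PyTorch, the sketch reads the rank from the RANK environment variable (set by launchers such as torchrun) instead of querying torch.distributed.

```python
import os


def is_rank_zero():
    # Assumption: the launcher exports RANK; an unset variable means one process.
    return int(os.environ.get("RANK", "0")) == 0


def format_timer_line(readings_sec, normalizer=1.0):
    """Format {name: seconds} readings as milliseconds, divided by normalizer."""
    assert normalizer > 0.0
    parts = ["time (ms)"]
    for name, seconds in readings_sec.items():
        parts.append("{}: {:.2f}".format(name, seconds * 1000.0 / normalizer))
    return " | ".join(parts)


def log_on_rank_zero(message):
    # Only one process prints, so the log is not duplicated per worker.
    if is_rank_zero():
        print(message)
```

Passing the number of accumulated iterations as normalizer turns cumulative readings into per-iteration averages, which matches the role of the normalizer argument described in the I/O contract.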
Usage
Use SynchronizedWallClockTimer for precise measurement of individual training phases (forward pass, backward pass, optimizer step) in GPU-accelerated training. Use ThroughputTimer for aggregate throughput measurements across the training run. Both are used in the Bing BERT training loop for performance monitoring.
Code Reference
Source Location
- Repository: Microsoft_DeepSpeedExamples
- File: training/bing_bert/timer.py
- Lines: 1-129
Signature
def print_rank_0(message):

class SynchronizedWallClockTimer:
    class Timer:
        def __init__(self, name):
        def start(self):
        def stop(self):
        def reset(self):
        def elapsed(self, reset=True):
    def __init__(self):
    def __call__(self, name):
    def log(self, names, normalizer=1.0, reset=True):

class ThroughputTimer(object):
    def __init__(self, name=None, batch_size=1, num_workers=1, start_step=2):
    def start(self, cond=True):
    def stop(self, cond=True):
    def avg_samples_per_sec(self):
    def avg_steps_per_sec(self):
    def print_elapsed_time(self, num_ops=None):
Import
from timer import SynchronizedWallClockTimer, ThroughputTimer, print_rank_0
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| name | str | Yes (Timer); No (ThroughputTimer) | Timer name for SynchronizedWallClockTimer.Timer; display name for ThroughputTimer. Default for ThroughputTimer: None |
| batch_size | int | No | Number of samples per batch for throughput calculation. Default: 1 |
| num_workers | int | No | Number of distributed workers for throughput calculation. Default: 1 |
| start_step | int | No | Number of initial warmup steps to skip in throughput measurement. Default: 2 |
| normalizer | float | No | Divisor for averaging timer values in SynchronizedWallClockTimer.log. Default: 1.0 |
| num_ops | int | No | Number of floating point operations for TFlops computation in print_elapsed_time. Default: None |
Outputs
| Name | Type | Description |
|---|---|---|
| elapsed | float | Elapsed time in seconds from SynchronizedWallClockTimer.Timer.elapsed() |
| avg_samples_per_sec | float | Average training samples per second (returns -999 if insufficient steps) |
| avg_steps_per_sec | float | Average training steps per second (returns -999 if insufficient steps) |
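Because the throughput methods return the sentinel -999 instead of raising when too few steps have been recorded, callers may want to check for it before logging. A minimal sketch, using the hypothetical names NOT_READY and describe_throughput:

```python
NOT_READY = -999  # sentinel value documented in the Outputs table


def describe_throughput(samples_per_sec):
    """Return a log-friendly string, hiding the sentinel during warmup."""
    if samples_per_sec == NOT_READY:
        return "Throughput: n/a (still warming up)"
    return f"Throughput: {samples_per_sec:.1f} samples/sec"
```

This keeps a misleading negative throughput out of dashboards and logs during the first few steps.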
Usage Examples
from timer import SynchronizedWallClockTimer, ThroughputTimer

# model, input_batch, loss, and num_steps are placeholders supplied by the training script.

# GPU-synchronized timing
timers = SynchronizedWallClockTimer()
timers('forward').start()
output = model(input_batch)
timers('forward').stop()
timers('backward').start()
loss.backward()
timers('backward').stop()
# Log both timers on one line; readings reset by default after logging
timers.log(['forward', 'backward'], normalizer=1.0)

# Throughput profiling
throughput = ThroughputTimer(name="training", batch_size=32, num_workers=8)
for step in range(num_steps):
    throughput.start()
    # ... training step ...
    throughput.stop()
# avg_samples_per_sec() returns -999 if too few steps have been recorded
print(f"Throughput: {throughput.avg_samples_per_sec():.1f} samples/sec")