Implementation:FMInference FlexLLMGen DeepSpeed Runtime Utils

Field	Value
Sources	Repo: FlexLLMGen, Upstream: DeepSpeed
Domains	Runtime_Infrastructure, Distributed_Training
Last Updated	2026-02-09 00:00 GMT

Overview

Vendored DeepSpeed utility module providing helper functions for gradient operations, memory monitoring, tensor parallelism, random seed management, and distributed communication used throughout the runtime.

Description

The utils.py file (1018 lines) is a vendored copy of DeepSpeed's runtime utility collection, containing helper functions sourced from NVIDIA's Megatron-LM and extended for DeepSpeed's needs.

Key components include:

Gradient utilities:
- clip_grad_norm_ -- Clips gradient global norm across all parameters, handling model-parallel and expert-parallel parameter groups separately.
- get_global_norm_of_tensors -- Computes the global L2 norm across a list of tensors, with all-reduce for distributed computation.
- clip_tensors_by_global_norm -- Scales tensors to enforce a maximum global norm.
- get_grad_norm -- Computes gradient norm with special handling for model-parallel parameters (avoiding double-counting across ranks).
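
The clipping rule behind these helpers can be sketched without any framework. The following is a minimal illustration, not the vendored code: tensors are modelled as plain Python lists, and the names global_l2_norm, clip_by_global_norm, and the eps guard are hypothetical. In a distributed run, the per-rank sums of squares would additionally be combined with an all-reduce before the square root is taken.

```python
import math


def global_l2_norm(tensors):
    """L2 norm over every element of every tensor (each tensor is a flat list)."""
    return math.sqrt(sum(x * x for t in tensors for x in t))


def clip_by_global_norm(tensors, max_norm, eps=1e-6):
    """Scale all tensors by one shared coefficient if the global norm is too large.

    Returns (tensors, total_norm). `eps` is an assumed guard against
    division by zero, not a value taken from the vendored module.
    """
    total_norm = global_l2_norm(tensors)
    clip_coef = max_norm / (total_norm + eps)
    if clip_coef < 1.0:
        # Every tensor gets the same factor, so gradient direction is preserved.
        tensors = [[x * clip_coef for x in t] for t in tensors]
    return tensors, total_norm


grads = [[3.0, 0.0], [0.0, 4.0]]  # global L2 norm is 5.0
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
```

Because one coefficient is applied to all tensors, the relative sizes of the gradients are unchanged; only the overall magnitude is capped.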

Memory monitoring:
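- see_memory_usage -- Logs GPU memory allocated and cached, with their peak values, plus CPU virtual memory usage via psutil. In upstream DeepSpeed the function returns immediately unless force=True, runs Python garbage collection before reporting, and resets the CUDA peak-memory counters afterwards.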

Parallelism helpers:
- is_model_parallel_parameter -- Checks if a parameter is marked for tensor model parallelism.
- bwc_tensor_model_parallel_rank -- Backwards-compatible query for tensor model parallel rank, supporting both old and new Megatron API conventions.
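- align_dense_tensors -- Pads a list of tensors so that the total element count is a multiple of a caller-supplied alignment (for example, one chosen for NCCL), which keeps flattened buffers evenly divisible for communication.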
- all_gather_dp_groups -- All-gathers partitioned tensors across data-parallel groups.
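
The backwards-compatible rank query can be pictured as a capability check on the mpu object. The sketch below is illustrative only: NewStyleMpu and OldStyleMpu are hypothetical stand-ins for the two Megatron API conventions, and returning 0 when no mpu is given is an assumption about the intended fallback.

```python
class NewStyleMpu:
    """Hypothetical mpu exposing the newer tensor-model-parallel accessor."""

    def get_tensor_model_parallel_rank(self):
        return 3


class OldStyleMpu:
    """Hypothetical mpu exposing only the older model-parallel accessor."""

    def get_model_parallel_rank(self):
        return 1


def tensor_model_parallel_rank(mpu=None):
    """Return the tensor model parallel rank under either naming convention."""
    if mpu is None:
        # Assumption: without an mpu there is no model parallelism, so rank 0.
        return 0
    if hasattr(mpu, "get_tensor_model_parallel_rank"):
        return mpu.get_tensor_model_parallel_rank()
    # Fall back to the older accessor name.
    return mpu.get_model_parallel_rank()
```

Checking for the newer method first means callers never need to know which Megatron version supplied the mpu.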

General utilities:
- DummyOptim -- A dummy optimizer that presents model parameters as a param group, used for ZeRO-3 without an optimizer.
- set_random_seed -- Sets seeds for Python's random, numpy, and torch PRNGs.
- ensure_directory_exists -- Creates directory paths for checkpoint saving.
- get_ma_status -- Returns the amount of GPU memory currently allocated ("MA"), as reported by torch.cuda.memory_allocated.
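
Seeding all three PRNGs with one call might look roughly like the sketch below. It is not the vendored implementation: the name set_random_seed_sketch is hypothetical, and the guarded imports are an addition so the example still runs where numpy or torch is not installed.

```python
import random


def set_random_seed_sketch(seed):
    """Seed Python's random module, plus numpy and torch when available."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # numpy is optional in this sketch
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass  # torch is optional in this sketch


set_random_seed_sketch(1234)
first = random.random()
set_random_seed_sketch(1234)
second = random.random()  # same value as `first`, since the seed was reset
```

Reseeding with the same value reproduces the same sequence, which is what makes runs repeatable across restarts.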

Usage

These utilities are called throughout the DeepSpeed runtime by the engine, optimizers, and ZeRO implementations. They are not typically called directly by users. This module is part of the vendored benchmark dependencies in FlexLLMGen.

Code Reference

Field	Value
Repository	FlexLLMGen
File	benchmark/third_party/DeepSpeed/deepspeed/runtime/utils.py
Lines	1-1018
Type	AUTO_KEEP (vendored dependency)

Key function signatures:

class DummyOptim():
    def __init__(self, params):
        self.param_groups = [{'params': params}]

def see_memory_usage(message, force=False):
    ...

def is_model_parallel_parameter(p) -> bool:
    ...

def bwc_tensor_model_parallel_rank(mpu=None):
    ...

def clip_grad_norm_(parameters, max_norm, norm_type=2, mpu=None):
    ...

def get_global_norm_of_tensors(input_tensors, norm_type=2, mpu=None):
    ...
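
As a small illustration of why DummyOptim exposes param_groups, the sketch below repeats the class from the listing above and walks it the way code written against an optimizer-like interface would. The count_params helper and the placeholder parameter names are hypothetical; real callers would pass model parameters.

```python
class DummyOptim:
    """Optimizer stand-in that only exposes parameters as a single group."""

    def __init__(self, params):
        self.param_groups = [{'params': params}]


def count_params(optimizer):
    """Hypothetical helper: count parameters by walking param_groups."""
    return sum(len(group['params']) for group in optimizer.param_groups)


# Placeholder strings stand in for real parameter tensors.
optim = DummyOptim(["weight", "bias"])
n_params = count_params(optim)
```

Code that iterates param_groups works unchanged whether it receives a real optimizer or this placeholder.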

I/O Contract

Inputs

Parameter	Type	Required	Description
parameters	Iterable[Tensor]	Yes	Model parameters for gradient operations
max_norm	float	Yes	Maximum gradient norm for clipping
norm_type	float or int	No	Order p of the norm used for gradient computation; may be inf for the infinity norm (default: 2)
mpu	object	No	Model parallel unit for distributed norm computation

Outputs

Output	Type	Description
total_norm	float	Computed global gradient norm
clipped_grads	Tensor	Gradients scaled to enforce the maximum norm constraint
memory_info	log output	GPU/CPU memory usage statistics
