Implementation:InternLM Lmdeploy AnomalyHandler

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Debugging
Last Updated: 2026-02-07 15:00 GMT

Overview

Singleton utility for detecting and optionally fixing NaN/Inf anomalies in GPU tensors during inference, with configurable severity levels and summarization.

Description

AnomalyHandler is a singleton that detects NaN/Inf values at runtime, for debugging numerical issues in transformer inference. Init() takes the rank, the vocabulary size, a fallback token ID, the maximum batch size, and a CUDA stream. CountAndFix() scans a typed GPU buffer for NaN or Inf values and optionally replaces them; each call is tagged with a string key and gated by a severity level. FixLogits() is a specialized variant for logit tensors. Summarize() passes the accumulated anomaly counts to a callback, and Reset() clears the accumulated state. The static level() method returns the current anomaly detection level, which is controlled from outside the class. The convenience macros TM_DEBUG_RAW and TM_DEBUG_TENSOR wrap these checks and skip them, with zero overhead, when the current level is below the threshold given at the call site. The handler supports up to 65536 entries (max_entries).
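
To make the count-and-fix idea concrete, here is a minimal host-side sketch. It is not the lmdeploy GPU kernel: the function name count_and_fix_host, the fix flag, and the choice of 0 as the replacement value are all assumptions made for illustration, since this page does not say what value the real handler writes.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical CPU analogue of a count-and-fix pass (illustration only).
// Returns how many elements are NaN or +/-Inf; when `fix` is true, each
// such element is overwritten in place with `replacement`.
template<class T>
int64_t count_and_fix_host(T* data, int64_t size, bool fix, T replacement = T(0))
{
    int64_t anomalies = 0;
    for (int64_t i = 0; i < size; ++i) {
        // std::isfinite is false for NaN, +Inf and -Inf
        if (!std::isfinite(data[i])) {
            ++anomalies;
            if (fix) {
                data[i] = replacement;  // assumed replacement policy
            }
        }
    }
    return anomalies;
}
```

Calling it once with fix set to false only counts; calling it with fix set to true also repairs the buffer, so a second pass over the same data reports zero anomalies.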

Usage

Use this handler during development or debugging to detect where NaN/Inf values first appear in the inference pipeline. Enable it by setting the anomaly detection level; at level 0 it is a no-op.
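
The level gating described above can be sketched as follows. This is an illustrative pattern, not the definition of TM_DEBUG_RAW or TM_DEBUG_TENSOR: the names debug_level_ref and DEBUG_CHECK are hypothetical, and how the real level is configured is not documented on this page.

```cpp
// Hypothetical process-wide level, defaulting to 0 (all checks disabled).
// The real handler exposes its level through AnomalyHandler::level().
inline int& debug_level_ref() noexcept
{
    static int level = 0;
    return level;
}

// Hypothetical gate: the wrapped statement runs only when the configured
// level is at least `lvl`. When it is lower, the only work done is one
// integer comparison and the statement is never evaluated.
#define DEBUG_CHECK(lvl, ...)                \
    do {                                     \
        if (debug_level_ref() >= (lvl)) {    \
            __VA_ARGS__;                     \
        }                                    \
    } while (0)
```

With the level left at 0, every DEBUG_CHECK call with a threshold of 1 or higher is skipped; raising the level to 2 enables checks tagged 1 and 2 while checks tagged 3 stay off.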

Code Reference

Source Location

Signature

class AnomalyHandler {
public:
    static constexpr size_t max_entries = 65536;

    static AnomalyHandler& instance();
    static int level() noexcept;

    void Init(int rank, int vocab_size, int fallback, int max_batch_size,
              cudaStream_t stream) noexcept;

    template<class T>
    void CountAndFix(T* data, int64_t size, std::string key, int level);

    template<class T>
    void FixLogits(T* logits, int batch_size, int level);

    void Summarize(std::function<void(const int*, int)> handler);
    void Reset();
};

// Convenience free functions
template<class T>
void count_and_fix(T* data, size_t size, std::string key, int level);

void DebugTensor(Tensor& tensor, const std::string& key, int level);
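
The singleton access and the record/summarize/reset cycle implied by this interface can be sketched with a small host-only class. Everything here is hypothetical: AnomalyCounter, Record, and the (key, count) callback shape are stand-ins chosen for readability, whereas the real Summarize() hands its callback a const int* and an int.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <string>

// Hypothetical stand-in that keeps per-key anomaly counts on the host.
class AnomalyCounter {
public:
    // Meyers singleton: constructed on first use; initialization of a
    // function-local static is thread-safe since C++11.
    static AnomalyCounter& instance()
    {
        static AnomalyCounter inst;
        return inst;
    }

    // Accumulate `n` anomalies under `key`; zero counts are not stored.
    void Record(const std::string& key, int64_t n)
    {
        if (n > 0) {
            counts_[key] += n;
        }
    }

    // Invoke `handler` once per key that has a non-zero count.
    void Summarize(const std::function<void(const std::string&, int64_t)>& handler) const
    {
        for (const auto& kv : counts_) {
            handler(kv.first, kv.second);
        }
    }

    // Drop all accumulated counts.
    void Reset() { counts_.clear(); }

private:
    AnomalyCounter() = default;
    AnomalyCounter(const AnomalyCounter&) = delete;
    AnomalyCounter& operator=(const AnomalyCounter&) = delete;

    std::map<std::string, int64_t> counts_;
};
```

Keying the counts by string mirrors the key argument of CountAndFix(), so a summary can point at the first check point where anomalies appeared.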

Import

#include "src/turbomind/utils/anomaly_handler.h"

I/O Contract

Inputs

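Name | Type | Required | Description
data | T* | Yes | GPU buffer to scan for anomalies
size | int64_t | Yes | Number of elements in the buffer
key | std::string | Yes | Identifier for this check point (e.g., layer name)
level | int | Yes | Severity level threshold for this check
rank | int | Yes (Init) | GPU/process rank for multi-GPU setups
vocab_size | int | Yes (Init) | Vocabulary size
fallback | int | Yes (Init) | Fallback token ID
max_batch_size | int | Yes (Init) | Maximum batch size
stream | cudaStream_t | Yes (Init) | CUDA stream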

Outputs

Name | Type | Description
data | T* | Input buffer with anomalous values optionally replaced (in-place)
Summarize callback | std::function<void(const int*, int)> | Receives anomaly counts for reporting

Usage Examples

using namespace turbomind;

// Initialize once
AnomalyHandler::instance().Init(rank, vocab_size, eos_id, max_batch, stream);

// Check hidden states after each layer
TM_DEBUG_TENSOR(hidden_states, "layer_3_output", 1);

// Check raw pointer
TM_DEBUG_RAW(logits_ptr, batch_size * vocab_size, "final_logits", 2);

// Summarize at end of request
AnomalyHandler::instance().Summarize([](const int* counts, int n) {
    // report anomaly counts
});
