Implementation:InternLM Lmdeploy Core FloatingPoint

From Leeroopedia


Knowledge Sources
Domains GPU_Kernels, Quantization
Last Updated 2026-02-07 15:00 GMT

Overview

Compile-time floating-point introspection and conversion utilities for custom low-bit float formats (e.g., E2M1, E3M2, E2M3).

Description

The FloatingPoint<E, M> template struct provides compile-time constants and device-side conversion functions for floating-point formats parameterized by the number of exponent bits (E) and mantissa bits (M). The format properties are computed at compile time: exponent_bias, max_normal, min_normal, max_denormal, min_denormal, and the associated bit masks. Two device functions handle conversion. from_f32() converts an IEEE 754 single-precision float to the custom format, using either round-to-nearest-even (RNE) or stochastic rounding depending on the type of its rbits argument. to_f32() converts a packed value back to float, using the PTX mul.f32 instruction to avoid flush-to-zero. Static assertions verify correctness for the E2M1, E3M2, and E2M3 formats.
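The compile-time properties named above can be illustrated with a small host-side sketch. It is not the library's code: the struct name FormatSketch, the helper exp2i, and the formulas themselves are assumptions. In particular it assumes a bias of 2^(E-1) - 1 and that no exponent code is reserved for Inf or NaN, which is the convention commonly used for FP4 and FP6 formats.

```cpp
// Hypothetical host-side mirror of the compile-time constants (C++14 or later).
// Assumptions: bias = 2^(E-1) - 1, and every exponent code encodes a finite value.

// Integer power of two as a constexpr float (std::pow is not constexpr).
constexpr float exp2i(int n)
{
    float r = 1.0f;
    for (int i = 0; i < n; ++i) r *= 2.0f;
    for (int i = 0; i > n; --i) r *= 0.5f;
    return r;
}

template<int E, int M>
struct FormatSketch {
    static constexpr unsigned exponent_bits = E;
    static constexpr unsigned mantissa_bits = M;
    static constexpr unsigned exponent_bias = (1u << (E - 1)) - 1;
    static constexpr unsigned exponent_mask = (1u << E) - 1;
    static constexpr unsigned mantissa_mask = (1u << M) - 1;

    // Unbiased exponents of the largest and smallest normal binades.
    static constexpr int max_exponent = int(exponent_mask) - int(exponent_bias);
    static constexpr int min_exponent = 1 - int(exponent_bias);

    static constexpr float max_normal   = exp2i(max_exponent) * (2.0f - exp2i(-M));
    static constexpr float min_normal   = exp2i(min_exponent);
    static constexpr float max_denormal = exp2i(min_exponent) * (1.0f - exp2i(-M));
    static constexpr float min_denormal = exp2i(min_exponent - M);
};

// Compile-time checks in the spirit of the static assertions described above.
static_assert(FormatSketch<2, 1>::max_normal == 6.0f, "E2M1 max_normal");
static_assert(FormatSketch<3, 2>::max_normal == 28.0f, "E3M2 max_normal");
static_assert(FormatSketch<2, 3>::max_normal == 7.5f, "E2M3 max_normal");
```

Under these assumptions the E2M1 format has a bias of 1 and a largest finite value of 6, while E3M2 has a bias of 3 and a largest finite value of 28.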

Usage

Use this utility when implementing quantization or dequantization kernels that operate on custom low-bit floating-point representations such as FP4 (E2M1), FP6 (E3M2, E2M3), or FP8 formats.

Code Reference

Source Location

Signature

// Summary of the public interface; initializers and function bodies are omitted.
template<int E, int M>
struct FloatingPoint {
    static constexpr unsigned exponent_bits;
    static constexpr unsigned mantissa_bits;
    static constexpr unsigned exponent_bias;
    static constexpr float max_normal;
    static constexpr float min_normal;

    // R selects the rounding mode: unsigned for stochastic rounding, any other type for RNE
    template<class R>
    __device__ static unsigned from_f32(float x, R rbits);

    __device__ static float to_f32(unsigned x);
};

Import

#include "src/turbomind/kernels/core/floating_point.h"

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| E | int | Yes | Number of exponent bits in the target format |
| M | int | Yes | Number of mantissa bits in the target format |
| x (from_f32) | float | Yes | IEEE 754 float value to convert |
| rbits | unsigned or std::nullptr_t | Yes | Random bits for stochastic rounding (unsigned) or tag for RNE (non-unsigned) |
| x (to_f32) | unsigned | Yes | Packed custom-format value to convert to float |

Outputs

| Name | Type | Description |
|------|------|-------------|
| from_f32 return | unsigned | Packed representation in the custom float format |
| to_f32 return | float | IEEE 754 float equivalent |
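To make the to_f32 contract concrete, here is a host-side sketch of one common way to decode a packed value with a single multiplication. It is an illustration, not the library's device code: the names bits_to_float and decode_sketch are hypothetical, and the layout (sign bit above E exponent bits above M mantissa bits, bias 2^(E-1) - 1) is an assumption. The custom exponent and mantissa fields are placed directly into the float32 fields, and the result is multiplied by 2^(127 - bias) to correct the exponent bias.

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret a 32-bit pattern as a float without violating aliasing rules.
inline float bits_to_float(std::uint32_t bits)
{
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Hypothetical reference decoder. Assumed layout: [sign | E exponent bits | M mantissa bits].
template<int E, int M>
float decode_sketch(unsigned x)
{
    constexpr int bias = (1 << (E - 1)) - 1;  // assumed exponent bias

    const std::uint32_t sign = (x >> (E + M)) & 1u;
    const std::uint32_t mag  = x & ((1u << (E + M)) - 1u);

    // Align the custom mantissa with the top of the 23-bit float32 mantissa, so the
    // custom exponent lands in the low bits of the float32 exponent field.
    const float unscaled = bits_to_float((sign << 31) | (mag << (23 - M)));

    // 2^(127 - bias): moves the value from float32's bias (127) to the custom bias.
    const float scale = bits_to_float(std::uint32_t(254 - bias) << 23);

    // When the custom exponent field is zero, `unscaled` is a float32 denormal, and
    // the multiplication turns it into the correct custom denormal value.
    return unscaled * scale;
}
```

This also suggests why flush-to-zero matters here: inputs with a zero exponent field appear as float32 denormals before scaling, so a multiply that flushes denormal inputs to zero would decode every custom denormal as 0. On the host, with default floating-point settings, the plain multiplication is sufficient.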

Usage Examples

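#include <curand_kernel.h>
#include "src/turbomind/kernels/core/floating_point.h"

using FP4 = turbomind::FloatingPoint<2, 1>;  // E2M1 format

// from_f32() and to_f32() are __device__ functions, so they must be called from device code.
// The kernel name and parameters are illustrative; state is assumed initialized via curand_init.
__global__ void fp4_example(float* out, curandState* state)
{
    // Convert float to FP4 with round-to-nearest-even (a non-unsigned rbits selects RNE)
    unsigned packed = FP4::from_f32(3.5f, 0);

    // Convert FP4 back to float
    out[0] = FP4::to_f32(packed);

    // Stochastic rounding (an unsigned rbits supplies the random bits)
    unsigned rbits     = curand(state);
    unsigned packed_sr = FP4::from_f32(3.5f, rbits);
    out[1]             = FP4::to_f32(packed_sr);
}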
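The difference between the two rounding modes can be checked on the host with a reference encoder. The sketch below is not the library's implementation: encode_sketch and round_units are hypothetical names, and it assumes the same bit layout and bias as the standard low-bit formats, saturation to the largest finite value, and the default round-to-nearest floating-point environment. It mirrors the documented dispatch rule: an unsigned rbits selects stochastic rounding, any other argument type selects round-to-nearest-even.

```cpp
#include <cmath>

// Stochastic rounding: add a uniform offset in [0, 1) derived from rbits, then floor.
inline double round_units(double q, unsigned rbits)
{
    return std::floor(q + rbits / 4294967296.0);
}

// Any non-unsigned tag (for example the int literal 0) selects round-to-nearest-even.
// std::nearbyint rounds ties to even under the default rounding mode.
template<class Tag>
double round_units(double q, Tag)
{
    return std::nearbyint(q);
}

// Hypothetical reference encoder. Assumed layout: [sign | E exponent bits | M mantissa bits].
template<int E, int M, class R>
unsigned encode_sketch(float x, R rbits)
{
    constexpr int bias    = (1 << (E - 1)) - 1;     // assumed exponent bias
    constexpr int min_exp = 1 - bias;               // exponent of the smallest normal
    constexpr int max_exp = (1 << E) - 1 - bias;    // assumes no Inf/NaN encodings
    const float max_normal = std::ldexp(2.0f - std::ldexp(1.0f, -M), max_exp);

    const unsigned sign = std::signbit(x) ? 1u : 0u;
    const float    a    = std::fmin(std::fabs(x), max_normal);  // assumed: saturate

    // Exponent of the binade containing a; denormals share the smallest normal's exponent.
    const int e = (a >= std::ldexp(1.0f, min_exp)) ? std::ilogb(a) : min_exp;

    // Magnitude in units of the local spacing 2^(e - M), then rounded to an integer.
    const double q = std::ldexp(a, M - e);
    const double r = round_units(q, rbits);

    // A single addition covers denormals, normals, and carries into the next exponent.
    const unsigned mag = (unsigned(e - min_exp) << M) + unsigned(r);
    return (sign << (E + M)) | mag;
}
```

For E2M1, 3.5 lies exactly halfway between the representable values 3 and 4. With these assumptions, round-to-nearest-even picks 4 (code 0b0110, even mantissa), while stochastic rounding picks 3 or 4 depending on whether the random offset is below or at least one half.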
