Implementation:InternLM Lmdeploy Core FloatingPoint
| Knowledge Sources | |
|---|---|
| Domains | GPU_Kernels, Quantization |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
Compile-time floating-point introspection and conversion utilities for custom low-bit float formats (e.g., E2M1, E3M2, E2M3).
Description
The FloatingPoint<E, M> template struct provides compile-time constants and device-side conversion functions for floating-point formats parameterized by the number of exponent bits (E) and mantissa bits (M). The format properties are computed at compile time: exponent_bias, max_normal, min_normal, max_denormal, min_denormal, and the bit masks. Two device functions handle conversion. from_f32() converts an IEEE 754 single-precision float to the custom format, using either round-to-nearest-even (RNE) or stochastic rounding. to_f32() converts back to float and performs the multiplication with PTX mul.f32 so that the result is not flushed to zero. Static assertions verify correctness for the E2M1, E3M2, and E2M3 formats.
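As an illustration of how such constants can be derived from E and M alone, here is a minimal host-side sketch. `FormatInfo` and `pow2` are hypothetical names, not part of lmdeploy, and the formulas assume that no exponent code is reserved for inf/NaN (true of the FP4 and FP6 microscaling formats, but not of every FP8 variant). The actual header may compute these values differently.

```cpp
// Hypothetical helper (not the lmdeploy struct): format constants for an
// E-exponent-bit, M-mantissa-bit float, ASSUMING no inf/NaN encodings.

// 2^n as a float, written recursively so it is usable in C++11 constexpr.
constexpr float pow2(int n)
{
    return n == 0 ? 1.0f : (n > 0 ? 2.0f * pow2(n - 1) : 0.5f * pow2(n + 1));
}

template<int E, int M>
struct FormatInfo {
    static constexpr int exponent_bias = (1 << (E - 1)) - 1;
    static constexpr int max_exponent  = (1 << E) - 1 - exponent_bias;  // largest unbiased exponent
    static constexpr int min_exponent  = 1 - exponent_bias;             // smallest normal exponent

    static constexpr float max_normal   = pow2(max_exponent) * (2.0f - pow2(-M));
    static constexpr float min_normal   = pow2(min_exponent);
    static constexpr float max_denormal = pow2(min_exponent) * (1.0f - pow2(-M));
    static constexpr float min_denormal = pow2(min_exponent - M);

    // Assumed bit layout: [sign | exponent | mantissa]
    static constexpr unsigned mantissa_mask = (1u << M) - 1u;
    static constexpr unsigned exponent_mask = ((1u << E) - 1u) << M;
    static constexpr unsigned sign_mask     = 1u << (E + M);
};

// Compile-time checks in the spirit of the static assertions described above.
static_assert(FormatInfo<2, 1>::exponent_bias == 1, "E2M1 bias");
static_assert(FormatInfo<2, 1>::max_normal == 6.0f, "E2M1 max normal");
static_assert(FormatInfo<3, 2>::max_normal == 28.0f, "E3M2 max normal");
static_assert(FormatInfo<2, 3>::max_normal == 7.5f, "E2M3 max normal");
```

Under these assumptions E2M1 spans magnitudes from 0.5 (its only nonzero denormal) up to 6.0.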
Usage
Use this utility when implementing quantization or dequantization kernels that operate on custom low-bit floating-point representations such as FP4 (E2M1), FP6 (E3M2, E2M3), or FP8 formats.
Code Reference
Source Location
- Repository: InternLM_Lmdeploy
- File: src/turbomind/kernels/core/floating_point.h
Signature
```cpp
template<int E, int M>
struct FloatingPoint {
    static constexpr unsigned exponent_bits;
    static constexpr unsigned mantissa_bits;
    static constexpr unsigned exponent_bias;
    static constexpr float    max_normal;
    static constexpr float    min_normal;

    template<class R>
    __device__ static unsigned from_f32(float x, R rbits);

    __device__ static float to_f32(unsigned x);
};
```
Import
```cpp
#include "src/turbomind/kernels/core/floating_point.h"
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| E | int | Yes | Number of exponent bits in the target format |
| M | int | Yes | Number of mantissa bits in the target format |
| x (from_f32) | float | Yes | IEEE 754 float value to convert |
| rbits | unsigned, or a non-unsigned tag type (e.g. int, std::nullptr_t) | Yes | Random bits for stochastic rounding when the type is unsigned; any other type acts as a tag that selects round-to-nearest-even (RNE) |
| x (to_f32) | unsigned | Yes | Packed custom-format value to convert to float |
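To make the two rounding modes and the type-based selection concrete, the following is a host-side reference sketch. `encode_reference` and `round_up` are hypothetical names; the bit layout (sign above exponent above mantissa), the saturation to the largest finite value, and the absence of inf/NaN codes are assumptions rather than facts about the lmdeploy device code.

```cpp
#include <cmath>

// Hypothetical host-side reference, NOT the lmdeploy device implementation.
// Assumed layout: [sign | E exponent bits | M mantissa bits], no inf/NaN codes.

// An `unsigned` last argument carries random bits: stochastic rounding.
inline bool round_up(double frac, unsigned /*n*/, unsigned rbits)
{
    // Round up with probability ~frac by comparing to a uniform draw in [0, 1).
    return rbits / 4294967296.0 < frac;
}

// Any other argument type is a tag: round-to-nearest-even.
template<class R>
bool round_up(double frac, unsigned n, R /*tag*/)
{
    return frac > 0.5 || (frac == 0.5 && (n & 1u));
}

template<int E, int M, class R>
unsigned encode_reference(float x, R rbits)
{
    const int   bias       = (1 << (E - 1)) - 1;
    const int   emin       = 1 - bias;               // exponent of the smallest normal
    const int   emax       = (1 << E) - 1 - bias;    // exponent of the largest normal
    const float max_normal = std::ldexp(2.0f - std::ldexp(1.0f, -M), emax);

    // Saturate the magnitude (fmin also maps NaN to max_normal in this sketch).
    const float a = std::fmin(std::fabs(x), max_normal);

    // Unbiased exponent of the grid `a` lies on; everything below the
    // smallest normal shares the subnormal spacing.
    int e = emin;
    if (a >= std::ldexp(1.0f, emin)) {
        int ex = 0;
        (void)std::frexp(a, &ex);  // a = m * 2^ex with m in [0.5, 1)
        e = ex - 1;
    }

    // Express `a` in units of the grid spacing 2^(e - M), then round.
    const double scaled = std::ldexp(static_cast<double>(a), M - e);
    unsigned     n      = static_cast<unsigned>(scaled);  // floor, since scaled >= 0
    const double frac   = scaled - n;
    if (round_up(frac, n, rbits)) {
        ++n;  // a carry into the next exponent is absorbed by the addition below
    }

    const unsigned magnitude = (static_cast<unsigned>(e - emin) << M) + n;
    const unsigned sign      = std::signbit(x) ? 1u : 0u;
    return (sign << (E + M)) | magnitude;
}
```

For example, `encode_reference<2, 1>(3.5f, 0)` takes the RNE path because `0` is an `int`, while passing an `unsigned` value takes the stochastic path. Selecting the mode through overloads keeps the sketch valid in C++11; the real header may use a different mechanism.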
Outputs
| Name | Type | Description |
|---|---|---|
| from_f32 return | unsigned | Packed representation in the custom float format |
| to_f32 return | float | IEEE 754 float equivalent |
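One way to decode such a value with a single multiplication, sketched on the host: shift the exponent and mantissa fields into their float32 positions, then multiply by a power of two that corrects the exponent bias. Custom-format denormals become float32 denormals before the multiply, which is why a device version must not flush denormal inputs to zero. `decode_reference` is a hypothetical name, and the bit layout and the absence of inf/NaN codes are assumptions; the actual to_f32() may differ in detail.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical host-side reference decoder, NOT the lmdeploy device code.
// Assumed layout: [sign | E exponent bits | M mantissa bits], no inf/NaN codes.
template<int E, int M>
float decode_reference(unsigned code)
{
    const int bias = (1 << (E - 1)) - 1;

    const std::uint32_t sign = (code >> (E + M)) & 1u;
    const std::uint32_t mag  = code & ((1u << (E + M)) - 1u);

    // Align the custom exponent/mantissa fields with the float32 fields.
    // A zero custom exponent yields a float32 denormal here.
    const std::uint32_t bits = (sign << 31) | (mag << (23 - M));
    float f;
    std::memcpy(&f, &bits, sizeof f);

    // Rebias factor 2^(127 - bias), built by writing the exponent field directly.
    const std::uint32_t scale_bits = static_cast<std::uint32_t>(254 - bias) << 23;
    float scale;
    std::memcpy(&scale, &scale_bits, sizeof scale);

    // On the host this is exact unless flush-to-zero / denormals-are-zero
    // modes are enabled (for example through fast-math flags).
    return f * scale;
}
```

With this sketch, the E2M1 code `0b0001` passes through a float32 denormal and decodes to 0.5, and `0b0111` decodes to 6.0.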
Usage Examples
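```cpp
// Both conversions are __device__ functions, so call them from device code.
using FP4 = turbomind::FloatingPoint<2, 1>;  // E2M1 format

// Convert float to FP4 with round-to-nearest-even (a non-unsigned argument selects RNE)
unsigned packed = FP4::from_f32(3.5f, 0);

// Convert FP4 back to float
float val = FP4::to_f32(packed);

// Stochastic rounding: `state` is assumed to be an initialized curandState
unsigned rbits = curand(&state);
unsigned packed_sr = FP4::from_f32(3.5f, rbits);
```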