Implementation:InternLM Lmdeploy RotaryEmbedding

Knowledge Sources	InternLM_Lmdeploy
Domains	GPU_Kernels, Attention
Last Updated	2026-02-07 15:00 GMT

Overview

Device-side rotary positional embedding (RoPE) implementations supporting default, linear, dynamic, YaRN, Llama3, and multi-rope (M-RoPE) variants, plus log-n attention scaling.

Description

This header provides multiple RoPE implementations for use within attention kernels:

FastRoPE<N>: The primary RoPE class used by AttentionUniversal::Prologue. Supports all RoPE variants via a runtime type switch. The init method precomputes inverse frequencies based on the variant (default, YaRN with linear ramp interpolation, Llama3 with smooth interpolation), and apply rotates a vector fragment using sincos. For M-RoPE (multi-modal), apply_mrope looks up per-dimension position IDs from a table.
RotaryEmbedding<N>: A simpler alternative that precomputes cos/sin coefficients from base frequency and dimension index. Uses __noinline__ on the coefficient computation to ensure consistent results.
RoPE<N, C>: A minimal variant that stores precomputed inverse frequencies and applies rotation.
LogNScaling: Implements log-n attention scaling that scales Q by log2(seq_len) / log2(max_position_embeddings) when sequence length exceeds the training length.
ApplyRotaryEmbedding: A standalone device function for applying RoPE to 4-element vectors.

Usage

FastRoPE is instantiated per-thread in the attention kernel prologue to apply positional encoding to Q and K vectors before they enter the attention mainloop.

Code Reference

Source Location

Repository: InternLM_Lmdeploy
File: src/turbomind/kernels/attention/rotary_embedding.h
Lines: 1-282

Signature

namespace turbomind {

template<int N>
struct FastRoPE {
    __device__ FastRoPE(const RopeKernelParam& param, int batch_idx, std::integral_constant<int, N>);
    __device__ void init(int idx);
    template<typename T>
    __device__ void apply(Array<T, N>& x, float timestep);
    template<typename T>
    __device__ void apply_mrope(Array<T, N>& x, float timestep);
};

template<int N>
struct RotaryEmbedding {
    __device__ RotaryEmbedding(float base, int dims, int timestep, int2 offset);
    static __device__ __noinline__ float2 get_coefficient(int idx, int dims, float base, int timestep);
    template<typename T>
    __device__ void apply(Array<T, N>& x);
};

template<int N, int C = 8>
struct RoPE {
    __device__ RoPE(float idx, float base, float dims);
    template<class T>
    __device__ void apply(Array<T, N * 2>& x, float timestep);
};

struct LogNScaling {
    __device__ LogNScaling(int seq_len, int max_position_embeddings);
    template<typename T, int N>
    __device__ void apply(Array<T, N>& x) const;
};

template<class C, class T>
__device__ void ApplyRotaryEmbedding(Array<T, 4>& x, float base, int dims, int ti, int di);

} // namespace turbomind

Import

#include "src/turbomind/kernels/attention/rotary_embedding.h"

I/O Contract

Inputs

Name	Type	Required	Description
param	RopeKernelParam	Yes	RoPE configuration (type, base, dim, scaling factors)
batch_idx	int	Yes	Batch index (for dynamic base or M-RoPE position lookup)
idx	int	Yes	Dimension index within the head
timestep	float	Yes	Token position for rotation angle computation
x	Array<T, N>&	Yes	Vector fragment to rotate (modified in-place)

Outputs

Name	Type	Description
x (modified)	Array<T, N>&	Rotated vector fragment

Usage Examples

// In attention prologue:
FastRoPE rope(params.rope_param, batch_idx, std::integral_constant<int, 8>{});
rope.init(dim_offset);
rope.apply(vec_Q, timestep);
rope.apply(vec_K, timestep);

// Log-n scaling:
LogNScaling logn(seq_pos, max_pos_embeddings);
logn.apply(vec_Q);

Related Pages

Environment:InternLM_Lmdeploy_CUDA_GPU_Runtime

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment