Implementation: Sgl_project_Sglang CPU RoPE
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, CPU Kernels |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Implements CPU-optimized Rotary Position Embedding (RoPE) kernels supporting multiple tensor layouts (3D, 4D) and embedding styles (standard interleaved and NeoX-style split-half) for position encoding in transformer models.
Description
This file provides three internal kernel implementations:
- rotary_embedding_3D_kernel_impl -- Handles the standard 3D tensor layout [num_tokens, num_heads, head_size], applying cosine-sine rotation pairs element-wise to query and key tensors. It parallelizes across num_tokens * num_heads and looks up position-dependent cos/sin values from a precomputed cos_sin_cache. Uses scalar element-by-element rotation: out1 = in1 * cos - in2 * sin, out2 = in2 * cos + in1 * sin.
- rotary_embedding_neox_4D_kernel_impl -- Handles the 4D layout [batch, seq_len, num_heads, head_size] with NeoX-style embedding where the rotary dimension is split into two halves (first half and second half) rather than interleaved pairs. Uses SIMD vectorization with at::vec::Vectorized for the cos/sin multiply-add operations, processing bVecSize elements at a time with float32 intermediate computation.
- rotary_embedding_4D_kernel_impl -- Standard interleaved rotation for 4D input tensors.
All three kernels support separate rotary dimensions for query and key, enabling architectures like DeepSeek MLA where key_rotary_dim differs from query_rotary_dim. The single public API function rotary_embedding_cpu dispatches to the appropriate kernel based on input dimensionality and the is_neox flag.
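The interleaved rotation formula quoted above (out1 = in1 * cos - in2 * sin, out2 = in2 * cos + in1 * sin) can be sketched as a small scalar routine. This is an illustrative sketch, not the actual sgl-kernel code; the function name and the cache-row layout (rotary_dim/2 cos values followed by rotary_dim/2 sin values) are assumptions for the example.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical scalar sketch of the interleaved rotation described above.
// `head` points at one attention head's values; `cos_sin` points at the
// cached row for this token's position, assumed to hold rotary_dim/2 cos
// values followed by rotary_dim/2 sin values.
void rotate_interleaved(float* head, const float* cos_sin, int rotary_dim) {
  const int embed_dim = rotary_dim / 2;
  const float* cos_ptr = cos_sin;
  const float* sin_ptr = cos_sin + embed_dim;
  for (int i = 0; i < embed_dim; ++i) {
    // Interleaved style: adjacent elements (2i, 2i+1) form a rotation pair.
    float in1 = head[2 * i];
    float in2 = head[2 * i + 1];
    head[2 * i]     = in1 * cos_ptr[i] - in2 * sin_ptr[i];
    head[2 * i + 1] = in2 * cos_ptr[i] + in1 * sin_ptr[i];
  }
}
```

Each pair is a plain 2D rotation, so the transform preserves the norm of every pair; elements beyond rotary_dim are left untouched, matching the partial-rotary case.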
Usage
Use these kernels for all RoPE operations during CPU LLM inference. RoPE is used in virtually all modern LLM architectures (LLaMA, Mistral, Qwen, DeepSeek). The 3D layout is used during serving (token-by-token decoding), while the 4D layout is used for batched prefill. The NeoX variant is needed for GPT-NeoX-derived architectures.
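The NeoX variant differs only in how elements are paired: element i in the first half of the rotary dimension rotates with element i + rotary_dim/2 in the second half, instead of adjacent pairs. A hedged scalar sketch (illustrative names; same assumed cos/sin cache-row layout as above, not the vectorized sgl-kernel code):

```cpp
#include <cassert>
#include <cmath>

// Hypothetical sketch of the NeoX-style split-half rotation: the i-th
// element of the first half pairs with the i-th element of the second half.
void rotate_neox(float* head, const float* cos_sin, int rotary_dim) {
  const int embed_dim = rotary_dim / 2;
  const float* cos_ptr = cos_sin;
  const float* sin_ptr = cos_sin + embed_dim;
  for (int i = 0; i < embed_dim; ++i) {
    float in1 = head[i];              // first half
    float in2 = head[i + embed_dim];  // second half
    head[i]             = in1 * cos_ptr[i] - in2 * sin_ptr[i];
    head[i + embed_dim] = in2 * cos_ptr[i] + in1 * sin_ptr[i];
  }
}
```

The rotation math is identical to the interleaved case; only the memory access pattern changes, which is why the 4D NeoX kernel can vectorize contiguous half-rows with at::vec::Vectorized.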
Code Reference
Source Location
- Repository: Sgl_project_Sglang
- File: sgl-kernel/csrc/cpu/rope.cpp
- Lines: 1-387
Signature
// Public API
std::tuple<at::Tensor, at::Tensor> rotary_embedding_cpu(
at::Tensor& positions,
at::Tensor& query,
at::Tensor& key,
int64_t head_size,
at::Tensor& cos_sin_cache,
bool is_neox);
// Internal kernels
template <typename scalar_t>
void rotary_embedding_3D_kernel_impl(
scalar_t* __restrict__ query_out,
scalar_t* __restrict__ key_out,
int64_t* __restrict__ positions,
scalar_t* __restrict__ query,
scalar_t* __restrict__ key,
scalar_t* __restrict__ cos_sin_cache,
int64_t num_tokens, int64_t num_heads,
int64_t num_kv_heads, int64_t head_size,
int64_t rotary_dim,
int64_t query_stride_s, int64_t query_out_stride_s,
int64_t key_out_stride_s, int64_t key_stride_s,
int64_t query_stride_h, int64_t query_out_stride_h);
template <typename scalar_t>
void rotary_embedding_neox_4D_kernel_impl(
int64_t* __restrict__ positions,
scalar_t* __restrict__ query,
scalar_t* __restrict__ key,
scalar_t* __restrict__ cos_sin_cache,
int64_t rotary_dim,
int64_t query_stride_b, int64_t query_stride_s,
int64_t query_stride_h,
int64_t key_stride_b, int64_t key_stride_s,
int64_t key_stride_h,
int64_t num_heads, int64_t num_kv_heads,
int64_t head_size,
int64_t batch_size, int64_t seq_len);
Import
#include "common.h"
#include "vec.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| positions | at::Tensor [num_tokens] | Yes | Token position indices for RoPE lookup (int64); flattened to [batch * seq_len] for 4D query/key layouts |
| query | at::Tensor | Yes | Query tensor in 2D [num_tokens, num_heads*head_size], 3D [num_tokens, num_heads, head_size], or 4D [batch, seq_len, num_heads, head_size] |
| key | at::Tensor | Yes | Key tensor matching query layout |
| head_size | int64_t | Yes | Size of each attention head |
| cos_sin_cache | at::Tensor [max_pos, rotary_dim] | Yes | Precomputed cosine and sine values indexed by position |
| is_neox | bool | Yes | Whether to use NeoX-style (split-half) rotation instead of interleaved |
Outputs
| Name | Type | Description |
|---|---|---|
| query_out | at::Tensor | Query with rotary position embedding applied (same shape as input query) |
| key_out | at::Tensor | Key with rotary position embedding applied (same shape as input key) |
Usage Examples
// Standard RoPE for 3D tensors (serving mode)
auto [query_out, key_out] = rotary_embedding_cpu(
positions, // [num_tokens] int64
query, // [num_tokens, num_heads, head_size]
key, // [num_tokens, num_kv_heads, head_size]
/*head_size=*/128,
cos_sin_cache, // [max_pos, rotary_dim]
/*is_neox=*/false);
// NeoX-style RoPE for 4D tensors (batch prefill)
auto [q_out, k_out] = rotary_embedding_cpu(
positions, // [batch * seq_len] int64
query, // [batch, seq_len, num_heads, head_size]
key, // [batch, seq_len, num_kv_heads, head_size]
/*head_size=*/128,
cos_sin_cache, // [max_pos, rotary_dim]
/*is_neox=*/true);
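The cos_sin_cache passed in both examples is precomputed once per model. A hedged sketch of a common construction (vLLM-style convention, assumed here rather than taken from this file): row p holds cos(p * inv_freq_i) for each frequency, followed by the matching sin values, with inv_freq_i = base^(-2i / rotary_dim).

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical builder for a [max_pos, rotary_dim] cos/sin cache.
// Row p: rotary_dim/2 cosine values, then rotary_dim/2 sine values
// (assumed layout; the kernel indexes this cache by positions[token]).
std::vector<float> build_cos_sin_cache(int max_pos, int rotary_dim,
                                       double base = 10000.0) {
  const int half = rotary_dim / 2;
  std::vector<float> cache(static_cast<size_t>(max_pos) * rotary_dim);
  for (int p = 0; p < max_pos; ++p) {
    for (int i = 0; i < half; ++i) {
      double inv_freq = std::pow(base, -2.0 * i / rotary_dim);
      cache[p * rotary_dim + i]        = std::cos(p * inv_freq);  // cos half
      cache[p * rotary_dim + half + i] = std::sin(p * inv_freq);  // sin half
    }
  }
  return cache;
}
```

Caching the values this way turns the per-token trigonometry into a single table lookup on positions, which is the point of the cos_sin_cache argument in the public API.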