
Implementation:Sgl project Sglang CPU RoPE

From Leeroopedia


Knowledge Sources
Domains Machine Learning, CPU Kernels
Last Updated 2026-02-10 00:00 GMT

Overview

Implements CPU-optimized Rotary Position Embedding (RoPE) kernels supporting multiple tensor layouts (3D, 4D) and embedding styles (standard interleaved and NeoX-style split-half) for position encoding in transformer models.

Description

This file provides three internal kernel implementations:

  • rotary_embedding_3D_kernel_impl -- Handles the standard 3D tensor layout [num_tokens, num_heads, head_size], applying cosine-sine rotation pairs element-wise to query and key tensors. It parallelizes across num_tokens * num_heads and looks up position-dependent cos/sin values from a precomputed cos_sin_cache. Uses scalar element-by-element rotation: out1 = in1 * cos - in2 * sin, out2 = in2 * cos + in1 * sin.
  • rotary_embedding_neox_4D_kernel_impl -- Handles the 4D layout [batch, seq_len, num_heads, head_size] with NeoX-style embedding where the rotary dimension is split into two halves (first half and second half) rather than interleaved pairs. Uses SIMD vectorization with at::vec::Vectorized for the cos/sin multiply-add operations, processing bVecSize elements at a time with float32 intermediate computation.
  • rotary_embedding_4D_kernel_impl -- Standard interleaved rotation for 4D input tensors.

These kernels support separate rotary dimensions for query and key, enabling architectures like DeepSeek MLA where key_rotary_dim differs from query_rotary_dim. The single public API function rotary_embedding_cpu dispatches to the appropriate kernel based on input dimensionality and the is_neox flag.
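The dispatch decision can be summarized as follows. This is a hedged sketch of the selection logic only; `pick_kernel` and the `Kernel` enum are hypothetical names, and the real rotary_embedding_cpu also validates shapes and reshapes 2D inputs before dispatching.

```cpp
#include <cstdint>

// Illustrative sketch: which kernel the public API would select,
// given the query tensor's dimensionality and the is_neox flag.
enum class Kernel { Rotary3D, Rotary4D, RotaryNeox4D };

Kernel pick_kernel(int64_t query_dim, bool is_neox) {
  if (query_dim == 4) {
    // 4D [batch, seq_len, num_heads, head_size]: NeoX vs. interleaved.
    return is_neox ? Kernel::RotaryNeox4D : Kernel::Rotary4D;
  }
  // 2D inputs are viewed as 3D [num_tokens, num_heads, head_size] first.
  return Kernel::Rotary3D;
}
```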

Usage

Use these kernels for all RoPE operations during CPU LLM inference. RoPE appears in virtually all modern LLM architectures (LLaMA, Mistral, Qwen, DeepSeek). The 3D layout is used during serving (token-by-token decode), while the 4D layout is used for batched prefill. The NeoX variant is needed for GPT-NeoX-derived architectures.

Code Reference

Source Location

Signature

// Public API
std::tuple<at::Tensor, at::Tensor> rotary_embedding_cpu(
    at::Tensor& positions,
    at::Tensor& query,
    at::Tensor& key,
    int64_t head_size,
    at::Tensor& cos_sin_cache,
    bool is_neox);

// Internal kernels
template <typename scalar_t>
void rotary_embedding_3D_kernel_impl(
    scalar_t* __restrict__ query_out,
    scalar_t* __restrict__ key_out,
    int64_t* __restrict__ positions,
    scalar_t* __restrict__ query,
    scalar_t* __restrict__ key,
    scalar_t* __restrict__ cos_sin_cache,
    int64_t num_tokens, int64_t num_heads,
    int64_t num_kv_heads, int64_t head_size,
    int64_t rotary_dim,
    int64_t query_stride_s, int64_t query_out_stride_s,
    int64_t key_out_stride_s, int64_t key_stride_s,
    int64_t query_stride_h, int64_t query_out_stride_h);

template <typename scalar_t>
void rotary_embedding_neox_4D_kernel_impl(
    int64_t* __restrict__ positions,
    scalar_t* __restrict__ query,
    scalar_t* __restrict__ key,
    scalar_t* __restrict__ cos_sin_cache,
    int64_t rotary_dim,
    int64_t query_stride_b, int64_t query_stride_s,
    int64_t query_stride_h,
    int64_t key_stride_b, int64_t key_stride_s,
    int64_t key_stride_h,
    int64_t num_heads, int64_t num_kv_heads,
    int64_t head_size,
    int64_t batch_size, int64_t seq_len);

Import

#include "common.h"
#include "vec.h"

I/O Contract

Inputs

Name Type Required Description
positions at::Tensor [num_tokens] Yes Token position indices for RoPE lookup (int64)
query at::Tensor Yes Query tensor in 2D [num_tokens, num_heads*head_size], 3D [num_tokens, num_heads, head_size], or 4D [batch, seq_len, num_heads, head_size]
key at::Tensor Yes Key tensor matching query layout
head_size int64_t Yes Size of each attention head
cos_sin_cache at::Tensor [max_pos, rotary_dim] Yes Precomputed cosine and sine values indexed by position
is_neox bool Yes Whether to use NeoX-style (split-half) rotation instead of interleaved

Outputs

Name Type Description
query_out at::Tensor Query with rotary position embedding applied (same shape as input query)
key_out at::Tensor Key with rotary position embedding applied (same shape as input key)
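For reference, one row of a cos_sin_cache of shape [max_pos, rotary_dim] could be built as sketched below. This is an assumption-laden illustration, not taken from the source: it assumes the common half-cos/half-sin row layout and the standard RoPE frequency base of 10000, and `make_cache_row` is a hypothetical name; in practice the cache is precomputed by the model's rotary embedding module.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Sketch: one cache row [rotary_dim] for position `pos`, laid out as
// [cos(pos*f_0)..cos(pos*f_{d/2-1}), sin(pos*f_0)..sin(pos*f_{d/2-1})]
// with f_i = base^(-2i/rotary_dim). Layout and base are assumptions.
std::vector<float> make_cache_row(int64_t pos, int64_t rotary_dim,
                                  double base = 10000.0) {
  std::vector<float> row(rotary_dim);
  for (int64_t i = 0; i < rotary_dim / 2; ++i) {
    double freq = std::pow(base, -2.0 * static_cast<double>(i) / rotary_dim);
    row[i] = static_cast<float>(std::cos(pos * freq));
    row[rotary_dim / 2 + i] = static_cast<float>(std::sin(pos * freq));
  }
  return row;
}
```

Position 0 yields cos = 1 and sin = 0 everywhere, so applying RoPE at position 0 is the identity, which is a convenient sanity check for the kernels above.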

Usage Examples

// Standard RoPE for 3D tensors (serving mode)
auto [query_out, key_out] = rotary_embedding_cpu(
    positions,         // [num_tokens] int64
    query,             // [num_tokens, num_heads, head_size]
    key,               // [num_tokens, num_kv_heads, head_size]
    /*head_size=*/128,
    cos_sin_cache,     // [max_pos, rotary_dim]
    /*is_neox=*/false);

// NeoX-style RoPE for 4D tensors (batch prefill)
auto [q_out, k_out] = rotary_embedding_cpu(
    positions,         // [batch * seq_len] int64
    query,             // [batch, seq_len, num_heads, head_size]
    key,               // [batch, seq_len, num_kv_heads, head_size]
    /*head_size=*/128,
    cos_sin_cache,     // [max_pos, rotary_dim]
    /*is_neox=*/true);
