Implementation:InternLM Lmdeploy Core ArrayOps

Knowledge Sources	InternLM_Lmdeploy
Domains	GPU_Kernels, Data_Structures
Last Updated	2026-02-07 15:00 GMT

Overview

Comprehensive set of element-wise arithmetic operators, memory access primitives, and reduction utilities for the Array<T, N> type.

Description

This header provides the operational layer on top of Array<T, N>. It includes: (1) element-wise binary operators (plus, minus, multiplies) for vector-vector and vector-scalar combinations, defined in the ops namespace; (2) the utility functions fill(), clear(), copy(), and cast() for array manipulation; (3) memory access functions (Load, Store, Ldg, Ldcs, Ldcg, Stcs, Stcg) that dispatch at compile time on the array size to emit vectorized 128-, 64-, or 32-bit memory operations; (4) shared memory operations (LdShared, StShared) implemented with inline PTX; (5) asynchronous copy via CpAsync on SM80 and newer; (6) blockSum, a block-level sum reduction that combines per-thread values across all warps of a block using shared-memory scratch space; and (7) matrix transpose helpers (transpose_m8n8_b16, transpose_m8n8_b32).
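The size-based dispatch described in item (3) can be illustrated with a host-side C++ sketch. Everything in it is an assumption made for illustration: the Array stand-in, Chunk, AccessWidth, and LoadSketch are hypothetical names, and the real header works on device memory with CUDA vector types rather than memcpy. The sketch only shows the idea of picking the widest access that evenly divides the array at compile time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for turbomind's Array<T, N>; not the real definition.
template<typename T, int N>
struct Array {
    T data[N];
    T&       operator[](int i) { return data[i]; }
    const T& operator[](int i) const { return data[i]; }
};

// Fixed-width chunk standing in for a 128/64/32-bit vector type.
template<std::size_t Bytes>
struct Chunk {
    unsigned char bytes[Bytes];
};

// Widest access width in bytes (16, 8, or 4) that evenly divides the array,
// falling back to one element at a time. Evaluated at compile time.
template<typename T, int N>
constexpr std::size_t AccessWidth()
{
    return sizeof(Array<T, N>) % 16 == 0 ? 16 :
           sizeof(Array<T, N>) % 8 == 0  ? 8 :
           sizeof(Array<T, N>) % 4 == 0  ? 4 : sizeof(T);
}

// Copies the whole array in chunks of AccessWidth bytes and returns the
// access width in bits so the chosen path can be inspected.
template<typename T, int N>
std::size_t LoadSketch(Array<T, N>& dst, const T* src)
{
    constexpr std::size_t kWidth = AccessWidth<T, N>();
    constexpr std::size_t kCount = sizeof(Array<T, N>) / kWidth;

    const unsigned char* in  = reinterpret_cast<const unsigned char*>(src);
    unsigned char*       out = reinterpret_cast<unsigned char*>(&dst);
    for (std::size_t i = 0; i < kCount; ++i) {
        Chunk<kWidth> c;
        std::memcpy(&c, in + i * kWidth, kWidth);   // one "vector" read
        std::memcpy(out + i * kWidth, &c, kWidth);  // one "vector" write
    }
    return kWidth * 8;
}
```

With this sketch, an array of eight 16-bit elements occupies 16 bytes and takes the 128-bit path, while an array of two floats takes the 64-bit path. Because the width is a compile-time constant, the loop has a fixed trip count that a compiler can fully unroll.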

Usage

Use these functions whenever performing arithmetic on Array fragments inside CUDA kernels, loading/storing data from global or shared memory, or performing block-level reductions in the TurboMind inference pipeline.

Code Reference

Source Location

Repository: InternLM_Lmdeploy
File: src/turbomind/kernels/core/array_ops.h

Signature

// Arithmetic (in namespace ops)
template<typename T, int N> Array<T, N> operator+(const Array<T, N>& a, const Array<T, N>& b);
template<typename T, int N> Array<T, N> operator*(const Array<T, N>& a, const Array<T, N>& b);

// Utility
template<typename To, typename From, int N> Array<To, N> cast(const Array<From, N>& src);
template<class T, int N> void fill(Array<T, N>& x, T val);
template<class T, int N> void clear(Array<T, N>& x);

// Memory access
template<typename T, int N> void Load(Array<T, N>& dst, const T* src);
template<typename T, int N> void Store(T* dst, const Array<T, N>& src);
template<typename T, int N> void Ldg(Array<T, N>& dst, const T* src);
template<typename T, int N> void LdShared(Array<T, N>& dst, uint32_t uintptr);

// Reduction
template<int kWarpCount, typename T, int N>
Array<T, N> blockSum(Array<T, N> val, T* smem_red, int warp_id, int lane_id);
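The arithmetic and utility signatures above can be exercised on the host with a minimal sketch. Array is again a simplified stand-in, and ops_sketch, cast_sketch, fill_sketch, and AddThenScale are hypothetical names that mirror the shape of the header's ops namespace, cast(), and fill() without being the actual device-side definitions.

```cpp
#include <cassert>

// Hypothetical stand-in for turbomind's Array<T, N>; not the real definition.
template<typename T, int N>
struct Array {
    T data[N];
    T&       operator[](int i) { return data[i]; }
    const T& operator[](int i) const { return data[i]; }
};

namespace ops_sketch {

// Vector + vector, element by element.
template<typename T, int N>
Array<T, N> operator+(const Array<T, N>& a, const Array<T, N>& b)
{
    Array<T, N> c{};
    for (int i = 0; i < N; ++i) c[i] = a[i] + b[i];
    return c;
}

// Vector * vector, element by element.
template<typename T, int N>
Array<T, N> operator*(const Array<T, N>& a, const Array<T, N>& b)
{
    Array<T, N> c{};
    for (int i = 0; i < N; ++i) c[i] = a[i] * b[i];
    return c;
}

// Vector * scalar: the scalar is applied to every element.
template<typename T, int N>
Array<T, N> operator*(const Array<T, N>& a, const T& s)
{
    Array<T, N> c{};
    for (int i = 0; i < N; ++i) c[i] = a[i] * s;
    return c;
}

}  // namespace ops_sketch

// Converts each element with static_cast.
template<typename To, typename From, int N>
Array<To, N> cast_sketch(const Array<From, N>& src)
{
    Array<To, N> dst{};
    for (int i = 0; i < N; ++i) dst[i] = static_cast<To>(src[i]);
    return dst;
}

// Sets every element to val.
template<class T, int N>
void fill_sketch(Array<T, N>& x, T val)
{
    for (int i = 0; i < N; ++i) x[i] = val;
}

// The operators are only visible after pulling in the namespace, which is
// why the usage example in this page writes `using namespace ops;`.
inline Array<float, 4> AddThenScale(const Array<float, 4>& a, const Array<float, 4>& b, float s)
{
    using namespace ops_sketch;
    return (a + b) * s;
}
```

Keeping the operators in a dedicated namespace means they apply only where a kernel opts in with a using-directive, so Array itself stays a plain aggregate.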

Import

#include "src/turbomind/kernels/core/array_ops.h"

I/O Contract

Inputs

Name	Type	Required	Description
a, b	Array<T, N>	Yes	Input arrays for binary operations
src	const T*	Yes	Source pointer for load operations (global or shared memory)
dst	T*	Yes	Destination pointer for store operations
val	Array<T, N>	Yes	Per-thread values for blockSum reduction
smem_red	T*	Yes	Shared memory scratch space for blockSum

Outputs

Name	Type	Description
result	Array<T, N>	Result of arithmetic or cast operation
dst	Array<T, N>&	Loaded data for Load/Ldg/LdShared
blockSum return	Array<T, N>	Block-wide sum across all warps
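The blockSum contract (per-thread values in, one block-wide sum out, with shared-memory scratch in between) matches a common two-stage reduction pattern, which can be simulated on the host as below. This is a sketch under stated assumptions, not the header's implementation: BlockSumSketch is a hypothetical name, the butterfly exchange inside a warp and the scratch layout of N * kWarpCount elements are assumptions, and threads are modeled as entries of a std::vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kWarpSize = 32;

// Hypothetical stand-in for turbomind's Array<T, N>; not the real definition.
template<typename T, int N>
struct Array {
    T data[N];
    T&       operator[](int i) { return data[i]; }
    const T& operator[](int i) const { return data[i]; }
};

// Host simulation of one block with kWarpCount * 32 threads.
// vals[t] is thread t's input; the return value is what every thread
// would hold after the reduction.
template<int kWarpCount, typename T, int N>
Array<T, N> BlockSumSketch(std::vector<Array<T, N>> vals)
{
    assert(vals.size() == static_cast<std::size_t>(kWarpCount) * kWarpSize);
    std::vector<T> smem(static_cast<std::size_t>(N) * kWarpCount);  // assumed scratch layout

    // Stage 1: butterfly reduction inside each warp. In each round a lane
    // adds the value held by the lane whose index differs in one bit.
    for (int w = 0; w < kWarpCount; ++w) {
        Array<T, N>* warp = &vals[static_cast<std::size_t>(w) * kWarpSize];
        for (int mask = kWarpSize / 2; mask >= 1; mask /= 2) {
            std::vector<Array<T, N>> prev(warp, warp + kWarpSize);
            for (int lane = 0; lane < kWarpSize; ++lane) {
                for (int i = 0; i < N; ++i) {
                    warp[lane][i] = prev[lane][i] + prev[lane ^ mask][i];
                }
            }
        }
        // Lane 0 publishes the warp's partial sum to the scratch buffer.
        for (int i = 0; i < N; ++i) smem[i * kWarpCount + w] = warp[0][i];
    }

    // Stage 2: after a barrier on the GPU, each thread would add up the
    // kWarpCount partial sums for every element.
    Array<T, N> total{};
    for (int i = 0; i < N; ++i) {
        for (int w = 0; w < kWarpCount; ++w) total[i] += smem[i * kWarpCount + w];
    }
    return total;
}
```

In this model the template argument fixes the number of warps at compile time, so a block of 128 threads corresponds to kWarpCount = 4, the same value used in the usage example on this page.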

Usage Examples

using namespace turbomind;

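// Vectorized load from global memory
// (global_ptr is a const half*; 8 halves form one 128-bit access,
//  which requires a 16-byte-aligned address)
Array<half, 8> frag;
Ldg(frag, global_ptr);

// Element-wise add and multiply (the operators live in namespace ops)
using namespace ops;
Array<float, 4> a, b, c;
fill(a, 1.0f);
fill(b, 2.0f);
c = a + b;  // element-wise sum
c = a * b;  // element-wise product

// Block-level sum reduction across 4 warps
// (val is this thread's Array<T, N>; smem is a shared-memory scratch buffer of T)
auto sum = blockSum<4>(val, smem, warp_id, lane_id);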

Related Pages

Environment:InternLM_Lmdeploy_CUDA_GPU_Runtime
