Implementation:InternLM Lmdeploy Core ArrayOps

From Leeroopedia


Knowledge Sources
Domains GPU_Kernels, Data_Structures
Last Updated 2026-02-07 15:00 GMT

Overview

A set of element-wise arithmetic operators, memory access primitives, and reduction utilities for the Array<T, N> type.

Description

This header provides the operational layer on top of Array<T, N>. It includes:

(1) element-wise binary operators (plus, minus, multiplies) for vector-vector and vector-scalar combinations, in the ops namespace;
(2) utility functions fill(), clear(), copy(), and cast() for array manipulation;
(3) memory access functions (Load, Store, Ldg, Ldcs, Ldcg, Stcs, Stcg) that dispatch at compile time on array size to emit vectorized 128-, 64-, or 32-bit memory operations;
(4) shared memory operations (LdShared, StShared) implemented with inline PTX;
(5) asynchronous copy via CpAsync on SM80 and later;
(6) blockSum, a block-wide sum reduction across warps;
(7) matrix transpose helpers (transpose_m8n8_b16, transpose_m8n8_b32).
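The size-based dispatch described for the memory access functions can be illustrated with a host-side sketch. This is not the header's code: the Array stand-in, Word128 (playing the role of CUDA's uint4), copy_as, and load_sketch are hypothetical names introduced here, and the branch conditions are an assumption about how such a dispatch might look. It needs C++17 for `if constexpr`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the library's Array<T, N>; not the real definition.
template<typename T, int N>
struct Array {
    T data[N];
    T&       operator[](int i) { return data[i]; }
    const T& operator[](int i) const { return data[i]; }
};

// Hypothetical 16-byte word, playing the role of CUDA's uint4.
struct Word128 {
    std::uint32_t x, y, z, w;
};

// Copy the whole array in chunks of sizeof(Word) bytes.
// memcpy keeps the host sketch free of alignment and aliasing problems;
// a device implementation would issue real vector loads instead.
template<typename Word, typename T, int N>
void copy_as(Array<T, N>& dst, const T* src)
{
    constexpr std::size_t kWords = sizeof(Array<T, N>) / sizeof(Word);
    for (std::size_t i = 0; i < kWords; ++i) {
        Word w;
        std::memcpy(&w, reinterpret_cast<const char*>(src) + i * sizeof(Word), sizeof(Word));
        std::memcpy(reinterpret_cast<char*>(&dst) + i * sizeof(Word), &w, sizeof(Word));
    }
}

// Pick the widest word that evenly divides the array size, at compile time.
// Returns the chosen access width in bits so the choice can be inspected.
template<typename T, int N>
int load_sketch(Array<T, N>& dst, const T* src)
{
    assert(src != nullptr);
    constexpr std::size_t kBytes = sizeof(Array<T, N>);
    if constexpr (kBytes % sizeof(Word128) == 0) {
        copy_as<Word128>(dst, src);
        return 128;
    }
    else if constexpr (kBytes % sizeof(std::uint64_t) == 0) {
        copy_as<std::uint64_t>(dst, src);
        return 64;
    }
    else if constexpr (kBytes % sizeof(std::uint32_t) == 0) {
        copy_as<std::uint32_t>(dst, src);
        return 32;
    }
    else {
        // Fallback: element-by-element copy.
        for (int i = 0; i < N; ++i) {
            dst[i] = src[i];
        }
        return static_cast<int>(8 * sizeof(T));
    }
}
```

Under these assumptions, an Array<float, 4> (16 bytes) takes the 128-bit path, while a 4-element array of 16-bit values (8 bytes) takes the 64-bit path. On the device, vector accesses of this kind also require the pointer to be aligned to the access width, which the memcpy-based host sketch does not model.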

Usage

Use these functions whenever performing arithmetic on Array fragments inside CUDA kernels, loading/storing data from global or shared memory, or performing block-level reductions in the TurboMind inference pipeline.

Code Reference

Source Location

Signature

// Arithmetic (in namespace ops)
template<typename T, int N> Array<T, N> operator+(const Array<T, N>& a, const Array<T, N>& b);
template<typename T, int N> Array<T, N> operator*(const Array<T, N>& a, const Array<T, N>& b);

// Utility
template<typename To, typename From, int N> Array<To, N> cast(const Array<From, N>& src);
template<class T, int N> void fill(Array<T, N>& x, T val);
template<class T, int N> void clear(Array<T, N>& x);

// Memory access
template<typename T, int N> void Load(Array<T, N>& dst, const T* src);
template<typename T, int N> void Store(T* dst, const Array<T, N>& src);
template<typename T, int N> void Ldg(Array<T, N>& dst, const T* src);
template<typename T, int N> void LdShared(Array<T, N>& dst, uint32_t uintptr);

// Reduction
template<int kWarpCount, typename T, int N>
Array<T, N> blockSum(Array<T, N> val, T* smem_red, int warp_id, int lane_id);
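To make the arithmetic and utility signatures above concrete, here is a minimal host-side sketch of what element-wise semantics look like. Everything lives in a hypothetical `sketch` namespace with a stand-in Array type; the function bodies are assumptions based on the signatures and descriptions on this page, not the header's actual code (which targets device code).

```cpp
#include <cassert>

namespace sketch {

// Hypothetical stand-in for the library's Array<T, N>.
template<typename T, int N>
struct Array {
    T data[N];
    T& operator[](int i)
    {
        assert(i >= 0 && i < N);
        return data[i];
    }
    const T& operator[](int i) const
    {
        assert(i >= 0 && i < N);
        return data[i];
    }
};

// Operators sit in a nested namespace, so callers opt in with
// `using namespace ops;` as in the usage examples on this page.
namespace ops {

// Vector + vector, element by element.
template<typename T, int N>
Array<T, N> operator+(const Array<T, N>& a, const Array<T, N>& b)
{
    Array<T, N> c{};
    for (int i = 0; i < N; ++i) {
        c[i] = a[i] + b[i];
    }
    return c;
}

// Vector * vector, element by element.
template<typename T, int N>
Array<T, N> operator*(const Array<T, N>& a, const Array<T, N>& b)
{
    Array<T, N> c{};
    for (int i = 0; i < N; ++i) {
        c[i] = a[i] * b[i];
    }
    return c;
}

// Vector * scalar: the scalar is applied to every element.
template<typename T, int N>
Array<T, N> operator*(const Array<T, N>& a, const T& s)
{
    Array<T, N> c{};
    for (int i = 0; i < N; ++i) {
        c[i] = a[i] * s;
    }
    return c;
}

}  // namespace ops

// Convert each element from From to To.
template<typename To, typename From, int N>
Array<To, N> cast(const Array<From, N>& src)
{
    Array<To, N> dst{};
    for (int i = 0; i < N; ++i) {
        dst[i] = static_cast<To>(src[i]);
    }
    return dst;
}

// Set every element to val.
template<class T, int N>
void fill(Array<T, N>& x, T val)
{
    for (int i = 0; i < N; ++i) {
        x[i] = val;
    }
}

// Set every element to zero.
template<class T, int N>
void clear(Array<T, N>& x)
{
    fill(x, T(0));
}

}  // namespace sketch
```

Keeping the operators in a separate namespace means they only take part in overload resolution where a caller has asked for them, which avoids surprising matches in unrelated code.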

Import

#include "src/turbomind/kernels/core/array_ops.h"

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| a, b | Array<T, N> | Yes | Input arrays for binary operations |
| src | const T* | Yes | Source pointer for load operations (global or shared memory) |
| dst | T* | Yes | Destination pointer for store operations |
| val | Array<T, N> | Yes | Per-thread values for blockSum reduction |
| smem_red | T* | Yes | Shared memory scratch space for blockSum |

Outputs

| Name | Type | Description |
|------|------|-------------|
| result | Array<T, N> | Result of arithmetic or cast operation |
| dst | Array<T, N>& | Loaded data for Load/Ldg/LdShared |
| blockSum return | Array<T, N> | Block-wide sum across all warps |

Usage Examples

using namespace turbomind;

// Vectorized load from global memory
Array<half, 8> frag;
Ldg(frag, global_ptr);

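// Element-wise add and multiply (the operators live in namespace ops)
using namespace ops;
Array<float, 4> a, b, c;
fill(a, 1.0f);  // initialize the inputs before reading them
fill(b, 2.0f);
c = a + b;
c = a * b;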

// Block-level sum reduction with kWarpCount = 4
// (val: per-thread Array, smem: shared-memory scratch space)
auto sum = blockSum<4>(val, smem, warp_id, lane_id);
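
A block-wide sum is commonly built in two stages: reduce within each warp, then combine the per-warp partial sums through shared memory. The sketch below emulates that pattern on the host for scalar values. It is an assumption about the general technique, not a transcription of blockSum: the names warp_sum_emulated and block_sum_emulated are hypothetical, plain arrays stand in for registers and shared memory, and the real function operates on Array<T, N> per thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kWarpSize = 32;

// Stage 1: emulate a butterfly (XOR) exchange within one warp.
// After log2(32) = 5 rounds, every lane holds the warp's total.
// On the device, each round would be a warp shuffle rather than an array read.
inline void warp_sum_emulated(float* lanes)
{
    for (int mask = kWarpSize / 2; mask >= 1; mask /= 2) {
        float next[kWarpSize];
        for (int lane = 0; lane < kWarpSize; ++lane) {
            next[lane] = lanes[lane] + lanes[lane ^ mask];
        }
        for (int lane = 0; lane < kWarpSize; ++lane) {
            lanes[lane] = next[lane];
        }
    }
}

// Stage 2: one partial sum per warp goes into scratch space (shared memory
// on the device), then every thread adds up all the partial sums.
// val holds one value per thread; the result holds the block total per thread.
template<int kWarpCount>
std::vector<float> block_sum_emulated(std::vector<float> val)
{
    assert(val.size() == static_cast<std::size_t>(kWarpCount * kWarpSize));

    float smem_red[kWarpCount];  // stand-in for the shared-memory scratch buffer
    for (int warp_id = 0; warp_id < kWarpCount; ++warp_id) {
        warp_sum_emulated(val.data() + warp_id * kWarpSize);
        smem_red[warp_id] = val[warp_id * kWarpSize];  // lane 0 publishes the warp total
    }

    // A block-wide barrier would be needed here on the device.

    float total = 0.0f;
    for (int warp_id = 0; warp_id < kWarpCount; ++warp_id) {
        total += smem_red[warp_id];
    }
    return std::vector<float>(val.size(), total);
}
```

With four warps and every thread contributing 1.0f, each of the 128 emulated threads ends up with 128.0f. The scratch buffer needs only one slot per warp, which is why the device signature takes kWarpCount as a compile-time parameter alongside the smem_red pointer.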
