Implementation:InternLM Lmdeploy Core ArrayOps
| Knowledge Sources | |
|---|---|
| Domains | GPU_Kernels, Data_Structures |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
Comprehensive set of element-wise arithmetic operators, memory access primitives, and reduction utilities for the Array<T, N> type.
Description
This header provides the operational layer on top of Array<T, N>. It includes:
- element-wise binary operators (plus, minus, multiplies) for vector-vector and vector-scalar combinations, defined in the ops namespace;
- utility functions fill(), clear(), copy(), and cast() for array manipulation;
- optimized memory access functions (Load, Store, Ldg, Ldcs, Ldcg, Stcs, Stcg) that dispatch at compile time on the array size to emit vectorized 128/64/32-bit memory operations;
- shared memory operations (LdShared, StShared) implemented with inline PTX;
- asynchronous copy via CpAsync for SM80+;
- a blockSum utility that reduces per-thread values to a block-wide sum across warps;
- matrix transpose helpers (transpose_m8n8_b16, transpose_m8n8_b32).
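The size-based dispatch described for the memory access functions can be modeled on the host. The following is a minimal C++17 sketch, not the code in array_ops.h: the `Array` struct is a simplified stand-in, `access_bytes` and `load_sketch` are hypothetical names, and the exact conditions the real header checks may differ. It only shows how `if constexpr` can pick a 16-, 8-, or 4-byte access width from `sizeof(T) * N` at compile time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Simplified stand-in for turbomind's Array<T, N> (assumption, not the real type).
template<typename T, int N>
struct Array {
    T data[N];
    T&       operator[](int i)       { return data[i]; }
    const T& operator[](int i) const { return data[i]; }
};

// Hypothetical helper: widest chunk in bytes (16, 8, or 4) that evenly
// divides the array, or 0 to request an element-by-element fallback.
template<typename T, int N>
constexpr std::size_t access_bytes()
{
    constexpr std::size_t total = sizeof(T) * N;
    if constexpr (total % 16 == 0) { return 16; }
    else if constexpr (total % 8 == 0) { return 8; }
    else if constexpr (total % 4 == 0) { return 4; }
    else { return 0; }
}

// Hypothetical host-side model of a dispatched load. It copies the array in
// chunks of access_bytes() and returns how many copy operations were issued.
template<typename T, int N>
int load_sketch(Array<T, N>& dst, const T* src)
{
    constexpr std::size_t chunk = access_bytes<T, N>();
    if constexpr (chunk == 0) {
        // No suitable vector width: copy one element at a time.
        for (int i = 0; i < N; ++i) {
            dst[i] = src[i];
        }
        return N;
    } else {
        constexpr int count = static_cast<int>(sizeof(T) * N / chunk);
        for (int i = 0; i < count; ++i) {
            std::memcpy(reinterpret_cast<char*>(dst.data) + i * chunk,
                        reinterpret_cast<const char*>(src) + i * chunk,
                        chunk);
        }
        return count;
    }
}
```

On a GPU, a vectorized access of a given width also requires the pointer to be aligned to that width; this host model ignores alignment because `std::memcpy` has no such requirement.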
Usage
Use these functions whenever performing arithmetic on Array fragments inside CUDA kernels, loading/storing data from global or shared memory, or performing block-level reductions in the TurboMind inference pipeline.
Code Reference
Source Location
- Repository: InternLM_Lmdeploy
- File: src/turbomind/kernels/core/array_ops.h
Signature
// Arithmetic (in namespace ops)
template<typename T, int N> Array<T, N> operator+(const Array<T, N>& a, const Array<T, N>& b);
template<typename T, int N> Array<T, N> operator*(const Array<T, N>& a, const Array<T, N>& b);
// Utility
template<typename To, typename From, int N> Array<To, N> cast(const Array<From, N>& src);
template<class T, int N> void fill(Array<T, N>& x, T val);
template<class T, int N> void clear(Array<T, N>& x);
// Memory access
template<typename T, int N> void Load(Array<T, N>& dst, const T* src);
template<typename T, int N> void Store(T* dst, const Array<T, N>& src);
template<typename T, int N> void Ldg(Array<T, N>& dst, const T* src);
template<typename T, int N> void LdShared(Array<T, N>& dst, uint32_t uintptr);
// Reduction
template<int kWarpCount, typename T, int N>
Array<T, N> blockSum(Array<T, N> val, T* smem_red, int warp_id, int lane_id);
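To make the arithmetic and utility signatures concrete, here is a minimal host-side C++ sketch with the same shapes. It is an illustration under assumptions rather than the header's implementation: `Array` is a simplified stand-in, the function bodies are plain loops, and the device qualifiers that a CUDA header would carry are omitted. The vector-scalar overload is included because the description mentions vector-scalar combinations; its exact signature in array_ops.h is assumed.

```cpp
#include <cassert>

// Simplified stand-in for turbomind's Array<T, N> (assumption, not the real type).
template<typename T, int N>
struct Array {
    T data[N];
    T&       operator[](int i)       { return data[i]; }
    const T& operator[](int i) const { return data[i]; }
};

namespace ops {

// Element-wise vector + vector.
template<typename T, int N>
Array<T, N> operator+(const Array<T, N>& a, const Array<T, N>& b)
{
    Array<T, N> c;
    for (int i = 0; i < N; ++i) { c[i] = a[i] + b[i]; }
    return c;
}

// Element-wise vector * vector.
template<typename T, int N>
Array<T, N> operator*(const Array<T, N>& a, const Array<T, N>& b)
{
    Array<T, N> c;
    for (int i = 0; i < N; ++i) { c[i] = a[i] * b[i]; }
    return c;
}

// Vector * scalar (signature assumed for this sketch).
template<typename T, int N>
Array<T, N> operator*(const Array<T, N>& a, const T& b)
{
    Array<T, N> c;
    for (int i = 0; i < N; ++i) { c[i] = a[i] * b; }
    return c;
}

}  // namespace ops

// Set every element to val.
template<class T, int N>
void fill(Array<T, N>& x, T val)
{
    for (int i = 0; i < N; ++i) { x[i] = val; }
}

// Set every element to zero.
template<class T, int N>
void clear(Array<T, N>& x)
{
    fill(x, T(0));
}

// Convert each element from From to To with static_cast.
template<typename To, typename From, int N>
Array<To, N> cast(const Array<From, N>& src)
{
    Array<To, N> dst;
    for (int i = 0; i < N; ++i) { dst[i] = static_cast<To>(src[i]); }
    return dst;
}
```

Because the operators live in a separate namespace from `Array`, argument-dependent lookup does not find them in this sketch; callers bring them into scope with `using namespace ops;`, as the usage examples in this page do. Note also that `fill(x, val)` deduces `T` from both arguments, so `val` must already have the array's element type.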
Import
#include "src/turbomind/kernels/core/array_ops.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| a, b | Array<T, N> | Yes | Input arrays for binary operations |
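| src | const T* | Yes | Source pointer for Load and Ldg; LdShared takes a 32-bit shared memory address instead (see uintptr) |
| uintptr | uint32_t | Yes | Shared memory address passed to LdShared |
| dst | T* | Yes | Destination pointer for store operations |
| val | Array<T, N> | Yes | Per-thread values for blockSum reduction |
| smem_red | T* | Yes | Shared memory scratch space for blockSum |
| warp_id, lane_id | int | Yes | Warp index and lane index of the calling thread, passed to blockSum |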
Outputs
| Name | Type | Description |
|---|---|---|
| result | Array<T, N> | Result of arithmetic or cast operation |
| dst | Array<T, N>& | Loaded data for Load/Ldg/LdShared |
| blockSum return | Array<T, N> | Block-wide sum across all warps |
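The blockSum contract (per-thread values in, one block-wide sum out, with shared scratch in between) can be illustrated with a sequential host-side model. This is a sketch of a generic two-stage block reduction, not the header's implementation: `block_sum_model` is a hypothetical name, the simulated threads are iterated in a loop instead of running in parallel, and the scratch layout (`kWarpCount * N` elements) is an assumption of this example. A GPU version would typically use warp shuffles for stage 1 and a barrier before stage 2.

```cpp
#include <cassert>
#include <vector>

constexpr int kWarpSize = 32;

// Simplified stand-in for turbomind's Array<T, N> (assumption, not the real type).
template<typename T, int N>
struct Array {
    T data[N];
    T&       operator[](int i)       { return data[i]; }
    const T& operator[](int i) const { return data[i]; }
};

// Hypothetical host model. vals holds one Array per simulated thread
// (kWarpCount * 32 entries, ordered by warp then lane). smem_red must have
// room for kWarpCount * N elements (layout chosen for this sketch).
template<int kWarpCount, typename T, int N>
Array<T, N> block_sum_model(const std::vector<Array<T, N>>& vals, T* smem_red)
{
    // Stage 1: each warp sums its 32 lanes and writes one partial per element.
    for (int warp_id = 0; warp_id < kWarpCount; ++warp_id) {
        for (int i = 0; i < N; ++i) {
            T acc = T(0);
            for (int lane_id = 0; lane_id < kWarpSize; ++lane_id) {
                acc += vals[warp_id * kWarpSize + lane_id][i];
            }
            smem_red[warp_id * N + i] = acc;
        }
    }
    // Stage 2: combine the per-warp partials into the block-wide sum.
    Array<T, N> total{};
    for (int warp_id = 0; warp_id < kWarpCount; ++warp_id) {
        for (int i = 0; i < N; ++i) {
            total[i] += smem_red[warp_id * N + i];
        }
    }
    return total;
}
```

In this model every simulated thread would observe the same returned `total`, which matches the "block-wide sum across all warps" output described in the table.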
Usage Examples
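using namespace turbomind;
// Vectorized load from global memory.
// global_ptr is assumed to be a const half* aligned for a 16-byte (8 x half) access.
Array<half, 8> frag;
Ldg(frag, global_ptr);
// Element-wise addition and multiplication
using namespace ops;
Array<float, 4> a, b, c;
fill(a, 1.f);
fill(b, 2.f);
c = a + b;  // each element is 3
c = a * b;  // each element is 2
// Block-level sum reduction over 4 warps.
// val holds the per-thread values, smem is shared memory scratch space,
// and warp_id / lane_id identify the calling thread.
auto sum = blockSum<4>(val, smem, warp_id, lane_id);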