Implementation:Ggml org Ggml Sycl copy

Knowledge Sources	GGML
Domains	ML_Infrastructure, GPU_Compute
Last Updated	2025-05-15 12:00 GMT

Overview

SYCL tensor copy kernels supporting format conversion across a wide range of type combinations including f32, f16, i16, i32, and quantized formats.

Description

cpy.cpp implements the GGML_OP_CPY (copy) and GGML_OP_DUP (duplicate) operations for the SYCL backend. The module provides:

Per-element copy functions: Simple conversion functions (cpy_1_f32_f32, cpy_1_f32_f16, cpy_1_f16_f16, cpy_1_f16_f32, cpy_1_i16_i16, cpy_1_i32_i32) that each copy a single element, converting its format where the source and destination types differ, using SYCL vector conversion intrinsics.
Strided copy kernel: The cpy_f32_f16 template kernel is, despite its name, generic over the per-element function passed as the cpy_1 template parameter. It maps a flat element index back to 4D tensor coordinates (i00-i03 for source, i10-i13 for destination), computes source and destination byte offsets from the per-dimension strides (nb00-nb03 and nb10-nb13), and invokes cpy_1 on the resulting addresses. This supports non-contiguous memory layouts.
Block-wise quantized copy: cpy_blck_q_q for same-type quantized block copies, and cpy_blck_q8_0_f32 for dequantizing q8_0 blocks to f32 during copy.
Dispatch function: ggml_sycl_cpy selects the correct kernel based on the source and destination type combination, supporting f32->f32, f32->f16, f16->f16, f16->f32, f32->q8_0, f32->q4_0, f32->q4_1, and quantized-to-quantized copies.
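The index mapping used by the strided copy kernel can be illustrated on the host. The following is a minimal sketch, not the code in cpy.cpp: a plain loop stands in for the SYCL work-items, and strided_offset, cpy_strided_host, and transpose_copy_demo are hypothetical names introduced only for this example. The parameter names follow the cpy_f32_f16 signature.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Per-element copy signature, mirroring the cpy_1_* functions on this page.
using cpy_kernel_t = void (*)(const char * cxi, char * cdsti);

// Same-type f32 copy of one element.
static void cpy_1_f32_f32(const char * cxi, char * cdsti) {
    std::memcpy(cdsti, cxi, sizeof(float));
}

// Hypothetical helper: byte offset of flat element index i in a tensor with
// extents ne0..ne2 (the fourth coordinate follows from i) and byte strides nb0..nb3.
static int strided_offset(int i, int ne0, int ne1, int ne2,
                          int nb0, int nb1, int nb2, int nb3) {
    assert(ne0 > 0 && ne1 > 0 && ne2 > 0);
    const int i3 = i / (ne0 * ne1 * ne2);
    const int i2 = (i - i3 * ne0 * ne1 * ne2) / (ne0 * ne1);
    const int i1 = (i - i3 * ne0 * ne1 * ne2 - i2 * ne0 * ne1) / ne0;
    const int i0 = i - i3 * ne0 * ne1 * ne2 - i2 * ne0 * ne1 - i1 * ne0;
    return i0 * nb0 + i1 * nb1 + i2 * nb2 + i3 * nb3;
}

// Host-side stand-in for the kernel: each loop iteration plays one work-item.
template <cpy_kernel_t cpy_1>
static void cpy_strided_host(const char * cx, char * cdst, int ne,
                             int ne00, int ne01, int ne02,
                             int nb00, int nb01, int nb02, int nb03,
                             int ne10, int ne11, int ne12,
                             int nb10, int nb11, int nb12, int nb13) {
    for (int i = 0; i < ne; ++i) {
        const int x_offset   = strided_offset(i, ne00, ne01, ne02, nb00, nb01, nb02, nb03);
        const int dst_offset = strided_offset(i, ne10, ne11, ne12, nb10, nb11, nb12, nb13);
        cpy_1(cx + x_offset, cdst + dst_offset);
    }
}

// Copies a transposed (non-contiguous) view of a row-major 2x3 matrix
// into a contiguous destination.
static std::vector<float> transpose_copy_demo() {
    const float src[6] = {0, 1, 2, 3, 4, 5};
    std::vector<float> dst(6);
    cpy_strided_host<cpy_1_f32_f32>(
        reinterpret_cast<const char *>(src), reinterpret_cast<char *>(dst.data()), 6,
        /*ne00..ne02*/ 2, 3, 1, /*nb00..nb03*/ 12, 4, 24, 24,   // transposed source view
        /*ne10..ne12*/ 2, 3, 1, /*nb10..nb13*/ 4, 8, 24, 24);   // contiguous destination
    return dst;
}
```

In the demo the source strides are swapped relative to a contiguous layout, so the same flat index reads from the transposed position while writing sequentially. This is the sense in which the kernel supports non-contiguous layouts.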

Usage

The main SYCL backend calls these entry points when the compute graph contains CPY or DUP operations. ggml_sycl_cpy handles both same-type copies and format conversions, including quantization during the copy.
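To make "quantization during copy" concrete, here is a minimal host-side sketch of a q8_0 round trip (f32 to q8_0 and back), under stated assumptions. A q8_0 block groups 32 values with one shared scale; the real format stores that scale in half precision, whereas this sketch uses a float so that it needs no half type. block_q8_0_sketch and the three functions are hypothetical names for illustration and are not taken from cpy.cpp.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int QK8_0 = 32;  // elements per q8_0 block

// Simplified block layout (assumption: float scale instead of half).
struct block_q8_0_sketch {
    float  d;          // per-block scale
    int8_t qs[QK8_0];  // quantized values in [-127, 127]
};

// Quantize one block: the scale maps the largest magnitude to 127.
static void quantize_block_q8_0(const float * x, block_q8_0_sketch * y) {
    assert(x != nullptr && y != nullptr);
    float amax = 0.0f;
    for (int j = 0; j < QK8_0; ++j) {
        amax = std::fmax(amax, std::fabs(x[j]));
    }
    const float d  = amax / 127.0f;
    const float id = (d != 0.0f) ? 1.0f / d : 0.0f;  // avoid dividing by zero
    y->d = d;
    for (int j = 0; j < QK8_0; ++j) {
        y->qs[j] = static_cast<int8_t>(std::round(x[j] * id));
    }
}

// Dequantize one block: multiply each stored value by the block scale.
static void dequantize_block_q8_0(const block_q8_0_sketch * x, float * y) {
    assert(x != nullptr && y != nullptr);
    for (int j = 0; j < QK8_0; ++j) {
        y[j] = x->d * static_cast<float>(x->qs[j]);
    }
}

// Largest absolute error after quantizing and dequantizing one block.
static float max_roundtrip_error(const float * x) {
    block_q8_0_sketch blk;
    float out[QK8_0];
    quantize_block_q8_0(x, &blk);
    dequantize_block_q8_0(&blk, out);
    float err = 0.0f;
    for (int j = 0; j < QK8_0; ++j) {
        err = std::fmax(err, std::fabs(out[j] - x[j]));
    }
    return err;
}
```

Because values are rounded to the nearest multiple of the scale, the round-trip error per element is bounded by roughly half the block scale. A copy into a quantized destination is therefore lossy, unlike the same-type paths.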

Code Reference

Source Location

Repository: GGML
File: src/ggml-sycl/cpy.cpp
Lines: 602

Signatures

// Per-element copy functions
static void cpy_1_f32_f32(const char * cxi, char * cdsti);
static void cpy_1_f32_f16(const char * cxi, char * cdsti);
static void cpy_1_f16_f16(const char * cxi, char * cdsti);
static void cpy_1_f16_f32(const char * cxi, char * cdsti);
static void cpy_1_i16_i16(const char * cxi, char * cdsti);
static void cpy_1_i32_i32(const char * cxi, char * cdsti);

// Strided copy kernel template
template <cpy_kernel_t cpy_1>
static void cpy_f32_f16(const char * cx, char * cdst, const int ne,
                        const int ne00, const int ne01, const int ne02,
                        const int nb00, const int nb01, const int nb02, const int nb03,
                        const int ne10, const int ne11, const int ne12,
                        const int nb10, const int nb11, const int nb12, const int nb13,
                        const sycl::nd_item<3> & item_ct1);

// Public entry points
void ggml_sycl_cpy(ggml_backend_sycl_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1);
void ggml_sycl_dup(ggml_backend_sycl_context & ctx, ggml_tensor * dst);
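
The type-pair selection performed by ggml_sycl_cpy can be pictured as a small decision function. The sketch below is illustrative only: sketch_type, is_quantized, and select_copy_path are hypothetical names, the returned labels are not identifiers from cpy.cpp, and only type combinations named on this page are covered.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the ggml type enum, limited to types named on this page.
enum class sketch_type { F32, F16, Q8_0, Q4_0, Q4_1 };

static bool is_quantized(sketch_type t) {
    return t == sketch_type::Q8_0 || t == sketch_type::Q4_0 || t == sketch_type::Q4_1;
}

// Returns an illustrative label for the copy path chosen for a (src, dst) pair.
static std::string select_copy_path(sketch_type src, sketch_type dst) {
    if (!is_quantized(src) && !is_quantized(dst)) {
        return "per-element";   // f32/f16 pairs via the strided kernel
    }
    if (src == sketch_type::F32 && is_quantized(dst)) {
        return "quantize";      // f32 -> q8_0 / q4_0 / q4_1
    }
    if (is_quantized(src) && src == dst) {
        return "block-copy";    // same-type quantized blocks
    }
    if (src == sketch_type::Q8_0 && dst == sketch_type::F32) {
        return "dequantize";    // q8_0 -> f32
    }
    return "unsupported";
}

// A few spot checks of the mapping above.
static void check_dispatch_examples() {
    assert(select_copy_path(sketch_type::F32, sketch_type::F16) == "per-element");
    assert(select_copy_path(sketch_type::F32, sketch_type::Q4_1) == "quantize");
}
```

Keying the choice on the (source type, destination type) pair keeps each kernel specialized for one conversion, at the cost of an explicit fallback for pairs that have no kernel.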

I/O Contract

Inputs

Name	Type	Required	Description
ctx	ggml_backend_sycl_context &	Yes	SYCL backend context providing the device queue
src0	const ggml_tensor *	Yes	Source tensor to copy from
src1	const ggml_tensor *	Yes	Destination tensor (for cpy) defining target shape and type

Outputs

Name	Type	Description
src1->data	void *	Destination buffer with copied and optionally type-converted data

Usage Examples

// Copy with format conversion (f32 -> f16):
// The backend dispatcher calls this when encountering a CPY node
ggml_sycl_cpy(sycl_ctx, f32_tensor, f16_tensor);

// Duplicate a tensor (same-type copy); only the destination is passed,
// and the source is assumed to be read from dst_tensor->src[0]:
ggml_sycl_dup(sycl_ctx, dst_tensor);

Related Pages

Implements Principle
