Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:ClickHouse ClickHouse SIMD Memory Comparison

From Leeroopedia


Knowledge Sources
Domains Memory, SIMD
Last Updated 2026-02-08 00:00 GMT

Overview

Accelerating memory comparison by processing 16 or more bytes simultaneously using vector instructions, combined with allowing safe overflow reads to eliminate edge case branches.

Description

SIMD memory comparison leverages vector processing units to compare multiple bytes in parallel. Instead of comparing byte-by-byte, the algorithm loads 16-byte chunks into SIMD registers and performs parallel equality comparisons. A key innovation is the "allow overflow" approach: by guaranteeing that memory is readable for 15 extra bytes beyond the valid region, the implementation can use aligned SIMD operations without complex edge case handling. This eliminates branches and simplifies the code while maintaining safety through allocator-level guarantees.

The principle combines two optimizations: parallel processing via SIMD and simplified control flow via overflow tolerance. Modern CPUs can compare 16 bytes in a single instruction that would otherwise require 16 comparisons and up to 16 branches. The bit mask output from SIMD comparison instructions directly indicates which bytes differ, enabling fast early-exit using count-trailing-zeros instructions.

Usage

Use SIMD memory comparison for comparing short strings (under 256 bytes), fixed-width database columns, hash values, UUIDs, and other small fixed-size data. This is the right choice when comparison is in a hot path (sorting, hash table lookups, deduplication). Choose standard `memcmp` for large regions where the algorithm difference matters less, or when the overflow guarantee cannot be provided.

Theoretical Basis

Performance Analysis:

Byte-by-byte comparison:

  • 1 load + 1 compare + 1 branch per byte
  • For 16 bytes: ~48 cycles (with branch mispredictions)
  • Early exit helps but unpredictable

SIMD comparison (SSE2/NEON):

  • 2 loads (16 bytes each) = ~2 cycles
  • 1 vector compare = ~1 cycle
  • 1 movemask + 1 test = ~2 cycles
  • Total: ~5 cycles for 16 bytes
  • Speedup: ~10x for equal strings, ~5x on average

SIMD comparison (AVX-512):

  • Same as SSE2 but with dedicated comparison mask instructions
  • Slightly faster: ~4 cycles for 16 bytes
  • Can process 64 bytes at once: ~8 cycles
  • Speedup: ~15x for longer strings

The Overflow Trick:

Traditional approach requires handling partial chunks: ``` while (size >= 16) {

   // Compare 16 bytes
   size -= 16;

} // Handle remaining 0-15 bytes with special code ```

With overflow allowance: ``` while (size >= 16) {

   // Compare 16 bytes
   size -= 16;

} // Compare final 16 bytes with overlap (may re-compare some bytes) // No special case needed! ```

The final comparison reads beyond the valid region but the allocator guarantees this memory is readable. This eliminates ~10-20 cycles of branch logic and special case handling.

Why It Works:

Memory allocators typically provide padding:

  • `malloc` allocates in chunks (often 16-byte aligned)
  • String allocators often round up to power of 2
  • Stack arrays often have padding
  • Most practical strings have exploitable padding

The implementation uses Memory Sanitizer annotations to mark overflow regions as valid, ensuring tools don't report false positives.

Optimization for Special Cases:

16-byte comparison (UUID, hash):

  • Single SIMD instruction
  • No loop overhead
  • 3-4 cycles total
  • 10-15x faster than byte-by-byte

Zero-checking:

  • Compare against zero vector
  • Same SIMD parallelism
  • Useful for detecting padding, cleared memory

Platform Variations:

x86 SSE2:

  • `_mm_cmpeq_epi8` + `_mm_movemask_epi8`
  • Available on all modern x86-64 CPUs
  • Produces 16-bit mask

x86 AVX-512:

  • `_mm_cmp_epi8_mask`
  • Direct mask comparison
  • More efficient instruction encoding

ARM NEON:

  • `vceqq_u8` + nibble mask extraction
  • Similar performance to SSE2
  • Different instruction encoding

Cache Efficiency:

Sequential memory access patterns play well with cache prefetchers. Reading 16 bytes at a time aligns with cache line boundaries (64 bytes), improving throughput. The SIMD registers don't use cache capacity, further improving efficiency.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment