Implementation:Lance format Lance Encoding Block Statistics

From Leeroopedia


Knowledge Sources
Domains Encoding, Columnar_Data
Last Updated 2026-02-08 19:33 GMT

Overview

The Statistics module computes and retrieves per-block statistics (bit width, data size, cardinality, run count, etc.) that drive compression selection decisions in the Lance encoding pipeline.

Description

When data is accumulated into DataBlock instances, the statistics module computes metrics that help the compression strategy pick the best algorithm. Statistics are stored in each block's BlockInfo (a thread-safe HashMap<Stat, Arc<dyn Array>>).

Available Statistics (Stat enum):

| Stat | Description | Computed For |
| --- | --- | --- |
| BitWidth | Maximum bit width per 1024-value chunk | FixedWidthDataBlock |
| DataSize | Total byte size of the data | FixedWidthDataBlock, VariableWidthBlock, OpaqueBlock |
| Cardinality | Approximate unique value count (HyperLogLog++) | VariableWidthBlock, FixedWidthDataBlock (64/128-bit) |
| FixedSize | Whether all values have the same length | (reserved) |
| NullCount | Number of null values | AllNullDataBlock |
| MaxLength | Maximum element byte length | VariableWidthBlock, FixedWidthDataBlock |
| RunCount | Number of distinct runs (for RLE analysis) | FixedWidthDataBlock |
| BytePositionEntropy | Entropy of byte position distribution (for BSS) | FixedWidthDataBlock |

Key Traits:

  • ComputeStat -- Implemented by data block types. The compute_stat method calculates all relevant statistics and stores them in the block's BlockInfo. This is called once during block construction.
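  • GetStat -- Retrieves a previously computed statistic by Stat key. get_stat panics if it is called before the block's statistics have been computed, and returns None for a statistic that is not available for that block type; expect_stat and expect_single_stat panic in that case instead. Some statistics (like cardinality for 64-bit fixed-width) are computed lazily on first access.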
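
The contract between the two traits can be illustrated with a reduced model that uses only the standard library. Every name prefixed with Toy is hypothetical and is not part of lance_encoding: the real BlockInfo holds Arc<dyn Array> values rather than plain u64, and the real Stat enum has more variants. The sketch only mirrors the behaviour described above (compute once, then look up by key).

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical, reduced stand-in for the `Stat` enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum ToyStat {
    DataSize,
    MaxLength,
    NullCount,
}

/// Stand-in for `BlockInfo`: a shared, lock-protected map of statistics.
/// Assumption: values are plain `u64` here instead of `Arc<dyn Array>`.
#[derive(Clone, Default)]
struct ToyBlockInfo(Arc<RwLock<HashMap<ToyStat, u64>>>);

/// Stand-in for a fixed-width data block.
struct ToyFixedWidthBlock {
    data: Vec<u8>,
    bytes_per_value: u64,
    info: ToyBlockInfo,
}

trait ToyComputeStat {
    /// Computes every statistic relevant to the block and stores it.
    fn compute_stat(&mut self);
}

trait ToyGetStat {
    /// Returns `None` when the statistic is not computed for this block type.
    fn get_stat(&self, stat: ToyStat) -> Option<u64>;

    /// Like `get_stat`, but panics when the statistic is missing.
    fn expect_stat(&self, stat: ToyStat) -> u64 {
        self.get_stat(stat)
            .unwrap_or_else(|| panic!("statistic {:?} is not available for this block", stat))
    }
}

impl ToyComputeStat for ToyFixedWidthBlock {
    fn compute_stat(&mut self) {
        let mut stats = self.info.0.write().unwrap();
        stats.insert(ToyStat::DataSize, self.data.len() as u64);
        stats.insert(ToyStat::MaxLength, self.bytes_per_value);
    }
}

impl ToyGetStat for ToyFixedWidthBlock {
    fn get_stat(&self, stat: ToyStat) -> Option<u64> {
        let stats = self.info.0.read().unwrap();
        // Mirrors the documented panic when lookup precedes computation.
        assert!(!stats.is_empty(), "get_stat called before compute_stat");
        stats.get(&stat).copied()
    }
}
```

In this model, a block holding 24 bytes of 4-byte values reports DataSize 24 and MaxLength 4 after compute_stat, while a lookup of NullCount yields None because the toy fixed-width block never stores it.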

Implementation Details:

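  • Bit width is computed per chunk of 1024 values using bitwise OR folding. OR-ing all values in a chunk does not yield the maximum value itself, but the result has the same highest set bit as the maximum, so the chunk's bit width can be derived from its leading zeros.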
  • Cardinality uses the HyperLogLog++ algorithm with precision 4 (2^4 = 16 registers) and xxhash3 for hashing. This gives an approximate distinct count at a small, fixed memory cost.
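  • Run count is the number of runs of consecutive equal values, obtained by counting transitions between neighbouring values. A low run count relative to the number of values indicates that RLE will compress well.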
  • Byte position entropy measures entropy of byte values at each byte position within fixed-width elements, used to determine whether byte-stream-split encoding would be effective.
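
Three of the computations above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in statistics.rs: the function names are hypothetical, values are fixed to u32 for the bit-width case, an all-zero chunk is reported as width 0 (how the real module treats that case is not stated here), and the entropy is returned as one Shannon entropy in bits per byte position, which may differ from the representation the module actually stores. Cardinality is left out because a faithful HyperLogLog++ is too long for a short example.

```rust
/// Chunk size taken from the description above.
const CHUNK: usize = 1024;

/// Bit width per chunk via OR folding.
/// The OR of all values has the same highest set bit as the chunk maximum.
fn bit_widths_per_chunk(values: &[u32]) -> Vec<u32> {
    values
        .chunks(CHUNK)
        .map(|chunk| {
            let folded = chunk.iter().fold(0u32, |acc, v| acc | *v);
            // Assumption: an all-zero chunk yields 0 in this sketch.
            u32::BITS - folded.leading_zeros()
        })
        .collect()
}

/// Number of runs of consecutive equal values (transitions + 1).
fn run_count<T: PartialEq>(values: &[T]) -> u64 {
    if values.is_empty() {
        return 0;
    }
    1 + values.windows(2).filter(|w| w[0] != w[1]).count() as u64
}

/// Shannon entropy (in bits) of the byte values seen at each byte position
/// of fixed-width elements stored back to back in `data`.
fn byte_position_entropy(data: &[u8], bytes_per_value: usize) -> Vec<f64> {
    assert!(bytes_per_value > 0, "bytes_per_value must be non-zero");
    let num_values = (data.len() / bytes_per_value) as f64;
    (0..bytes_per_value)
        .map(|pos| {
            // Histogram of the byte found at `pos` in every element.
            let mut counts = [0u64; 256];
            for value in data.chunks_exact(bytes_per_value) {
                counts[value[pos] as usize] += 1;
            }
            counts
                .iter()
                .filter(|&&c| c > 0)
                .map(|&c| {
                    let p = c as f64 / num_values;
                    -p * p.log2()
                })
                .sum()
        })
        .collect()
}
```

For the values 1, 1, 1, 2, 2, 3 the run count is 3. A byte position whose entropy is close to zero carries almost the same byte in every element, which is the situation in which splitting the data into per-position byte streams tends to help a downstream compressor.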

Usage

Use this module when:

  • Implementing a compression strategy that needs data statistics to choose algorithms
  • Accessing statistics from a DataBlock to make encoding decisions
  • Adding new statistics for a new compression algorithm
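
As an illustration of the first use case, the sketch below shows how a strategy might turn block statistics into an encoding choice. It is a toy: the types, the order of the checks, and every threshold are assumptions made up for this example and do not describe the selection logic Lance actually uses.

```rust
/// Hypothetical set of encodings a strategy could choose from.
#[derive(Debug, PartialEq)]
enum ToyEncoding {
    Rle,
    Bitpack,
    ByteStreamSplit,
    Plain,
}

/// Hypothetical snapshot of the statistics read from a fixed-width block.
#[derive(Debug, Clone, Copy)]
struct ToyStats {
    num_values: u64,
    bits_per_value: u32,
    /// Largest per-chunk bit width (from BitWidth).
    max_bit_width: u32,
    /// Number of runs (from RunCount).
    run_count: u64,
    /// Smallest per-position entropy in bits (from BytePositionEntropy).
    min_byte_position_entropy: f64,
}

/// Picks an encoding from statistics. All thresholds are illustrative.
fn choose_encoding(s: &ToyStats) -> ToyEncoding {
    if s.num_values == 0 {
        return ToyEncoding::Plain;
    }
    // Assumed rule: runs average at least two values, so RLE pays off.
    if s.run_count * 2 <= s.num_values {
        return ToyEncoding::Rle;
    }
    // Assumed rule: values need fewer bits than their storage width.
    if s.max_bit_width < s.bits_per_value {
        return ToyEncoding::Bitpack;
    }
    // Assumed rule: at least one byte position is highly predictable.
    if s.min_byte_position_entropy < 4.0 {
        return ToyEncoding::ByteStreamSplit;
    }
    ToyEncoding::Plain
}
```

With six 32-bit values in three runs, this toy rule set selects Rle; the same block with six runs and a maximum bit width of 2 falls through to Bitpack.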

Code Reference

Source Location: rust/lance-encoding/src/statistics.rs
Key Enum: Stat
Key Traits: ComputeStat, GetStat
Import: use lance_encoding::statistics::{Stat, ComputeStat, GetStat};

I/O Contract

ComputeStat Trait:

| Method | Input | Output | Notes |
| --- | --- | --- | --- |
| compute_stat(&mut self) | -- | -- | Populates block_info; must be called exactly once |

GetStat Trait:

| Method | Input | Output | Notes |
| --- | --- | --- | --- |
| get_stat(&self, stat) | Stat | Option<Arc<dyn Array>> | Returns None if stat not available for this block type |
| expect_stat(&self, stat) | Stat | Arc<dyn Array> | Panics if stat not available |
| expect_single_stat::<T>(&self, stat) | Stat | T::Native | Extracts single scalar value; panics if not exactly one value |

Statistics by Block Type:

| Block Type | Statistics Computed |
| --- | --- |
| FixedWidthDataBlock | DataSize, BitWidth, MaxLength, RunCount, BytePositionEntropy, Cardinality (64/128-bit values only) |
| VariableWidthBlock | DataSize, Cardinality, MaxLength |
| OpaqueBlock | DataSize |
| FixedSizeListBlock | Delegates to child (with MaxLength scaled by dimension) |
| AllNullDataBlock | NullCount, DataSize (always 0) |

Usage Examples

use lance_encoding::data::{BlockInfo, FixedWidthDataBlock};
use lance_encoding::buffer::LanceBuffer;
use lance_encoding::statistics::{ComputeStat, GetStat, Stat};
use arrow_array::types::UInt64Type;

// Create a data block and compute statistics
let values = vec![1u32, 1, 1, 2, 2, 3];
let buf = LanceBuffer::reinterpret_vec(values);
let mut block = FixedWidthDataBlock {
    data: buf,
    bits_per_value: 32,
    num_values: 6,
    block_info: BlockInfo::new(),
};

// Compute all statistics
block.compute_stat();

// Retrieve individual statistics
let run_count = block.expect_single_stat::<UInt64Type>(Stat::RunCount);
// run_count would be 3 (three distinct runs: [1,1,1], [2,2], [3])

let data_size = block.expect_single_stat::<UInt64Type>(Stat::DataSize);
// data_size would be 24 (6 values * 4 bytes)
