Implementation:Lance format Lance Encoding Block Statistics
| Knowledge Sources | |
|---|---|
| Domains | Encoding, Columnar_Data |
| Last Updated | 2026-02-08 19:33 GMT |
Overview
The Statistics module computes and retrieves per-block statistics (bit width, data size, cardinality, run count, etc.) that drive compression selection decisions in the Lance encoding pipeline.
Description
When data is accumulated into DataBlock instances, the statistics module computes metrics that help the compression strategy pick the best algorithm. Statistics are stored in each block's BlockInfo (a thread-safe HashMap<Stat, Arc<dyn Array>>).
Available Statistics (Stat enum):
| Stat | Description | Computed For |
|---|---|---|
| BitWidth | Maximum bit width per 1024-value chunk | FixedWidthDataBlock |
| DataSize | Total byte size of the data | FixedWidthDataBlock, VariableWidthBlock, OpaqueBlock |
| Cardinality | Approximate unique value count (HyperLogLog++) | VariableWidthBlock, FixedWidthDataBlock (64/128-bit) |
| FixedSize | Whether all values have the same length | (reserved) |
| NullCount | Number of null values | AllNullDataBlock |
| MaxLength | Maximum element byte length | VariableWidthBlock, FixedWidthDataBlock |
| RunCount | Number of distinct runs (for RLE analysis) | FixedWidthDataBlock |
| BytePositionEntropy | Entropy of byte position distribution (for BSS) | FixedWidthDataBlock |
Key Traits:
- ComputeStat -- Implemented by data block types. The compute_stat method calculates all relevant statistics and stores them in the block's BlockInfo. This is called once during block construction.
- GetStat -- Retrieves a previously computed statistic by Stat key. Panics if called before statistics are computed. Some statistics (like cardinality for 64-bit fixed-width) are computed lazily on first access.
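The compute-once, read-many pattern behind these two traits can be sketched with standard-library types only. This is a minimal illustration under stated assumptions, not the lance-encoding source: ToyBlock, Info, the u64-valued map (the real BlockInfo stores Arc<dyn Array> values), and the exact distinct count standing in for HyperLogLog++ are all hypothetical.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Stat {
    DataSize,
    RunCount,
    Cardinality,
    NullCount,
}

/// Hypothetical stand-in for BlockInfo: a shared, lock-protected stat map.
#[derive(Debug, Clone, Default)]
pub struct Info(pub Arc<RwLock<HashMap<Stat, u64>>>);

/// Hypothetical block type holding 64-bit values.
#[derive(Debug)]
pub struct ToyBlock {
    pub values: Vec<u64>,
    pub info: Info,
}

impl ToyBlock {
    pub fn new(values: Vec<u64>) -> Self {
        Self { values, info: Info::default() }
    }

    /// Eager pass: computes the cheap statistics once and stores them.
    pub fn compute_stat(&mut self) {
        let runs = match self.values.len() {
            0 => 0,
            _ => 1 + self.values.windows(2).filter(|w| w[0] != w[1]).count() as u64,
        };
        let mut map = self.info.0.write().unwrap();
        map.insert(Stat::DataSize, (self.values.len() * 8) as u64);
        map.insert(Stat::RunCount, runs);
    }

    /// Returns a stored statistic, computing cardinality lazily on first access.
    /// Panics if called before compute_stat, mirroring the behavior described above.
    pub fn get_stat(&self, stat: Stat) -> Option<u64> {
        let map = self.info.0.read().unwrap();
        assert!(!map.is_empty(), "get_stat called before compute_stat");
        let cached = map.get(&stat).copied();
        drop(map); // release the read lock before possibly taking the write lock
        match (cached, stat) {
            (Some(v), _) => Some(v),
            (None, Stat::Cardinality) => {
                // Assumption: an exact distinct count replaces the HyperLogLog++ estimate.
                let mut distinct = self.values.clone();
                distinct.sort_unstable();
                distinct.dedup();
                let count = distinct.len() as u64;
                self.info.0.write().unwrap().insert(stat, count);
                Some(count)
            }
            (None, _) => None, // statistic not available for this block type
        }
    }
}
```

In this sketch the lazily computed value is written back into the shared map, so a second request for the same statistic is a plain lookup.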
Implementation Details:
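- Bit width is computed per chunk of 1024 values using bitwise OR folding: OR-ing all values in a chunk yields a word whose highest set bit matches that of the chunk's maximum value, and the bit width is derived from that word's leading-zero count.
- Cardinality uses the HyperLogLog++ algorithm with precision 4 and xxhash3 for hashing. Precision 4 corresponds to 2^4 = 16 registers, so the distinct-count estimate is cheap to compute but coarse.
- Run count is the number of runs of consecutive equal values (one more than the number of transitions between adjacent values), giving insight into how well RLE will compress.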
- Byte position entropy measures entropy of byte values at each byte position within fixed-width elements, used to determine whether byte-stream-split encoding would be effective.
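Three of the computations above are small enough to sketch with the standard library. The functions below are illustrative only: their names and signatures, the u32 element type, the handling of empty or all-zero input, and reporting entropy in bits as f64 are assumptions, and the actual module may store these values differently.

```rust
/// Bit width needed for each chunk of `chunk_len` values, via OR folding.
/// The OR of a chunk has the same highest set bit as the chunk's maximum,
/// so no explicit maximum search is required.
pub fn chunk_bit_widths(values: &[u32], chunk_len: usize) -> Vec<u32> {
    values
        .chunks(chunk_len)
        .map(|chunk| {
            let folded = chunk.iter().fold(0u32, |acc, v| acc | v);
            // Assumption: an all-zero chunk is reported as width 0 here.
            u32::BITS - folded.leading_zeros()
        })
        .collect()
}

/// Number of runs of consecutive equal values: transitions + 1.
/// Assumption: an empty slice has zero runs.
pub fn run_count<T: PartialEq>(values: &[T]) -> u64 {
    if values.is_empty() {
        return 0;
    }
    1 + values.windows(2).filter(|w| w[0] != w[1]).count() as u64
}

/// Shannon entropy (in bits) of the byte values seen at each byte position
/// of fixed-width elements laid out back to back in `data`.
pub fn byte_position_entropy(data: &[u8], bytes_per_value: usize) -> Vec<f64> {
    assert!(bytes_per_value > 0 && data.len() % bytes_per_value == 0);
    let num_values = (data.len() / bytes_per_value) as f64;
    (0..bytes_per_value)
        .map(|pos| {
            // Histogram of the byte found at this position in every element.
            let mut counts = [0u64; 256];
            for value in data.chunks_exact(bytes_per_value) {
                counts[value[pos] as usize] += 1;
            }
            counts
                .iter()
                .filter(|&&c| c > 0)
                .map(|&c| {
                    let p = c as f64 / num_values;
                    -p * p.log2()
                })
                .sum::<f64>()
        })
        .collect()
}
```

A position whose entropy is near zero holds almost the same byte in every element, which is the situation where grouping bytes by position (byte-stream-split) tends to help a downstream compressor.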
Usage
Use this module when:
- Implementing a compression strategy that needs data statistics to choose algorithms
- Accessing statistics from a DataBlock to make encoding decisions
- Adding new statistics for a new compression algorithm
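As a rough illustration of the first use case, a strategy might branch on a few statistics as sketched below. Everything here is hypothetical: the BlockStats struct, the Choice enum, and in particular the thresholds are invented for the example and do not reflect the rules used by the default Lance compression strategy.

```rust
/// Hypothetical summary of statistics read from a fixed-width block.
#[derive(Debug, Clone, Copy)]
pub struct BlockStats {
    pub num_values: u64,
    pub bits_per_value: u64,
    pub max_bit_width: u64,
    pub run_count: u64,
}

/// Hypothetical set of encodings to choose from.
#[derive(Debug, PartialEq, Eq)]
pub enum Choice {
    Rle,
    BitPacking,
    Plain,
}

/// Picks an encoding from the statistics.
/// The thresholds are illustrative assumptions, not values from Lance.
pub fn choose(stats: &BlockStats) -> Choice {
    if stats.num_values > 0 && stats.run_count * 2 <= stats.num_values {
        // Few runs relative to the value count: run-length encoding pays off.
        Choice::Rle
    } else if stats.max_bit_width < stats.bits_per_value {
        // Values fit in fewer bits than their storage width.
        Choice::BitPacking
    } else {
        Choice::Plain
    }
}
```

The point of the sketch is the shape of the decision: statistics are read once from the block, and the strategy itself stays a pure function of those numbers.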
Code Reference
| Source Location | rust/lance-encoding/src/statistics.rs |
|---|---|
| Key Enum | Stat |
| Key Traits | ComputeStat, GetStat |
| Import | use lance_encoding::statistics::{Stat, ComputeStat, GetStat}; |
I/O Contract
ComputeStat Trait:
| Method | Input | Output | Notes |
|---|---|---|---|
| compute_stat(&mut self) | -- | -- | Populates block_info; must be called exactly once |
GetStat Trait:
| Method | Input | Output | Notes |
|---|---|---|---|
| get_stat(&self, stat) | Stat | Option<Arc<dyn Array>> | Returns None if stat not available for this block type |
| expect_stat(&self, stat) | Stat | Arc<dyn Array> | Panics if stat not available |
| expect_single_stat::<T>(&self, stat) | Stat | T::Native | Extracts single scalar value; panics if not exactly one value |
Statistics by Block Type:
| Block Type | Statistics Computed |
|---|---|
| FixedWidthDataBlock | DataSize, BitWidth, MaxLength, RunCount, BytePositionEntropy |
| VariableWidthBlock | DataSize, Cardinality, MaxLength |
| OpaqueBlock | DataSize |
| FixedSizeListBlock | Delegates to child (with MaxLength scaled by dimension) |
| AllNullDataBlock | NullCount, DataSize (always 0) |
Usage Examples
use lance_encoding::data::{BlockInfo, FixedWidthDataBlock};
use lance_encoding::buffer::LanceBuffer;
use lance_encoding::statistics::{ComputeStat, GetStat, Stat};
use arrow_array::types::UInt64Type;

// Create a data block holding six 32-bit values
let values = vec![1u32, 1, 1, 2, 2, 3];
let buf = LanceBuffer::reinterpret_vec(values);
let mut block = FixedWidthDataBlock {
    data: buf,
    bits_per_value: 32,
    num_values: 6,
    block_info: BlockInfo::new(),
};

// Compute all statistics (call once, before any get_stat / expect_* call)
block.compute_stat();

// Retrieve individual statistics
let run_count = block.expect_single_stat::<UInt64Type>(Stat::RunCount);
// run_count would be 3 (three runs: [1,1,1], [2,2], [3])

let data_size = block.expect_single_stat::<UInt64Type>(Stat::DataSize);
// data_size would be 24 (6 values * 4 bytes)
Related Pages
- Lance_format_Lance_DataBlock - Statistics are stored in each data block's BlockInfo
- Lance_format_Lance_Compression_Traits - The default compression strategy reads statistics to choose algorithms
- Lance_format_Lance_LanceBuffer - Data buffers from which statistics are computed