Heuristic:Lance format Lance Encoding Compression Thresholds
| Knowledge Sources | |
|---|---|
| Domains | Encoding, Optimization |
| Last Updated | 2026-02-08 19:00 GMT |
Overview
Lance's key encoding thresholds: a maximum miniblock size of just under 8 KiB (8,186 bytes), at most 4,096 values per miniblock, 1,024 elements per bitpacking chunk, and a 4 KiB minimum buffer size for compression.
Description
Lance uses a custom columnar encoding system with multiple physical encodings (miniblock, bitpacking, RLE, FSST, Zstd, LZ4). Each encoding has carefully tuned size thresholds that balance compression ratio, decode speed, and disk I/O alignment. These constants are critical for file format compatibility and should not be changed without understanding the implications.
Usage
Apply this heuristic when studying Lance file format internals, debugging encoding issues, or evaluating compression effectiveness. Most of these are internal constants that are not user-configurable (the binary minichunk size is an exception), but understanding them helps diagnose performance issues related to data layout.
The Insight (Rule of Thumb)
Miniblock Encoding
- MAX_MINIBLOCK_BYTES: 8,186 bytes (8 KiB - 6 bytes)
- Rationale: Fits within about two disk sectors (assuming typical 4 KiB sectors). Total chunk (values + repetition/definition levels) stays under 24 KiB.
- MAX_MINIBLOCK_VALUES: 4,096 values per miniblock
- Rationale: Power-of-2 constraint for efficient indexing and alignment.
- MINIBLOCK_ALIGNMENT: 8 bytes
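To make the interaction between the two miniblock caps concrete, here is a minimal sketch. The constants are the ones quoted in this page; `max_values_per_miniblock` is a hypothetical helper, not a Lance function, and it assumes uncompressed fixed-width values with no repetition/definition levels.

```rust
/// Constants as quoted in this page.
pub const MAX_MINIBLOCK_BYTES: u64 = 8 * 1024 - 6; // 8,186 bytes
pub const MAX_MINIBLOCK_VALUES: u64 = 4096;

/// Hypothetical helper (not part of Lance): upper bound on how many
/// fixed-width values fit in one miniblock when both caps apply.
/// Assumes values are stored uncompressed at `bytes_per_value` each.
pub fn max_values_per_miniblock(bytes_per_value: u64) -> u64 {
    assert!(bytes_per_value > 0, "value width must be non-zero");
    // Whichever limit is hit first wins: the byte cap or the value cap.
    (MAX_MINIBLOCK_BYTES / bytes_per_value).min(MAX_MINIBLOCK_VALUES)
}
```

Under these assumptions the value cap binds only for 1-byte values (4,096 values). For wider values the byte cap binds first, for example 1,023 values at 8 bytes each.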
Bitpacking
- ELEMS_PER_CHUNK: 1,024 values per chunk (LOG_ELEMS_PER_CHUNK = 10)
- Rationale: Each chunk has constant bit width. 1,024 is a sweet spot for cache locality and compression effectiveness.
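The "constant bit width per chunk" idea can be illustrated with a short sketch. The two constants match the ones quoted in this page; `chunk_bit_width` and `packed_chunk_bytes` are hypothetical helpers written for illustration and are not Lance's bitpacking implementation.

```rust
/// Constants as quoted in this page.
pub const LOG_ELEMS_PER_CHUNK: usize = 10;
pub const ELEMS_PER_CHUNK: usize = 1 << LOG_ELEMS_PER_CHUNK; // 1024

/// Hypothetical helper: the smallest bit width able to represent every
/// value in the chunk. An empty or all-zero chunk needs 0 bits.
pub fn chunk_bit_width(chunk: &[u32]) -> u32 {
    let max = chunk.iter().copied().max().unwrap_or(0);
    32 - max.leading_zeros()
}

/// Hypothetical helper: bytes needed to store one full chunk when every
/// value is packed at `bit_width` bits. 1024 * w / 8 is always exact.
pub fn packed_chunk_bytes(bit_width: u32) -> usize {
    ELEMS_PER_CHUNK * bit_width as usize / 8
}
```

For example, a chunk holding the values 0 through 1,023 needs 10 bits per value, so the packed chunk occupies 1,280 bytes instead of the 4,096 bytes taken by raw `u32` values.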
Compression Thresholds
- MIN_BUFFER_SIZE_FOR_COMPRESSION: 4 KiB
- Rationale: Buffers smaller than 4 KiB skip compression because the overhead is not worth the savings.
- DICT_SIZE_RATIO: 0.8 (80%)
- Rationale: Use dictionary encoding only if encoded size < 80% of raw size.
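The two thresholds above amount to simple predicates. The following is a sketch only: the constant names come from this page, but their Rust types and the helper functions `worth_compressing` and `worth_dictionary_encoding` are assumptions made for illustration, and the exact comparison Lance performs may differ.

```rust
/// Names as listed in this page; the types are assumed.
pub const MIN_BUFFER_SIZE_FOR_COMPRESSION: usize = 4 * 1024;
pub const DICT_SIZE_RATIO: f64 = 0.8;

/// Hypothetical predicate: buffers smaller than 4 KiB skip compression.
pub fn worth_compressing(buffer_len: usize) -> bool {
    buffer_len >= MIN_BUFFER_SIZE_FOR_COMPRESSION
}

/// Hypothetical predicate: dictionary encoding is chosen only when the
/// encoded size is below 80% of the raw size.
pub fn worth_dictionary_encoding(encoded_size: usize, raw_size: usize) -> bool {
    (encoded_size as f64) < DICT_SIZE_RATIO * raw_size as f64
}
```

With these definitions a 4,095-byte buffer is left uncompressed, and a dictionary encoding that reduces 1,000 bytes to 900 is rejected because it saves less than 20%.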
Binary Data
- DEFAULT_AIM_MINICHUNK_SIZE: 4 KiB (configurable via `LANCE_BINARY_MINIBLOCK_CHUNK_SIZE` env var)
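An environment-variable override of this kind is typically read with a fallback to the default. The sketch below assumes the variable holds a byte count as a decimal integer. How Lance parses and validates it is not stated here, and `parse_minichunk_size` and `minichunk_size_from_env` are hypothetical names.

```rust
use std::env;

/// Default as listed in this page; the integer type is assumed.
pub const DEFAULT_AIM_MINICHUNK_SIZE: u64 = 4 * 1024;

/// Hypothetical parser: falls back to the default when the value is
/// missing, not a positive integer, or zero.
pub fn parse_minichunk_size(raw: Option<&str>) -> u64 {
    raw.and_then(|s| s.trim().parse::<u64>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(DEFAULT_AIM_MINICHUNK_SIZE)
}

/// Hypothetical wrapper that reads the environment variable named above.
pub fn minichunk_size_from_env() -> u64 {
    let raw = env::var("LANCE_BINARY_MINIBLOCK_CHUNK_SIZE").ok();
    parse_minichunk_size(raw.as_deref())
}
```

Keeping the parsing separate from the environment lookup makes the fallback behavior easy to test without mutating process state.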
RLE (Run-Length Encoding)
- 2,048 value cap per chunk (workaround for issue #4429)
- Rationale: Prevents memory issues with highly repetitive data.
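One plausible reading of the cap is that no RLE chunk covers more than 2,048 values, so a long run is split at chunk boundaries. The sketch below illustrates that reading only. The constant name `MAX_RLE_CHUNK_VALUES` and the function `rle_chunks` are hypothetical, and Lance's actual RLE layout is not shown in this page.

```rust
/// Cap quoted in this page; the constant name is hypothetical.
pub const MAX_RLE_CHUNK_VALUES: usize = 2048;

/// Run-length encode `values` as (value, run_length) pairs, starting a
/// new chunk every MAX_RLE_CHUNK_VALUES values so that no run crosses
/// a chunk boundary.
pub fn rle_chunks(values: &[u32]) -> Vec<Vec<(u32, u32)>> {
    values
        .chunks(MAX_RLE_CHUNK_VALUES)
        .map(|chunk| {
            let mut runs: Vec<(u32, u32)> = Vec::new();
            for &v in chunk {
                // Extend the current run if the value repeats.
                if let Some(last) = runs.last_mut() {
                    if last.0 == v {
                        last.1 += 1;
                        continue;
                    }
                }
                // Otherwise start a new run of length 1.
                runs.push((v, 1));
            }
            runs
        })
        .collect()
}
```

Under this reading, 5,000 identical values become three chunks with runs of 2,048, 2,048, and 904, which bounds the number of values any single chunk expands to during decode.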
Blob Data
- TARGET_SHARD_SIZE: 32 MiB
- Rationale: Balances memory usage vs I/O request frequency for large binary objects.
Reasoning
The miniblock cap of roughly 8 KiB is chosen with disk sector alignment in mind: with typical 4 KiB sectors, a miniblock spans about two sectors. Keeping miniblocks small allows efficient random access within a page without reading excess data. The 4,096-value limit per miniblock keeps memory usage during decode predictable. The 4 KiB compression threshold avoids paying for compression headers and algorithm initialization on tiny buffers, where the savings would be negligible. The 0.8 dictionary ratio means dictionary encoding is used only when it shrinks the data by more than 20%.
Code Evidence
Miniblock constants from `rust/lance-encoding/src/encodings/logical/primitive/miniblock.rs:19-20`:
pub const MAX_MINIBLOCK_BYTES: u64 = 8 * 1024 - 6;
pub const MAX_MINIBLOCK_VALUES: u64 = 4096;
Bitpacking chunk size from `rust/lance-encoding/src/encodings/physical/bitpacking.rs`:
pub const LOG_ELEMS_PER_CHUNK: usize = 10;
pub const ELEMS_PER_CHUNK: usize = 1 << LOG_ELEMS_PER_CHUNK; // 1024
Blob target shard size from `rust/lance-encoding/src/encodings/logical/blob.rs`:
const TARGET_SHARD_SIZE: usize = 32 * 1024 * 1024; // 32 MiB