Implementation:Lance format Lance LegacyDictionaryEncoding
| Knowledge Sources | |
|---|---|
| Domains | Encoding, Legacy_Format |
| Last Updated | 2026-02-08 19:33 GMT |
Overview
The legacy dictionary encoding stores string and binary data using dictionary compression with separate indices and items buffers in the Lance v2.0 format.
Description
⚠️ DEPRECATED: This is legacy code from the Lance v1/v2.0 format, retained only for backward compatibility. See Lance_format_Lance_Warning_Deprecated_Legacy_Encodings.
This module implements dictionary encoding for the legacy (v2.0) Lance file format.
On the read side, DictionaryPageScheduler decodes a page by scheduling I/O for two buffers: an indices buffer and an items (dictionary) buffer. All dictionary items are decoded during scheduling; at decode time the indices are used to look up values in that dictionary. The scheduler supports two modes:
- should_decode_dict=true expands dictionary entries back to their original string values.
- should_decode_dict=false preserves the dictionary encoding in the output (for DataType::Dictionary output types). DirectDictionaryPageDecoder serves this mode, producing DictionaryDataBlock output without expansion.
On the write side, DictionaryEncoder builds a dictionary from string arrays, using a HashMap for deduplication, and produces UInt8 indices plus the dictionary items. AlreadyDictionaryEncoder handles data that is already dictionary-encoded: it passes the indices through and re-encodes the dictionary values.
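The encode and lookup steps described above can be sketched with the standard library alone. This is a minimal illustration, not the module's implementation: `dict_encode` and `dict_decode` are hypothetical names, plain `Vec`s stand in for the Arrow arrays and buffers the real encoder uses, and null handling is ignored.

```rust
use std::collections::HashMap;

/// Build a dictionary from `values`, returning `(indices, items)`.
/// Returns `None` when more than 256 distinct values appear, because a
/// u8 index cannot address them (hypothetical helper, not a Lance API).
fn dict_encode(values: &[&str]) -> Option<(Vec<u8>, Vec<String>)> {
    let mut seen: HashMap<&str, u8> = HashMap::new();
    let mut items: Vec<String> = Vec::new();
    let mut indices: Vec<u8> = Vec::with_capacity(values.len());
    for &v in values {
        // Reuse the index of a value that is already in the dictionary.
        if let Some(&i) = seen.get(v) {
            indices.push(i);
            continue;
        }
        // First occurrence: append to the dictionary if an index is free.
        if items.len() > u8::MAX as usize {
            return None;
        }
        let i = items.len() as u8;
        seen.insert(v, i);
        items.push(v.to_string());
        indices.push(i);
    }
    Some((indices, items))
}

/// Expand indices back to values by looking each one up in `items`.
fn dict_decode(indices: &[u8], items: &[String]) -> Vec<String> {
    indices.iter().map(|&i| items[i as usize].clone()).collect()
}
```

Encoding `["a", "b", "a"]` with this sketch yields indices `[0, 1, 0]` and items `["a", "b"]`; decoding those restores the original sequence.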
Usage
Use this encoding for low-cardinality string columns where dictionary compression provides significant space savings. The CoreArrayEncodingStrategy selects dictionary encoding when cardinality analysis (using HyperLogLog++) shows the column has fewer than 100 unique values. On the read path, the physical dispatch creates a DictionaryPageScheduler from the Dictionary protobuf encoding.
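The selection rule can be illustrated roughly as follows. Note the swap: this sketch counts distinct values exactly with a `HashSet` instead of estimating them with HyperLogLog++, and both `should_dictionary_encode` and `DICT_CARDINALITY_THRESHOLD` are hypothetical names chosen for the example. Only the "fewer than 100 unique values" rule comes from the text above.

```rust
use std::collections::HashSet;

/// Hypothetical constant mirroring the "fewer than 100 unique values" rule.
const DICT_CARDINALITY_THRESHOLD: usize = 100;

/// Returns true when `values` has fewer than the threshold number of
/// distinct entries. Exact counting stands in for the HyperLogLog++
/// estimate; the loop exits early once the threshold is reached.
fn should_dictionary_encode(values: &[&str]) -> bool {
    let mut distinct: HashSet<&str> = HashSet::new();
    for &v in values {
        distinct.insert(v);
        if distinct.len() >= DICT_CARDINALITY_THRESHOLD {
            return false;
        }
    }
    true
}
```

Exact counting keeps the sketch dependency-free; an estimator trades a small error for bounded memory on large columns.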
Code Reference
Source Location
rust/lance-encoding/src/previous/encodings/physical/dictionary.rs
Signature
pub struct DictionaryPageScheduler {
    indices_scheduler: Arc<dyn PageScheduler>,
    items_scheduler: Arc<dyn PageScheduler>,
    num_dictionary_items: u32,
    should_decode_dict: bool,
}

impl DictionaryPageScheduler {
    pub fn new(
        indices_scheduler: Arc<dyn PageScheduler>,
        items_scheduler: Arc<dyn PageScheduler>,
        num_dictionary_items: u32,
        should_decode_dict: bool,
    ) -> Self;
}

pub struct DictionaryEncoder { /* fields omitted */ }
pub struct AlreadyDictionaryEncoder { /* fields omitted */ }
Import
use lance_encoding::previous::encodings::physical::dictionary::{
    DictionaryPageScheduler, DictionaryEncoder, AlreadyDictionaryEncoder,
};
I/O Contract
| Input | Type | Description |
|---|---|---|
| indices_scheduler | Arc<dyn PageScheduler> | Scheduler for the dictionary index buffer |
| items_scheduler | Arc<dyn PageScheduler> | Scheduler for the dictionary items (values) buffer |
| num_dictionary_items | u32 | Number of unique entries in the dictionary |
| should_decode_dict | bool | Whether to expand dictionary to value type or keep encoded |
| data | DataBlock | Variable-width data to encode into a dictionary |
| Output | Type | Description |
|---|---|---|
| decoded | DataBlock | Expanded variable-width or dictionary data block |
| encoded | EncodedArray | Dictionary-encoded indices and items with descriptor |
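To make the effect of should_decode_dict on the decoded output concrete, here is a small standard-library sketch. The `Decoded` enum and `decode_page` function are hypothetical stand-ins for the real DataBlock variants and page decoders; only the two behaviors (expand versus keep encoded) come from this page.

```rust
/// Hypothetical stand-in for the two output shapes: expanded values,
/// or indices plus items left in dictionary form.
#[derive(Debug, PartialEq)]
enum Decoded {
    Expanded(Vec<String>),
    Dictionary { indices: Vec<u8>, items: Vec<String> },
}

/// Produce the output for one page according to `should_decode_dict`.
fn decode_page(indices: &[u8], items: &[String], should_decode_dict: bool) -> Decoded {
    if should_decode_dict {
        // Look up every index and materialize the original values.
        Decoded::Expanded(indices.iter().map(|&i| items[i as usize].clone()).collect())
    } else {
        // Keep the dictionary encoding intact for dictionary-typed output.
        Decoded::Dictionary {
            indices: indices.to_vec(),
            items: items.to_vec(),
        }
    }
}
```

Keeping the dictionary form avoids materializing repeated strings when the consumer wants a dictionary-typed column anyway.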
Usage Examples
use lance_encoding::previous::encodings::physical::dictionary::DictionaryPageScheduler;
use lance_encoding::decoder::PageScheduler;
use lance_encoding::EncodingsIo; // assumed to be exported at the crate root
use std::sync::Arc;

// Create a dictionary page scheduler from the two child schedulers
let indices_scheduler: Arc<dyn PageScheduler> = /* from dispatch */;
let items_scheduler: Arc<dyn PageScheduler> = /* from dispatch */;
let scheduler = DictionaryPageScheduler::new(
    indices_scheduler,
    items_scheduler,
    100,  // num_dictionary_items
    true, // should_decode_dict: expand the dictionary to string values
);

// Schedule the first 500 rows of the page; the returned future resolves to a page decoder
let ranges = vec![0..500];
let io: Arc<dyn EncodingsIo> = /* from context */;
let decoder_fut = scheduler.schedule_ranges(&ranges, &io, 0);
Related Pages
- Lance_format_Lance_LegacyPhysicalDispatch - Creates DictionaryPageScheduler from protobuf
- Lance_format_Lance_LegacyEncoder - Strategy that selects dictionary encoding based on cardinality
- Lance_format_Lance_LegacyBinaryEncoding - Alternative encoding for high-cardinality strings
- Lance_format_Lance_LegacyValueEncoding - Encodes the dictionary indices buffer
- Lance_format_Lance_LegacyPrimitiveEncoding - Used to decode dictionary items
- Heuristic:Lance_format_Lance_Warning_Deprecated_Legacy_Encodings