
Implementation:Lance format Lance LegacyDictionaryEncoding

From Leeroopedia


Knowledge Sources
Domains: Encoding, Legacy_Format
Last Updated: 2026-02-08 19:33 GMT

Overview

The legacy dictionary encoding stores string and binary data using dictionary compression with separate indices and items buffers in the Lance v2.0 format.

Description

⚠️ DEPRECATED: This is legacy code from the Lance v1/v2.0 format, retained only for backward compatibility. See Lance_format_Lance_Warning_Deprecated_Legacy_Encodings.

This module implements dictionary encoding for the legacy (v2.0) Lance file format.

On the decoding side, DictionaryPageScheduler decodes pages by scheduling I/O for two buffers: an indices buffer and an items (dictionary) buffer. All dictionary items are decoded during scheduling; at decode time the indices are used to look up values in the decoded items. The scheduler supports two modes. With should_decode_dict=true, dictionary entries are expanded back to their original string values. With should_decode_dict=false, the dictionary encoding is preserved in the output (for DataType::Dictionary output types). DirectDictionaryPageDecoder serves the second mode by producing DictionaryDataBlock output without expansion.

On the encoding side, DictionaryEncoder builds a dictionary from string arrays, using a HashMap for deduplication, and produces UInt8 indices together with the dictionary items. AlreadyDictionaryEncoder handles data that is already dictionary-encoded: it passes the indices through and re-encodes the dictionary values.
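The HashMap-based deduplication on the encoding side can be illustrated with a minimal, standalone sketch. It uses only the Rust standard library, not the Lance types, and the function name dict_encode is hypothetical. It also ignores nulls and simply gives up when the distinct values no longer fit in a u8 index, which may differ from what the real DictionaryEncoder does.

```rust
use std::collections::HashMap;
use std::convert::TryFrom;

/// Dictionary-encode a slice of strings into (indices, items).
/// Hypothetical helper for illustration only; not part of lance-encoding.
/// Returns None when a new distinct value would need an index above u8::MAX.
fn dict_encode<'a>(values: &[&'a str]) -> Option<(Vec<u8>, Vec<String>)> {
    // Maps each distinct value to the index it was assigned.
    let mut seen: HashMap<&'a str, u8> = HashMap::new();
    // Distinct values in first-seen order: the dictionary itself.
    let mut items: Vec<String> = Vec::new();
    // One index per input row.
    let mut indices: Vec<u8> = Vec::with_capacity(values.len());

    for &v in values {
        if let Some(&idx) = seen.get(v) {
            // Value already in the dictionary: reuse its index.
            indices.push(idx);
            continue;
        }
        // New distinct value: its index is the current dictionary size.
        let idx = u8::try_from(items.len()).ok()?;
        seen.insert(v, idx);
        items.push(v.to_string());
        indices.push(idx);
    }
    Some((indices, items))
}
```

Calling dict_encode(&["a", "b", "a"]) yields two dictionary items and three one-byte indices, which is where the space saving for repetitive columns comes from.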

Usage

Use this encoding for low-cardinality string columns, where dictionary compression provides significant space savings. The CoreArrayEncodingStrategy selects dictionary encoding when cardinality analysis (using HyperLogLog++) shows that the column has fewer than 100 unique values. During reading, the physical dispatch creates a DictionaryPageScheduler from the Dictionary protobuf encoding.
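The selection rule above can be sketched as a simple cardinality check. This is an illustration under stated assumptions, not the CoreArrayEncodingStrategy code: it counts distinct values exactly with a HashSet instead of estimating them with HyperLogLog++, and the names DICT_CARDINALITY_THRESHOLD and should_dictionary_encode are hypothetical.

```rust
use std::collections::HashSet;

/// Hypothetical constant mirroring the "fewer than 100 unique values" rule.
const DICT_CARDINALITY_THRESHOLD: usize = 100;

/// Returns true when the column has fewer than `threshold` distinct values.
/// Exact counting stands in for the HyperLogLog++ estimate; the scan stops
/// as soon as the threshold is reached.
fn should_dictionary_encode(values: &[&str], threshold: usize) -> bool {
    let mut distinct: HashSet<&str> = HashSet::new();
    for &v in values {
        distinct.insert(v);
        if distinct.len() >= threshold {
            return false;
        }
    }
    true
}
```

An exact count is fine for a sketch, but its memory grows with the number of distinct values; a sketch-based estimator such as HyperLogLog++ keeps memory bounded at the cost of a small estimation error.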

Code Reference

Source Location

rust/lance-encoding/src/previous/encodings/physical/dictionary.rs

Signature

pub struct DictionaryPageScheduler {
    indices_scheduler: Arc<dyn PageScheduler>,
    items_scheduler: Arc<dyn PageScheduler>,
    num_dictionary_items: u32,
    should_decode_dict: bool,
}

impl DictionaryPageScheduler {
    pub fn new(
        indices_scheduler: Arc<dyn PageScheduler>,
        items_scheduler: Arc<dyn PageScheduler>,
        num_dictionary_items: u32,
        should_decode_dict: bool,
    ) -> Self;
}

pub struct DictionaryEncoder { /* fields omitted */ }
pub struct AlreadyDictionaryEncoder { /* fields omitted */ }

Import

use lance_encoding::previous::encodings::physical::dictionary::{
    DictionaryPageScheduler, DictionaryEncoder, AlreadyDictionaryEncoder,
};

I/O Contract

Input | Type | Description
indices_scheduler | Arc<dyn PageScheduler> | Scheduler for the dictionary index buffer
items_scheduler | Arc<dyn PageScheduler> | Scheduler for the dictionary items (values) buffer
num_dictionary_items | u32 | Number of unique entries in the dictionary
should_decode_dict | bool | Whether to expand the dictionary to the value type or keep it encoded
data | DataBlock | Variable-width data to encode into a dictionary

Output | Type | Description
decoded | DataBlock | Expanded variable-width or dictionary data block
encoded | EncodedArray | Dictionary-encoded indices and items with descriptor
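The two shapes of the decoded output, selected by should_decode_dict, can be modeled with a small standalone sketch. The Decoded enum and decode_page function are hypothetical stand-ins for the real DataBlock variants and page decoders, and the sketch assumes every index is a valid position in the items list.

```rust
/// Simplified stand-in for the decoded output; not the Lance DataBlock type.
#[derive(Debug, PartialEq)]
enum Decoded {
    /// should_decode_dict = true: one materialized value per row.
    Expanded(Vec<String>),
    /// should_decode_dict = false: indices and items kept separate.
    Dictionary { indices: Vec<u8>, items: Vec<String> },
}

/// Hypothetical decode step operating on already-loaded indices and items.
fn decode_page(indices: &[u8], items: &[String], should_decode_dict: bool) -> Decoded {
    if should_decode_dict {
        // Look up each index in the dictionary and copy the value out.
        let values = indices
            .iter()
            .map(|&i| items[i as usize].clone())
            .collect();
        Decoded::Expanded(values)
    } else {
        // Preserve the dictionary form without expanding it.
        Decoded::Dictionary {
            indices: indices.to_vec(),
            items: items.to_vec(),
        }
    }
}
```

Keeping the dictionary form avoids materializing one string per row, which is useful when the consumer expects a dictionary-typed column anyway.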

Usage Examples

use lance_encoding::previous::encodings::physical::dictionary::DictionaryPageScheduler;
use lance_encoding::decoder::PageScheduler;
// Assumed import path for the I/O trait used below; adjust to your Lance version
use lance_encoding::EncodingsIo;
use std::sync::Arc;

// Create a dictionary page scheduler from the two child schedulers
let indices_scheduler: Arc<dyn PageScheduler> = /* from dispatch */;
let items_scheduler: Arc<dyn PageScheduler> = /* from dispatch */;
let scheduler = DictionaryPageScheduler::new(
    indices_scheduler,
    items_scheduler,
    100,     // num_dictionary_items
    true,    // decode dictionary to string values
);

// Schedule the first 500 rows of the page
let ranges = vec![0..500];
let io: Arc<dyn EncodingsIo> = /* from context */;
// The result is a future; await it to obtain the page decoder
let decoder_fut = scheduler.schedule_ranges(&ranges, &io, 0);
