Implementation:Lance format Lance CoreEncoder
| Knowledge Sources | |
|---|---|
| Domains | Encoding, Columnar_Data |
| Last Updated | 2026-02-08 19:33 GMT |
Overview
The CoreEncoder module defines the top-level encoding traits and strategies (FieldEncoder, FieldEncodingStrategy, StructuralEncodingStrategy) for converting Arrow arrays into Lance's encoded page format, supporting both legacy (2.0) and structural (2.1+) encoding approaches.
Description
Lance file encoding is driven by a FieldEncodingStrategy, which chooses the encoder to use for each field in the schema. The current default for 2.1+ files is StructuralEncodingStrategy, which builds a tree of encoders that mirrors the structure of the data:
- Struct encoders strip off validity and delegate to child encoders
- List encoders strip off offsets and delegate to child encoders
- Primitive leaf encoders accumulate validity, offsets, and values, then use miniblock or fullzip encoding to create pages
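The tree shape described above can be illustrated with a minimal, self-contained sketch. Everything here (`ToyType`, `ToyEncoder`, `build_encoder`, `num_columns`) is hypothetical and stands in for the real Lance types. It also assumes that only primitive leaves are assigned an output column, which may not hold for every file version.

```rust
/// Simplified stand-in for an Arrow data type (hypothetical, not a Lance type).
#[derive(Debug, Clone)]
pub enum ToyType {
    Int32,
    Utf8,
    List(Box<ToyType>),
    Struct(Vec<(String, ToyType)>),
}

/// Toy encoder tree whose shape mirrors the data type.
#[derive(Debug, PartialEq)]
pub enum ToyEncoder {
    /// Leaf: would accumulate validity, offsets, and values.
    Primitive { column: u32 },
    /// Would strip offsets and delegate to the item encoder.
    List(Box<ToyEncoder>),
    /// Would strip validity and delegate to one encoder per child field.
    Struct(Vec<ToyEncoder>),
}

/// Build the tree, handing out the next column index to each leaf
/// (assumption: only leaves own a column).
pub fn build_encoder(ty: &ToyType, next_column: &mut u32) -> ToyEncoder {
    match ty {
        ToyType::Int32 | ToyType::Utf8 => {
            let column = *next_column;
            *next_column += 1;
            ToyEncoder::Primitive { column }
        }
        ToyType::List(item) => ToyEncoder::List(Box::new(build_encoder(item, next_column))),
        ToyType::Struct(children) => ToyEncoder::Struct(
            children
                .iter()
                .map(|(_, child)| build_encoder(child, next_column))
                .collect(),
        ),
    }
}

/// Count the leaf columns underneath an encoder.
pub fn num_columns(enc: &ToyEncoder) -> u32 {
    match enc {
        ToyEncoder::Primitive { .. } => 1,
        ToyEncoder::List(item) => num_columns(item),
        ToyEncoder::Struct(children) => children.iter().map(num_columns).sum(),
    }
}
```

In this sketch, a struct with an integer child and a list-of-strings child yields a `Struct` node with two leaves, so one top-level field maps to two columns.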
Key Types:
- `FieldEncoder` (trait) -- Buffers incoming Arrow arrays and emits `EncodeTask` futures when enough data accumulates for a page. A single field may map to multiple output columns (e.g., struct fields).
- `FieldEncodingStrategy` (trait) -- Factory for creating `FieldEncoder` instances based on field type and metadata.
- `StructuralEncodingStrategy` -- The default strategy for 2.1+ files. Delegates compression to a `CompressionStrategy`.
- `EncodedPage` -- A page of encoded data containing buffers, encoding description, row count, and column index.
- `EncodedColumn` -- Column-level result containing column buffers, encoding metadata, and all pages.
- `OutOfLineBuffers` -- Tracks buffer positions for data stored outside of pages (e.g., large binary encoding).
- `EncodingOptions` -- Controls cache size per column (default 8 MiB), max page size (default 32 MiB), buffer alignment (default 64 bytes), and file version.
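To make "tracks buffer positions" concrete, here is a minimal sketch of a position tracker for out-of-page buffers. `ToyOutOfLineBuffers` and its methods are hypothetical and are not the real `OutOfLineBuffers` API. The sketch assumes each buffer is padded so that the next one starts on an aligned boundary.

```rust
/// Toy tracker for buffers written outside of pages
/// (hypothetical; not the real `OutOfLineBuffers` API).
pub struct ToyOutOfLineBuffers {
    position: u64,
    alignment: u64,
    buffers: Vec<Vec<u8>>,
}

impl ToyOutOfLineBuffers {
    /// Start tracking at `base_position` with the given alignment in bytes.
    pub fn new(base_position: u64, alignment: u64) -> Self {
        assert!(alignment > 0, "alignment must be non-zero");
        Self {
            position: base_position,
            alignment,
            buffers: Vec::new(),
        }
    }

    /// Record a buffer and return the position where it would start.
    pub fn add_buffer(&mut self, buffer: Vec<u8>) -> u64 {
        let start = self.position;
        let end = start + buffer.len() as u64;
        // Round up so the next buffer starts on an aligned boundary.
        self.position = (end + self.alignment - 1) / self.alignment * self.alignment;
        self.buffers.push(buffer);
        start
    }

    /// Hand the collected buffers to the writer.
    pub fn take_buffers(self) -> Vec<Vec<u8>> {
        self.buffers
    }
}
```

The returned start position is what an encoder would record in its metadata so a reader can later locate the buffer.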
Encoding Flow:
1. `FieldEncodingStrategy::create_field_encoder` creates an encoder for a schema field
2. For each batch of data, `FieldEncoder::maybe_encode` buffers the data and may return encode tasks
3. When enough data is buffered, tasks are spawned to produce `EncodedPage` instances
4. `FieldEncoder::flush` emits remaining pages
5. `FieldEncoder::finish` returns final column metadata
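The buffer, flush, and finish lifecycle can be sketched with a small self-contained toy. `ToyFieldEncoder` and `ToyPage` are hypothetical names. Unlike the real trait, this sketch returns pages directly instead of asynchronous encode tasks, and it treats the page count as the only column metadata.

```rust
/// Toy page: the rows it holds and the row number of its first row
/// (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct ToyPage {
    pub row_number: u64,
    pub values: Vec<i32>,
}

/// Toy encoder that follows the buffer / flush / finish lifecycle.
pub struct ToyFieldEncoder {
    cache_bytes: usize,
    buffered: Vec<i32>,
    next_row: u64,
    pages_emitted: usize,
}

impl ToyFieldEncoder {
    pub fn new(cache_bytes: usize) -> Self {
        Self {
            cache_bytes,
            buffered: Vec::new(),
            next_row: 0,
            pages_emitted: 0,
        }
    }

    /// Turn everything currently buffered into one page.
    fn emit_page(&mut self) -> ToyPage {
        let values = std::mem::take(&mut self.buffered);
        let page = ToyPage {
            row_number: self.next_row,
            values,
        };
        self.next_row += page.values.len() as u64;
        self.pages_emitted += 1;
        page
    }

    /// Buffer a batch; emit a page only once the cache threshold is reached.
    pub fn maybe_encode(&mut self, batch: &[i32]) -> Vec<ToyPage> {
        self.buffered.extend_from_slice(batch);
        let buffered_bytes = self.buffered.len() * std::mem::size_of::<i32>();
        if buffered_bytes >= self.cache_bytes {
            vec![self.emit_page()]
        } else {
            Vec::new()
        }
    }

    /// Emit whatever is still buffered, if anything.
    pub fn flush(&mut self) -> Vec<ToyPage> {
        if self.buffered.is_empty() {
            Vec::new()
        } else {
            vec![self.emit_page()]
        }
    }

    /// Consume the encoder and report column-level metadata (here: page count).
    pub fn finish(self) -> usize {
        self.pages_emitted
    }
}
```

A writer built on this sketch would call `maybe_encode` once per batch, then `flush` once at the end of input, then `finish`.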
Usage
Use this module when:
- Writing Lance files (the writer calls into the encoding strategy to encode each field)
- Implementing a custom encoding strategy for specialized data types
- Configuring encoding parameters (page size, compression, buffer alignment)
Code Reference
| Source Location | rust/lance-encoding/src/encoder.rs |
|---|---|
| Key Traits | `FieldEncoder`, `FieldEncodingStrategy` |
| Key Structs | `StructuralEncodingStrategy`, `EncodedPage`, `EncodedColumn`, `EncodingOptions`, `OutOfLineBuffers` |
| Key Functions | `default_encoding_strategy(version)`, `default_encoding_strategy_with_params(version, params)` |
| Import | `use lance_encoding::encoder::{FieldEncoder, EncodingOptions, default_encoding_strategy};` |
I/O Contract
FieldEncoder Trait Methods:
| Method | Input | Output | Description |
|---|---|---|---|
| `maybe_encode` | `ArrayRef`, `&mut OutOfLineBuffers`, `RepDefBuilder`, `u64`, `u64` | `Result<Vec<EncodeTask>>` | Buffer data and optionally produce page encode tasks |
| `flush` | `&mut OutOfLineBuffers` | `Result<Vec<EncodeTask>>` | Flush remaining buffered data into pages |
| `finish` | `&mut OutOfLineBuffers` | `BoxFuture<Result<Vec<EncodedColumn>>>` | Finalize and return column metadata |
| `num_columns` | -- | `u32` | Number of output columns this field produces |
EncodedPage Fields:
| Field | Type | Description |
|---|---|---|
| `data` | `Vec<LanceBuffer>` | The encoded page buffers |
| `description` | `PageEncoding` | Encoding metadata for decoding |
| `num_rows` | `u64` | Number of rows in the page |
| `row_number` | `u64` | Top-level row number of the first row |
| `column_idx` | `u32` | Column index in the file |
EncodingOptions Fields:
| Field | Type | Default | Description |
|---|---|---|---|
| `cache_bytes_per_column` | `u64` | 8 MiB | Bytes to buffer per column before writing a page |
| `max_page_bytes` | `u64` | 32 MiB | Maximum page size before splitting |
| `keep_original_array` | `bool` | `true` | Whether to keep the original arrays when caching (`true`) instead of deep-copying them (`false`) |
| `buffer_alignment` | `u64` | 64 | Alignment for page buffers, in bytes |
| `version` | `LanceFileVersion` | default | Target file format version |
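As a worked illustration of these defaults, the sketch below expresses them in bytes and shows two pieces of arithmetic they imply. `ToyEncodingOptions`, `pages_needed`, and `pad_bytes` are hypothetical helpers, not part of `lance_encoding`. The even split by `max_page_bytes` is an assumption about how oversized data might be divided, not a statement of what the encoder actually does.

```rust
/// Toy mirror of the documented defaults (hypothetical; not the real struct).
pub struct ToyEncodingOptions {
    pub cache_bytes_per_column: u64,
    pub max_page_bytes: u64,
    pub buffer_alignment: u64,
}

impl Default for ToyEncodingOptions {
    fn default() -> Self {
        Self {
            cache_bytes_per_column: 8 * 1024 * 1024, // 8 MiB
            max_page_bytes: 32 * 1024 * 1024,        // 32 MiB
            buffer_alignment: 64,                    // bytes
        }
    }
}

/// Pages needed if `data_bytes` is split so that no page exceeds the cap
/// (assumption: a simple ceiling division).
pub fn pages_needed(data_bytes: u64, opts: &ToyEncodingOptions) -> u64 {
    if data_bytes == 0 {
        0
    } else {
        (data_bytes + opts.max_page_bytes - 1) / opts.max_page_bytes
    }
}

/// Padding bytes required to bring `len` up to the next aligned boundary.
pub fn pad_bytes(len: u64, opts: &ToyEncodingOptions) -> u64 {
    (opts.buffer_alignment - len % opts.buffer_alignment) % opts.buffer_alignment
}
```

With the default 64-byte alignment, a 100-byte buffer would need 28 bytes of padding, while a 128-byte buffer would need none.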
Usage Examples
```rust
use arrow_schema::{DataType, Field as ArrowField};
use lance_core::datatypes::Field;
use lance_encoding::encoder::{
    default_encoding_strategy, ColumnIndexSequence, EncodingOptions,
};
use lance_encoding::version::LanceFileVersion;

// Example Arrow field to encode (illustrative; substitute your own field)
let arrow_field = ArrowField::new("values", DataType::Int32, true);

// Create the encoding strategy for the default file version
let version = LanceFileVersion::default();
let strategy = default_encoding_strategy(version);

// Create an encoder for a specific field
let lance_field = Field::try_from(&arrow_field).unwrap();
let mut col_idx = ColumnIndexSequence::default();
let options = EncodingOptions::default();
let encoder = strategy
    .create_field_encoder(
        // The strategy also acts as the root that nested encoders delegate to
        strategy.as_ref(),
        &lance_field,
        &mut col_idx,
        &options,
    )
    .unwrap();
```
Related Pages
- Lance_format_Lance_BatchDecodeStream - The read-side counterpart that decodes pages produced by encoders
- Lance_format_Lance_Compression_Traits - `CompressionStrategy` used by `StructuralEncodingStrategy`
- Lance_format_Lance_RepDef - `RepDefBuilder` passed through the encoding pipeline to track validity and offsets
- Lance_format_Lance_EncodingFormat - Protobuf types used to describe encodings in page metadata
- Lance_format_Lance_LanceBuffer - The buffer type stored in `EncodedPage`
- Lance_format_Lance_DataBlock - Intermediate physical representation used during encoding