Principle:Apache Paimon Lance Table Configuration
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Columnar_Storage |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Mechanism for configuring Paimon tables to use the Lance columnar file format for optimized analytical workloads.
Description
Lance is a modern columnar format designed for ML and analytical workloads, with fast random access and efficient filtering. Paimon supports Lance as an alternative to its other supported data file formats, such as Parquet, ORC, and Avro. To configure a table with the Lance format, set the file.format option to lance when creating the schema. This enables Lance-specific optimizations, including vectorized reads, file-level predicate pushdown, and efficient column projection via the Lance file reader.
When creating a Lance-format table, the CoreOptions.FILE_FORMAT key is set to CoreOptions.FILE_FORMAT_LANCE in the schema options dictionary. This configuration is stored in the table metadata and instructs the Paimon read and write pipelines to use Lance-specific readers and writers for all subsequent operations on the table.
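The option described above can be sketched in Python as a plain dictionary. This is a minimal illustration rather than Paimon's own code: the two constants are local stand-ins that reuse the literal strings named in the text (file.format and lance), and the helper lance_table_options is hypothetical.

```python
# Local stand-ins for the option key and value described in the text.
# These are assumptions for illustration; in a real project they would be
# taken from Paimon's CoreOptions rather than redefined here.
FILE_FORMAT = "file.format"
FILE_FORMAT_LANCE = "lance"


def lance_table_options(extra=None):
    """Return a schema options dict that selects the Lance file format.

    `extra` holds any other table options; the file format entry is written
    last so it cannot be overridden by accident. (Hypothetical helper.)
    """
    options = dict(extra or {})
    options[FILE_FORMAT] = FILE_FORMAT_LANCE
    return options


# Combine the format setting with another table option.
options = lance_table_options({"bucket": "4"})
print(options)
```

In the Python client, a dictionary like this would typically be passed as the options argument when the schema is built (for example via Schema.from_pyarrow_schema, assuming that entry point is available in the installed version), after which it is stored with the table metadata.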
Usage
Use this configuration when creating tables for analytical or ML workloads that benefit from Lance's columnar layout. The Lance format is particularly well suited for:
- Analytical queries that scan large volumes of columnar data with predicate pushdown
- Machine learning pipelines that require fast random access to training data
- Vector similarity search workloads that benefit from Lance's native vector index support
- Wide tables where column projection can significantly reduce I/O
Theoretical Basis
Columnar file formats store data by column rather than by row, enabling efficient column projection, compression, and vectorized processing. Lance extends this with random access capabilities and native vector index support.
In traditional row-oriented storage, reading a subset of columns still requires scanning entire rows. Columnar formats like Lance store each column contiguously on disk, so reading only the needed columns requires proportionally less I/O. Lance further improves on this by adding:
- Random access: near constant-time lookup of a row by index, whereas Parquet typically has to read and decode the enclosing page or row group to reach a single row
- Vector indexing: Native support for approximate nearest neighbor (ANN) search on embedding columns
- Efficient appends: Append-optimized structure that avoids full file rewrites
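The I/O argument for column projection can be made concrete with a small back-of-the-envelope model. The sketch below is an idealized estimate, not a measurement of Lance or Paimon: it assumes fixed-width columns, ignores compression and metadata, and uses made-up column names and widths.

```python
def bytes_read_row_layout(num_rows, column_widths, wanted):
    """Row-oriented storage: every row is read in full, whatever is wanted."""
    return num_rows * sum(column_widths.values())


def bytes_read_columnar_layout(num_rows, column_widths, wanted):
    """Columnar storage: only the bytes of the requested columns are read."""
    return num_rows * sum(column_widths[name] for name in wanted)


# Hypothetical wide table: bytes per value for each column.
widths = {"id": 8, "label": 4, "embedding": 512, "payload": 1000}
rows = 1_000_000
wanted = ["id", "label"]

row_io = bytes_read_row_layout(rows, widths, wanted)
col_io = bytes_read_columnar_layout(rows, widths, wanted)

print(f"row layout:      {row_io:,} bytes")
print(f"columnar layout: {col_io:,} bytes")
print(f"reduction:       {row_io / col_io:.0f}x")
```

Under these assumptions, projecting two narrow columns out of a wide table reads 12 bytes per row instead of 1,524, which is why wide tables benefit most from column projection.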