Workflow: Apache Paimon Lance Format Analytics
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Analytics, Columnar_Storage |
| Last Updated | 2026-02-07 23:00 GMT |
Overview
End-to-end process for creating Paimon tables backed by the Lance columnar format, enabling optimized analytical queries with efficient column projection, predicate pushdown, and Ray Data integration.
Description
This workflow leverages the Lance columnar file format as the storage backend for Paimon tables. Lance provides efficient random access reads, fast column projection, and optimized scan performance for analytical workloads. The workflow covers table creation with Lance format configuration, multi-batch data writing with primary key support, and reading via multiple output formats (Pandas, Arrow, Ray Dataset). It combines Paimon's transactional guarantees (snapshots, atomic commits) with Lance's columnar performance.
Usage
Execute this workflow when analytical query performance is a priority and your workload involves frequent column-selective reads, range scans, or integration with ML pipelines that benefit from Lance's efficient random access. Lance format is particularly useful for tables with many columns where queries typically access only a subset.
Execution Steps
Step 1: Catalog and Lance Table Creation
Create a Paimon catalog and define a table with the Lance file format option. Set the file format to Lance in the schema options, configure primary keys for upsert support, and set the bucket count for write distribution. The table schema uses standard Paimon data types, which are automatically mapped to Lance's type system.
Key considerations:
- Set the table option `file.format` to `lance` when defining the schema
- Primary keys enable merge-on-read semantics with Lance files
- Bucket count affects the number of Lance files per partition
- Lance supports all Paimon primitive types and nested structures
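To make the bucket-count consideration concrete, here is a small stdlib-only model of fixed-bucket routing (not the pypaimon API; `bucket_for_key` is a hypothetical stand-in for Paimon's internal bucket hash): each row is routed to `hash(primary_key) % num_buckets`, and each bucket maintains its own set of Lance files.

```python
import hashlib

def bucket_for_key(key: str, num_buckets: int) -> int:
    # Stand-in for Paimon's internal bucket hash: deterministically route
    # a primary key to one of `num_buckets` write buckets.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_buckets

# With bucket = 2, rows with the same key always land in the same bucket,
# so upserts for a key only ever touch that bucket's Lance files.
rows = [{"id": f"user-{i}", "score": i} for i in range(10)]
buckets: dict[int, list[dict]] = {}
for row in rows:
    buckets.setdefault(bucket_for_key(row["id"], 2), []).append(row)

assert set(buckets) <= {0, 1}
assert sum(len(v) for v in buckets.values()) == 10
```

A higher bucket count spreads writes across more files (better parallelism, more small files); a lower count concentrates them.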
Step 2: Multi-batch Data Writing
Write data to the Lance table in multiple batches within a single write session. Each batch can be a Pandas DataFrame or PyArrow Table. The writer accumulates data files across batches, and a single commit at the end atomically publishes all written data.
Key considerations:
- Multiple batches can be written before committing
- Each batch produces one or more Lance data files
- The commit is atomic across all batches
- Primary key tables track keys for merge-on-read
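The write-then-commit protocol can be sketched with a toy model (this is not the pypaimon writer API, just the atomicity contract it describes): every batch lands as staged data files, and nothing becomes visible to readers until one commit publishes all staged files in a single snapshot.

```python
import uuid

class BatchWriter:
    """Toy model of write-then-commit: batches stage data files;
    a single commit atomically publishes every staged file."""

    def __init__(self, table: dict):
        self.table = table           # {"snapshots": [...]} shared with readers
        self.staged: list[str] = []  # files written but not yet committed

    def write_batch(self, rows: list[dict]) -> None:
        # Each batch produces one or more data files; here, one per batch.
        self.staged.append(f"data-{uuid.uuid4().hex}.lance")

    def commit(self) -> None:
        # One atomic snapshot covering all batches from this session.
        self.table["snapshots"].append(list(self.staged))
        self.staged.clear()

table = {"snapshots": []}
writer = BatchWriter(table)
writer.write_batch([{"id": 1}])
writer.write_batch([{"id": 2}])
assert table["snapshots"] == []         # invisible before commit
writer.commit()
assert len(table["snapshots"]) == 1     # one snapshot publishes both batches
assert len(table["snapshots"][0]) == 2  # two data files in that snapshot
```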
Step 3: Multi-format Reading
Read the table using different output formats depending on the use case. Convert results to Pandas DataFrames for interactive analysis, PyArrow Tables for zero-copy processing, or Ray Datasets for distributed computation. Lance's columnar layout enables efficient reads for all formats.
Key considerations:
- Pandas output materializes all data in memory
- Arrow output provides zero-copy access to columnar data
- Ray output distributes reads across workers
- All formats benefit from Lance's columnar storage
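The difference between the Arrow-style and Pandas-style outputs is essentially copy semantics. A minimal sketch (plain dicts and lists standing in for Arrow buffers and DataFrames): the Arrow-like path hands back references to the existing column buffers, while the Pandas-like path materializes every row as a new object.

```python
# Toy columnar result set: column name -> column buffer.
store = {"id": list(range(5)), "score": [x * 0.5 for x in range(5)]}

def to_arrow_like(columns: dict) -> dict:
    # Arrow-style output: share the existing column buffers (zero-copy
    # in spirit; real Arrow shares memory buffers the same way).
    return {name: col for name, col in columns.items()}

def to_pandas_like(columns: dict) -> list[dict]:
    # Pandas-style output: materialize all rows as new objects in memory.
    names = list(columns)
    n = len(columns[names[0]])
    return [{name: columns[name][i] for name in names} for i in range(n)]

arrow_out = to_arrow_like(store)
pandas_out = to_pandas_like(store)
assert arrow_out["id"] is store["id"]   # same buffer, no copy
assert pandas_out[0] == {"id": 0, "score": 0.0}
```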
Step 4: Predicate Pushdown on Lance Files
Apply filter predicates at the read builder level to push comparisons down to the Lance file reader. Supported predicates include equality, range comparisons, null checks, and compound expressions. The Lance reader evaluates predicates at the column chunk level, skipping irrelevant data pages.
Key considerations:
- Predicates are evaluated during scan planning and file reading
- Range predicates benefit from Lance's column statistics
- Compound predicates (AND, OR) are supported
- Pushdown reduces both I/O and memory usage
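Statistics-based skipping, the mechanism behind the range-predicate bullet above, can be shown with a toy model (chunk layout and stats shapes are illustrative, not the Lance on-disk format): each column chunk carries min/max statistics, and a range predicate skips any chunk whose statistics prove no row can match, before reading a single value.

```python
# Each "chunk" of a column carries (min, max) stats plus its values.
chunks = [
    {"stats": (0, 9),   "values": list(range(0, 10))},
    {"stats": (10, 19), "values": list(range(10, 20))},
    {"stats": (20, 29), "values": list(range(20, 30))},
]

def scan_greater_than(chunks, threshold):
    chunks_read = 0
    out = []
    for chunk in chunks:
        lo, hi = chunk["stats"]
        if hi <= threshold:   # stats prove no row matches: skip the chunk
            continue
        chunks_read += 1      # only now is the chunk's data actually read
        out.extend(v for v in chunk["values"] if v > threshold)
    return out, chunks_read

values, read = scan_greater_than(chunks, 14)
assert values == list(range(15, 30))
assert read == 2  # first chunk skipped entirely on statistics
```

This is why both I/O and memory drop: skipped chunks are never decoded, not merely filtered after the fact.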
Step 5: Column Projection with Lance
Select specific columns at read time using the projection API. Lance's columnar layout means only the requested columns are read from storage, providing significant I/O savings for wide tables. This optimization is applied at the file reader level.
Key considerations:
- Only projected columns are read from Lance files
- Projection works with both predicate pushdown and Ray integration
- Column order in the output matches the projection specification
- Nested column projection is supported for struct types
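A short sketch of why projection saves I/O in a columnar layout (a toy file class, not the Lance reader API): because each column is stored contiguously, the reader opens only the requested columns, and unread columns cost nothing.

```python
class ColumnarFile:
    """Toy columnar file: columns are independently readable, so a
    projection touches only the columns it names."""

    def __init__(self, columns: dict):
        self._columns = columns
        self.columns_read: list[str] = []  # I/O log for the demo

    def read(self, projection: list[str]) -> dict:
        out = {}
        for name in projection:             # output preserves projection order
            self.columns_read.append(name)  # record the I/O actually done
            out[name] = self._columns[name]
        return out

f = ColumnarFile({"id": [1, 2], "name": ["a", "b"], "payload": ["x" * 100] * 2})
result = f.read(["name", "id"])
assert list(result) == ["name", "id"]    # order matches the projection
assert f.columns_read == ["name", "id"]  # the wide "payload" column is never touched
```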
Step 6: Distributed Analytics with Ray
Convert the Lance-backed table to a Ray Dataset for distributed analytical operations. Apply Ray's filter, map, groupby, and aggregation functions to process data in parallel. The combination of Lance's efficient column reads and Ray's distributed execution provides high-performance analytics.
Key considerations:
- Ray blocks map to Paimon splits over Lance files
- GroupBy operations trigger data shuffles across workers
- Aggregation functions include sum, mean, min, max, and count
- Results can be collected or streamed back to the driver
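The groupby shuffle described above follows a map-shuffle-reduce shape, sketched here with stdlib Python (this is a conceptual model, not Ray's API): blocks of rows are shuffled so each worker owns a disjoint hash range of keys, each worker aggregates its partition, and the driver combines the partial results.

```python
from collections import defaultdict

def distributed_groupby_sum(blocks, num_workers):
    # Shuffle: route every row to the worker that owns its key's hash range.
    partitions = [defaultdict(int) for _ in range(num_workers)]
    for block in blocks:
        for key, value in block:
            partitions[hash(key) % num_workers][key] += value
    # Combine: merge per-worker partial aggregates on the driver.
    result = {}
    for part in partitions:
        result.update(part)  # key ranges are disjoint, so no collisions
    return result

# Three "blocks", standing in for Ray blocks read from Paimon splits.
blocks = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)], [("b", 5)]]
assert distributed_groupby_sum(blocks, 3) == {"a": 4, "b": 7, "c": 4}
```

In the real pipeline, each block would be produced by reading one Paimon split's Lance files, and only the grouping and aggregation columns need to be projected into the shuffle.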