Implementation:Apache Paimon Ray Dataset Operations
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Distributed_Computing |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Wrapper documentation for Ray Data operations used to process Paimon table data in distributed mode.
Description
Ray Dataset provides filter(), map(), and groupby() methods for distributed data processing. When used with Paimon-sourced data, these operations process data that was loaded via to_ray(). filter() applies row-level predicates (complementing Paimon's pushdown filters with application-level logic), map() transforms individual records, and groupby().sum()/count()/etc. perform distributed aggregations.
Usage
Chain operations on the Ray Dataset returned by to_ray(). All operations return new Dataset instances (immutable), allowing fluent chaining.
Code Reference
Source Location
External Tool (Wrapper) - Ray Dataset API documentation
Signature
# Ray Data API (external)
class Dataset:
def filter(self, fn: Callable[[Dict], bool]) -> Dataset:
def map(self, fn: Callable[[Dict], Dict]) -> Dataset:
def groupby(self, key: str) -> GroupedData:
class GroupedData:
def sum(self, on: str) -> Dataset:
def count(self) -> Dataset:
def mean(self, on: str) -> Dataset:
Import
import ray.data
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| fn (filter) | Callable[[Dict], bool] | Yes | Lambda or function returning True to keep a row |
| fn (map) | Callable[[Dict], Dict] | Yes | Lambda or function returning a transformed row dict |
| key (groupby) | str | Yes | Column name to group by |
| on (sum/mean) | str | Yes | Column name to aggregate |
Outputs
| Name | Type | Description |
|---|---|---|
| (return) | ray.data.dataset.Dataset | Transformed distributed Ray Dataset |
Usage Examples
Basic Usage
# Filter rows
filtered = ray_dataset.filter(lambda row: row['value'] > 100)
# Transform rows
transformed = ray_dataset.map(lambda row: {
**row,
'value_doubled': row['value'] * 2,
})
# Aggregate
aggregated = ray_dataset.groupby('category').sum('value')