Implementation:Apache Paimon Ray Dataset Operations


Knowledge Sources
Domains Data_Lake, Distributed_Computing
Last Updated 2026-02-07 00:00 GMT

Overview

Wrapper documentation for Ray Data operations used to process Paimon table data in distributed mode.

Description

Ray Dataset provides filter(), map(), and groupby() methods for distributed data processing. With Paimon, they operate on the Dataset returned by to_ray(). filter() applies row-level predicates in application code, complementing the filters that Paimon pushes down at read time. map() transforms individual records. groupby() followed by an aggregation such as sum(), count(), or mean() performs a distributed aggregation per key.
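As a minimal sketch of an application-level predicate, the function below combines a column comparison with a lookup against state held in Python. The allowlist and the helper name keep_row are hypothetical and exist only for illustration; the column names 'category' and 'value' are borrowed from the Basic Usage example on this page.

```python
# Hypothetical allowlist held in Python memory; not part of Paimon or Ray.
ALLOWED_CATEGORIES = frozenset({"books", "games"})


def keep_row(row: dict) -> bool:
    """Return True when the row's category is allowlisted and its value exceeds 100."""
    return row["category"] in ALLOWED_CATEGORIES and row["value"] > 100


# Usage, assuming `ray_dataset` was obtained from to_ray():
# filtered = ray_dataset.filter(keep_row)
```

A named function like this can be unit-tested on plain dicts before it is handed to filter().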

Usage

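Chain operations on the Ray Dataset returned by to_ray(). filter() and map() return a new Dataset, and groupby() returns a GroupedData whose aggregation methods return a Dataset. The source Dataset is never modified, so calls can be chained fluently.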
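The chaining described above can be sketched as follows. The helper names (is_large, add_doubled, build_pipeline) are hypothetical, and the columns 'value' and 'category' are assumed to exist in the table; only filter(), map(), groupby(), and sum() come from the Ray Data API.

```python
from typing import Any, Dict

Row = Dict[str, Any]


def is_large(row: Row) -> bool:
    """Predicate for filter(): keep rows whose value exceeds 100."""
    return row["value"] > 100


def add_doubled(row: Row) -> Row:
    """Transform for map(): return a new dict with one extra column."""
    return {**row, "value_doubled": row["value"] * 2}


def build_pipeline(ray_dataset):
    """Chain filter -> map -> groupby/sum on a Dataset obtained from to_ray()."""
    large = ray_dataset.filter(is_large)      # new Dataset
    enriched = large.map(add_doubled)         # new Dataset
    return enriched.groupby("category").sum("value_doubled")
```

Because each step returns a new object, intermediate results such as `large` remain usable for other branches of the computation.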

Code Reference

Source Location

External Tool (Wrapper) - Ray Dataset API documentation

Signature

# Ray Data API (external): abridged stubs, optional parameters omitted
from __future__ import annotations
from typing import Callable, Dict

class Dataset:
    def filter(self, fn: Callable[[Dict], bool]) -> Dataset: ...
    def map(self, fn: Callable[[Dict], Dict]) -> Dataset: ...
    def groupby(self, key: str) -> GroupedData: ...

class GroupedData:
    def sum(self, on: str) -> Dataset: ...
    def count(self) -> Dataset: ...
    def mean(self, on: str) -> Dataset: ...

Import

import ray.data

I/O Contract

Inputs

Name | Type | Required | Description
fn (filter) | Callable[[Dict], bool] | Yes | Lambda or function returning True to keep a row
fn (map) | Callable[[Dict], Dict] | Yes | Lambda or function returning a transformed row dict
key (groupby) | str | Yes | Column name to group by
on (sum/mean) | str | Yes | Column name to aggregate

Outputs

Name | Type | Description
(return) | ray.data.dataset.Dataset | Transformed distributed Ray Dataset

Usage Examples

Basic Usage

# Filter rows
filtered = ray_dataset.filter(lambda row: row['value'] > 100)

# Transform rows
transformed = ray_dataset.map(lambda row: {
    **row,
    'value_doubled': row['value'] * 2,
})

# Aggregate
aggregated = ray_dataset.groupby('category').sum('value')
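To read the aggregated result back on the driver, a small helper like the one below can turn the rows into a plain dict. This is a sketch under an assumption: it expects the aggregate column to be named 'sum(value)', which is how recent Ray releases label the output of sum('value'). Confirm the actual name with aggregated.schema() before relying on it. The helper name totals_by_category is hypothetical.

```python
from typing import Any, Dict, Iterable


def totals_by_category(
    rows: Iterable[Dict[str, Any]],
    agg_column: str = "sum(value)",  # assumed column name; verify via schema()
) -> Dict[Any, Any]:
    """Map each category to its aggregated value from an iterable of row dicts."""
    return {row["category"]: row[agg_column] for row in rows}


# Usage, assuming `aggregated` is the Dataset produced above:
# totals = totals_by_category(aggregated.take_all())
```

take_all() materializes every row on the driver, so this pattern is best kept for small aggregated results.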

Related Pages

Implements Principle

Requires Environment
