Implementation:Eventual Inc Daft Set Runner Ray
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Distributed_Computing |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Concrete tool for configuring distributed Ray execution provided by the Daft library, wrapping the Ray distributed computing framework.
Description
The set_runner_ray function configures Daft to use the Ray distributed computing framework for executing DataFrame operations. Once configured, all subsequent DataFrame operations are distributed across the Ray cluster. It can connect to an existing Ray cluster via an address, start a local Ray instance, or operate in client mode. The function returns a configured Runner object. This is a wrapper around the Ray framework, requiring the ray package as an external dependency.
Usage
Call daft.set_runner_ray() once at the beginning of your program to enable distributed execution. This can also be configured via the DAFT_RUNNER=ray environment variable.
Code Reference
Source Location
- Repository: Daft
- File:
daft/runners/__init__.py - Lines: L51-76
Signature
def set_runner_ray(
address: str | None = None,
noop_if_initialized: bool = False,
max_task_backlog: int | None = None,
force_client_mode: bool = False,
) -> Runner[PartitionT]
Import
import daft
# Configure Daft to use Ray runner
daft.set_runner_ray()
# Connect to specific Ray cluster
daft.set_runner_ray(address="ray://cluster-head:10001")
External Dependency
- Package:
ray(Ray documentation)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| address | str or None | No | Ray cluster address to connect to. If None, connects to or starts a local Ray instance. |
| noop_if_initialized | bool | No | If True, skip initialization if Ray is already running. Defaults to False. |
| max_task_backlog | int or None | No | Maximum number of tasks that can be queued. None means Daft will automatically determine a good default. |
| force_client_mode | bool | No | If True, forces Ray to run in client mode. Defaults to False. |
Outputs
| Name | Type | Description |
|---|---|---|
| return | Runner[PartitionT] | A configured Runner object with Ray execution settings |
Usage Examples
Basic Usage
import daft
# Enable Ray distributed execution
daft.set_runner_ray()
# All subsequent DataFrame operations will use Ray
df = daft.read_parquet("s3://bucket/large-dataset/")
result = df.where(daft.col("value") > 100).collect()
Connect to Remote Cluster
import daft
# Connect to a specific Ray cluster
daft.set_runner_ray(address="ray://cluster-head:10001")
# Process data across the distributed cluster
df = daft.read_parquet("s3://bucket/data/")
result = df.groupby("category").agg(daft.col("amount").sum()).collect()
Related Pages
Implements Principle
Requires Environment
- Environment:Eventual_Inc_Daft_Python_PyArrow_Core
- Environment:Eventual_Inc_Daft_Ray_Distributed_Runner