Heuristic: Eventual Inc Daft Execution Config Tuning
| Knowledge Sources | |
|---|---|
| Domains | Performance Tuning, Configuration, Query Execution |
| Last Updated | 2026-02-08 15:30 GMT |
Overview
Daft exposes a rich set of execution configuration parameters in daft/context.py that control morsel sizing, scan task batching, join strategies, sort sampling, file output sizes, and aggregation behavior. Most defaults are well-chosen, but three knobs -- default_morsel_size, scantask_max_parallel, and maintain_order -- are the highest-leverage tuning points for production workloads.
Description
The execution configuration lives in the DaftExecutionConfig object and is set via daft.set_execution_config(). Every parameter has a sensible default, but understanding what each controls is essential for performance-sensitive pipelines.
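As a minimal sketch of that pattern (the parameter name comes from the list below; verify it against your installed Daft version, since defaults and names can shift between releases):

```python
import daft

# Override one execution knob; unspecified parameters keep their defaults.
daft.set_execution_config(
    default_morsel_size=65536,  # halve the default morsel size to lower peak memory
)
```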
Morsel and Scan Task Sizing
- default_morsel_size: 131072 rows -- The number of rows processed per morsel in the native executor. Larger values improve throughput by amortizing per-morsel overhead; smaller values reduce peak memory usage.
- scan_tasks_min_size_bytes: 96 MB -- The minimum size of a scan task. Scan tasks smaller than this will be merged together.
- scan_tasks_max_size_bytes: 384 MB -- The maximum size of a scan task. Scan tasks larger than this will be split.
- max_sources_per_scan_task: 10 -- The maximum number of source files that can be merged into a single scan task.
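To see how these three sizing parameters interact, here is an illustrative greedy merge in plain Python. This is not Daft's actual planner logic, just a sketch of the merge/split behavior the bullets describe:

```python
# Illustrative sketch (not Daft internals): greedily merge small files into
# scan tasks while honoring the three sizing parameters described above.
MIN_SIZE = 96 * 1024**2      # scan_tasks_min_size_bytes
MAX_SIZE = 384 * 1024**2     # scan_tasks_max_size_bytes
MAX_SOURCES = 10             # max_sources_per_scan_task

def plan_scan_tasks(file_sizes):
    tasks, current, current_bytes = [], [], 0
    for size in file_sizes:
        # Close the current task if adding this file would exceed either limit.
        if current and (current_bytes + size > MAX_SIZE or len(current) == MAX_SOURCES):
            tasks.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
        if current_bytes >= MIN_SIZE:  # task is big enough: stop merging into it
            tasks.append(current)
            current, current_bytes = [], 0
    if current:
        tasks.append(current)
    return tasks

# Twenty 10 MB files get merged into two ~100 MB tasks of 10 files each.
tasks = plan_scan_tasks([10 * 1024**2] * 20)
```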
Join Strategy
- broadcast_join_size_bytes_threshold: 10 MiB -- If one side of a join is estimated to be below this size, Daft uses a broadcast join instead of a shuffle join, which avoids expensive data redistribution.
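If one join side is a dimension table known to be small but above the 10 MiB default, raising the threshold can let the planner choose a broadcast join. A hedged config fragment (the value is illustrative, not a recommendation):

```python
import daft

# Allow broadcast joins for small sides up to 64 MiB, e.g. a dimension table
# that is known to be ~50 MiB but would miss the 10 MiB default cutoff.
daft.set_execution_config(
    broadcast_join_size_bytes_threshold=64 * 1024 * 1024,
)
```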
Sort and Preview
- sample_size_for_sort: 20 -- The number of elements sampled per partition to determine sort boundaries. Increasing this improves sort quality at the cost of sampling overhead.
- num_preview_rows: 8 -- The number of rows displayed when previewing a DataFrame in a notebook or REPL.
File Output Sizing
- parquet_target_filesize: 512 MB -- Target size for output Parquet files.
- parquet_target_row_group_size: 128 MB -- Target size for individual Parquet row groups.
- parquet_inflation_factor: 3.0 -- Estimated ratio of in-memory size to on-disk Parquet file size. Used for planning how much data to buffer before flushing.
- csv_inflation_factor: 0.5 -- Equivalent inflation factor for CSV files.
- json_inflation_factor: 0.25 -- Equivalent inflation factor for JSON files.
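The inflation factors translate on-disk targets into in-memory buffer sizes during planning. A quick back-of-the-envelope check of that arithmetic:

```python
# How much in-memory data must be buffered to hit a 512 MB Parquet target,
# given a 3.0 inflation factor (in-memory size / on-disk size)?
parquet_target_filesize = 512 * 1024**2
parquet_inflation_factor = 3.0

buffer_bytes = parquet_target_filesize * parquet_inflation_factor
print(buffer_bytes / 1024**2)  # 1536.0 MB of in-memory data per output file
```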
Aggregation
- shuffle_aggregation_default_partitions: 200 -- Default number of output partitions for shuffle-based aggregations (Ray runner only).
- partial_aggregation_threshold: 10000 rows -- If a partial aggregation accumulates more than this many groups, it flushes early to avoid unbounded memory growth.
- high_cardinality_aggregation_threshold: 0.8 -- If the ratio of unique keys to total rows exceeds this threshold, Daft skips partial aggregation and falls through to a full shuffle aggregation.
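The cardinality check can be sketched as a simple decision rule. This mirrors the description above, not Daft's internal implementation:

```python
# Illustrative decision rule: skip partial aggregation when the observed key
# cardinality is too high for it to meaningfully reduce data volume.
HIGH_CARDINALITY_THRESHOLD = 0.8

def use_partial_aggregation(unique_keys: int, total_rows: int) -> bool:
    return unique_keys / total_rows <= HIGH_CARDINALITY_THRESHOLD

# Grouping 1M rows by a near-unique ID: partial agg would barely shrink the
# data, so Daft falls through to a full shuffle aggregation.
print(use_partial_aggregation(950_000, 1_000_000))  # False -> full shuffle
```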
I/O and Parallelism
- read_sql_partition_size_bytes: 512 MB -- Target partition size for SQL reads.
- scantask_max_parallel: 8 -- Maximum number of scan tasks executed in parallel. Can be overridden with the DAFT_SCANTASK_MAX_PARALLEL environment variable. Set to "auto" to use all available CPUs.
- pre_shuffle_merge_threshold: 1 GB -- Threshold for merging partitions before a shuffle operation.
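Environment-variable overrides like the one above must be in place before Daft starts executing work. For example:

```python
import os

# Set before any Daft query runs: "auto" sizes scan-task parallelism to the
# available CPUs, per the DAFT_SCANTASK_MAX_PARALLEL bullet above.
os.environ["DAFT_SCANTASK_MAX_PARALLEL"] = "auto"
```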
Ordering and Progress
- maintain_order: True -- Whether df.collect() preserves row order. When set to False, rows may be returned in arbitrary order, which enables the executor to skip expensive ordering steps.
- DAFT_PROGRESS_BAR=0 (environment variable) disables the progress bar, which is useful for benchmarking since the progress bar adds measurable overhead.
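A typical benchmarking setup combines both knobs. This sketch assumes maintain_order is settable through daft.set_execution_config() as this section implies; verify against your Daft version:

```python
import os
import daft

# Suppress the progress bar before any Daft work runs (it adds measurable
# overhead), and drop ordering guarantees when row order is irrelevant.
os.environ["DAFT_PROGRESS_BAR"] = "0"
daft.set_execution_config(maintain_order=False)
```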
Usage
Apply this heuristic when:
- A Daft pipeline is running slower than expected or consuming too much memory.
- You need to tune I/O parallelism for cloud storage (e.g., S3, GCS) reads.
- You are writing Parquet files and need to control output file sizes or row group sizes.
- You are performing large aggregations and hitting memory limits or observing slow partial aggregation.
- You are benchmarking and need deterministic, low-overhead execution.
The Insight (Rule of Thumb)
- Action: Start with default configuration values, then tune default_morsel_size (memory vs. throughput), scantask_max_parallel (I/O parallelism), and maintain_order (disable for unordered collections).
- Value: The defaults are designed for general-purpose workloads. Targeted tuning of these three parameters covers the vast majority of performance optimization scenarios.
- Trade-off: Increasing default_morsel_size improves throughput but raises peak memory usage. Increasing scantask_max_parallel improves I/O throughput but can overwhelm network bandwidth or storage backends. Setting maintain_order=False improves performance but sacrifices deterministic row ordering.
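Putting the rule of thumb together, a starting point for a throughput-oriented pipeline where row order does not matter (values are illustrative, not recommendations; confirm parameter names against your Daft version):

```python
import os
import daft

os.environ["DAFT_SCANTASK_MAX_PARALLEL"] = "auto"  # use all CPUs for scans

daft.set_execution_config(
    default_morsel_size=262144,  # 2x the default: more throughput, more memory
    maintain_order=False,        # row order is irrelevant for this workload
)
```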
Reasoning
Daft's execution engine processes data in morsels -- fixed-size chunks that flow through the query plan. The morsel size directly controls the granularity of parallelism and memory allocation. A morsel size of 131072 rows (the default) strikes a balance for typical columnar data, but workloads with very wide rows (many columns or large string fields) may benefit from smaller morsels to keep memory bounded.
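A rough way to size morsels for wide rows, assuming you can estimate the average in-memory row width (illustrative arithmetic only):

```python
# Estimate peak per-morsel memory from row width, then pick a morsel size
# that keeps each morsel under a chosen memory budget.
avg_row_bytes = 4096            # e.g., wide rows with large string fields
memory_budget = 64 * 1024**2    # cap each morsel at 64 MB

default_morsel = 131072
default_morsel_mem = default_morsel * avg_row_bytes  # 512 MB: likely too big
tuned_morsel = memory_budget // avg_row_bytes        # 16384 rows per morsel
```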
Scan task parallelism (scantask_max_parallel) controls how many files or file segments are read concurrently. The default of 8 is conservative; cloud object stores like S3 can handle much higher parallelism, so setting this to "auto" (via the DAFT_SCANTASK_MAX_PARALLEL environment variable) can significantly improve read throughput on machines with many cores.
The maintain_order flag is particularly impactful. When enabled (the default), Daft must preserve the insertion order of rows through every operator in the plan, which constrains scheduling flexibility. Disabling it allows the executor to process and emit partitions in whatever order completes first, which is strictly faster for workloads where row order is irrelevant (e.g., aggregations, writes).
The aggregation thresholds (partial_aggregation_threshold and high_cardinality_aggregation_threshold) are adaptive: Daft attempts a partial (pre-shuffle) aggregation first, but if it detects high cardinality (more than 80% unique keys), it abandons the partial strategy and falls through to a full shuffle. The 10000-row flush threshold prevents the partial hash table from growing without bound.