Heuristic: Apache Spark Partition Sizing Tips
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Partitioning |
| Last Updated | 2026-02-08 22:00 GMT |
Overview
Partition sizing heuristics: 2-3 tasks per CPU core, 128MB max per file partition, 200 default shuffle partitions (usually wrong), and Adaptive Query Execution (AQE) for automatic runtime optimization.
Description
Partition count is one of the most impactful tuning parameters in Spark. Too few partitions lead to memory pressure and underutilized cores. Too many partitions create excessive scheduling overhead and small-file problems. The default of 200 shuffle partitions is a compromise that is often too high for small datasets and too low for large ones. Since Spark 3.2.0, Adaptive Query Execution (AQE) is enabled by default and can dynamically coalesce shuffle partitions at runtime, largely eliminating the need for manual tuning.
Usage
Use these heuristics when tuning `spark.sql.shuffle.partitions`, `spark.default.parallelism`, or `spark.sql.files.maxPartitionBytes`. Also apply them when encountering OOM errors during reduce operations (`groupByKey`, `join`), since increasing parallelism shrinks each task's working set.
The Insight (Rule of Thumb)
- Action: Set `spark.default.parallelism` to 2-3x the number of CPU cores across the cluster.
- Value: For SQL workloads, rely on AQE (`spark.sql.adaptive.enabled = true`) with a large initial `spark.sql.shuffle.partitions` (e.g., 2000) and let AQE coalesce.
- File Partitioning: `spark.sql.files.maxPartitionBytes` = 128MB (default). Over-estimate `spark.sql.files.openCostInBytes` (4MB default) so that many small files are packed into one partition; those partitions then finish faster than the bigger-file partitions that are scheduled first.
- Skew Handling: AQE detects skewed partitions when size > 5x median AND > 256MB, then splits them automatically.
- Trade-off: More partitions mean smaller per-task memory but more scheduling overhead. Tasks as short as 200ms are efficient due to JVM reuse.
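The 2-3x rule above can be sanity-checked with a trivial calculation (a sketch; the helper name is illustrative, the factor comes from the Spark tuning guide):

```python
def recommended_parallelism(total_cores: int, factor: int = 3) -> int:
    """Rule of thumb from the Spark tuning guide: 2-3 tasks per CPU core."""
    return total_cores * factor

# A hypothetical 10-node cluster with 16 cores per node:
cores = 10 * 16
print(recommended_parallelism(cores, factor=2))  # 320
print(recommended_parallelism(cores, factor=3))  # 480
```

Compare the result against your current `spark.default.parallelism`; values far below it suggest idle cores, values far above it suggest scheduling overhead.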
AQE Parallelism First Mode
- Default: `spark.sql.adaptive.coalescePartitions.parallelismFirst = true`
- Effect: Ignores the 64MB advisory target and only respects the 1MB minimum (`spark.sql.adaptive.coalescePartitions.minPartitionSize`). Maximizes parallelism.
- When to change: Set to `false` on busy shared clusters where resource efficiency matters more than single-query speed.
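Put together, the AQE-first approach might look like this as a `spark-submit` invocation (a sketch for the shared-cluster case; the 2000 starting value is the illustrative figure from this card, not a Spark default, and `your_job.py` is a placeholder):

```shell
spark-submit \
  --conf spark.sql.adaptive.enabled=true \
  --conf spark.sql.shuffle.partitions=2000 \
  --conf spark.sql.adaptive.coalescePartitions.enabled=true \
  --conf spark.sql.adaptive.coalescePartitions.parallelismFirst=false \
  --conf spark.sql.adaptive.advisoryPartitionSizeInBytes=64m \
  your_job.py
```

Drop the `parallelismFirst=false` line on a dedicated cluster where single-query speed matters more than resource efficiency.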
Reasoning
Spark reuses the JVM across tasks within an executor, making task startup cost negligible (unlike Hadoop MapReduce). This means it is safe to use far more tasks than physical cores. More tasks directly reduce per-task working sets, which is the primary fix for OOM errors during reduce operations. The 200ms minimum task duration guideline balances parallelism against the irreducible scheduling overhead.
The AQE advisory partition size of 64MB represents a sweet spot: small enough to avoid memory pressure, large enough to amortize task scheduling overhead. The skew detection thresholds (5x median, >256MB) are calibrated to avoid false positives while catching genuinely problematic data distributions.
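The skew-detection rule is a simple conjunction of the two thresholds (a minimal sketch; the function name is illustrative, the constants mirror the AQE defaults):

```python
SKEW_FACTOR = 5.0                   # spark.sql.adaptive.skewJoin.skewedPartitionFactor
SKEW_THRESHOLD = 256 * 1024 * 1024  # spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes

def is_skewed(partition_bytes: int, median_bytes: int) -> bool:
    """A partition is flagged as skewed only when BOTH conditions hold."""
    return (partition_bytes > SKEW_FACTOR * median_bytes
            and partition_bytes > SKEW_THRESHOLD)

# A 1 GB partition against a 100 MB median: 10x the median and above 256 MB.
print(is_skewed(1024 * 1024 * 1024, 100 * 1024 * 1024))  # True
# A 400 MB partition against a 100 MB median: only 4x the median.
print(is_skewed(400 * 1024 * 1024, 100 * 1024 * 1024))   # False
```

The two-condition design is why moderately uneven but small partitions are left alone: splitting them would add scheduling overhead without relieving any real memory pressure.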
For file-based reading, the openCostInBytes parameter affects how eagerly Spark assigns small files to the same partition. Setting it higher causes more aggressive coalescing of small files, which is beneficial for datasets with many small files (common with streaming output or partitioned tables).
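To make the open-cost effect concrete, Spark's file-partition planner computes a target split size roughly as follows (a simplified sketch of the logic in Spark's `FilePartition.maxSplitBytes`, not the exact source):

```python
def max_split_bytes(file_sizes, default_parallelism,
                    max_partition_bytes=128 * 1024 * 1024,  # spark.sql.files.maxPartitionBytes
                    open_cost_in_bytes=4 * 1024 * 1024):    # spark.sql.files.openCostInBytes
    """Every file is padded by open_cost_in_bytes, so raising the open cost
    makes small files 'look' bigger and pack into fewer partitions."""
    total = sum(size + open_cost_in_bytes for size in file_sizes)
    bytes_per_core = total // default_parallelism
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))

# 10,000 tiny 1 KB files on 8 cores: the padded total is huge, so the split
# size clips at the 128 MB ceiling.
print(max_split_bytes([1024] * 10000, 8))  # 134217728
```

With only a handful of medium files, `bytes_per_core` governs instead, which is how a higher open cost shifts the balance toward fewer, fuller partitions.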
Code Evidence
From `docs/tuning.md:262-263`:
```
spark.default.parallelism to change default
Recommendation: 2-3 tasks per CPU core in cluster
```
From `docs/tuning.md:277-285`:
```
"Sometimes you get OutOfMemoryError not because RDDs don't fit,
but because working set of one task (e.g., reduce in groupByKey) was too large.
Fix: Increase level of parallelism
- Spark reuses executor JVM across tasks (low task launch cost)
- Can safely increase parallelism beyond number of cores
- Tasks as short as 200ms are efficient"
```
AQE configuration from `docs/sql-performance-tuning.md:266-309`:
```
spark.sql.adaptive.enabled = true (default since 3.2.0)
spark.sql.adaptive.coalescePartitions.enabled = true
spark.sql.adaptive.advisoryPartitionSizeInBytes = 64 MB
spark.sql.adaptive.coalescePartitions.parallelismFirst = true (default)
- Ignores 64MB target, only respects 1MB minimum
- Set to false on busy clusters for better resource utilization
```
Skew join thresholds from `docs/sql-performance-tuning.md:370-406`:
```
spark.sql.adaptive.skewJoin.enabled = true
spark.sql.adaptive.skewJoin.skewedPartitionFactor = 5.0
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes = 256 MB
```