Heuristic: Apache Spark Hardware Provisioning Tips
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Hardware_Provisioning |
| Last Updated | 2026-02-08 22:00 GMT |
Overview
Hardware provisioning heuristics for Spark clusters: 4-8 disks per node without RAID, at most 75% of RAM allocated to Spark, 10 Gigabit or faster networking, and 8-16 CPU cores per machine.
Description
Spark's distributed computing model has specific hardware requirements that differ from traditional database or web server deployments. The key insights relate to I/O parallelism (RAID hurts shuffle performance), memory partitioning (the OS and buffer cache need headroom), network bandwidth (often the real bottleneck once data is in memory), and CPU core density (Spark scales well with minimal thread contention).
Usage
Use these heuristics when provisioning new cluster hardware, sizing cloud instances for Spark workloads, or diagnosing performance bottlenecks in existing deployments. These guidelines apply to both standalone and Kubernetes deployments.
The Insight (Rule of Thumb)
- Disks: Use 4-8 disks per node, mounted separately (NOT in RAID). Mount with `noatime` option. Set `spark.local.dir` to a comma-separated list of all mount points.
- Memory: Allocate at most 75% of machine RAM to Spark executors. Leave the rest for the OS buffer cache and other processes.
- JVM Limit: Never allocate more than 200GB to a single JVM. Launch multiple executors per node instead (standalone mode does this automatically).
- Network: Use 10 Gigabit or higher network. Network is often the bottleneck for distributed reduce operations (groupBy, reduceBy, SQL joins) once data is in memory.
- CPU: 8-16 cores per machine minimum. Spark scales well due to minimal inter-thread sharing.
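The memory rules above (75% of RAM for Spark, at most 200GB per JVM) can be sketched as a small sizing helper. The 768 GB node below is a hypothetical example, not a figure from the source:

```python
import math

def size_executors(ram_gb, max_jvm_gb=200, spark_fraction=0.75):
    """Split one node's RAM into executors per the heuristics above."""
    spark_gb = ram_gb * spark_fraction      # leave ~25% for OS/buffer cache
    n = math.ceil(spark_gb / max_jvm_gb)    # stay under the per-JVM limit
    return n, spark_gb / n

# Hypothetical 768 GB node: 576 GB for Spark, as 3 executors of 192 GB each.
print(size_executors(768))  # -> (3, 192.0)
```

In standalone mode the worker performs this split automatically; elsewhere the per-executor figure maps onto `spark.executor.memory`.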
Reasoning
Disks: RAID (especially RAID 5/6) reduces I/O parallelism for Spark's shuffle operations because the RAID controller serializes writes and adds parity computation overhead. Spark's shuffle already distributes data across multiple partitions, so independent disks provide better aggregate throughput. The `noatime` mount option eliminates unnecessary inode access time updates on every read, which can be significant for the small random reads in shuffle fetch.
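Concretely, the independent-disk layout turns into a comma-separated `spark.local.dir` value. The mount points below are hypothetical placeholders for a six-disk node:

```python
# Hypothetical mount points for six independently mounted disks (no RAID),
# each mounted with the noatime option.
mounts = [f"/mnt/disk{i}" for i in range(1, 7)]

# spark.local.dir accepts a comma-separated list; Spark spreads shuffle
# and spill files across every entry for aggregate I/O throughput.
local_dir = ",".join(mounts)
print(local_dir)
```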
Memory: The 75% rule exists because the OS buffer cache provides free read caching for shuffle files and input data. Starving the OS of memory forces it to evict cached pages, causing additional disk I/O that Spark's memory manager cannot compensate for.
Network: When dataset size exceeds available RAM and shuffle operations are required, network bandwidth becomes the primary bottleneck. A 10 Gigabit network can transfer ~1GB/s, which means a 100GB shuffle takes ~100 seconds in transit alone.
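The back-of-envelope transit estimate can be written out as a quick calculation; the 80% link-efficiency factor is an assumption chosen to match the ~1 GB/s practical figure above:

```python
def shuffle_transit_seconds(shuffle_gb, link_gbps=10.0, efficiency=0.8):
    """Rough lower bound on time a shuffle spends on the wire for one link."""
    usable_gb_per_sec = link_gbps / 8 * efficiency  # 10 Gb/s ~= 1 GB/s usable
    return shuffle_gb / usable_gb_per_sec

# A 100 GB shuffle over 10 GbE: ~100 seconds in transit alone.
print(shuffle_transit_seconds(100))  # -> 100.0
```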
CPU: Spark's in-memory execution model means CPU becomes the bottleneck only once data is already cached in RAM. The 8-16 core recommendation reflects the typical compute-to-memory ratio of modern workloads; provision more cores if jobs remain CPU-bound after data is in memory.
Code Evidence
From `docs/hardware-provisioning.md:42-50`:
Recommendation: 4-8 disks per node, WITHOUT RAID
Mount separately, not as RAID array
Use noatime option in Linux to reduce unnecessary writes
Set spark.local.dir to comma-separated list of mount points
From `docs/hardware-provisioning.md:52-68`:
Recommendation: 8 GB to hundreds of GB per machine
Allocate at most 75% of memory to Spark (rest for OS/buffer cache)
Java VM behavior > 200 GB RAM:
- Launch multiple executors per node instead of single large executor
- Standalone mode: worker launches multiple executors automatically
From `docs/hardware-provisioning.md:70-76`:
Recommendation: 10 Gigabit or higher network
Critical for distributed reduce apps (group-by, reduce-by, SQL joins)
Check shuffle data volume in monitoring UI at http://<driver>:4040
From `docs/hardware-provisioning.md:78-83`:
Recommendation: At least 8-16 cores per machine
Spark scales well due to minimal thread sharing
More cores needed if CPU-bound after data is in memory