Implementation:Apache Spark Distribution Placement
| Field | Value |
|---|---|
| Source | Spark Standalone documentation |
| Domains | Deployment |
| Type | Pattern Doc |
| Related | Principle:Apache_Spark_Cluster_Installation |
Overview
Pattern documentation for deploying Spark binary distributions to cluster nodes.
Description
Spark standalone clusters require the Spark distribution to be present at the same SPARK_HOME path on every node. The conf/workers file lists worker hostnames, one per line; the launch scripts read it to manage daemons on those hosts. Because the master contacts each worker over SSH, passwordless SSH access must be configured from the master to all workers.
The deployment involves three key steps:
- Software placement -- extracting the Spark distribution to an identical path on all nodes
- Inventory configuration -- populating the conf/workers file with all worker hostnames
- SSH setup -- establishing passwordless SSH from the master to every worker node
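Under stated assumptions (a demo SPARK_HOME path, a placeholder tarball name, and hypothetical hostnames worker1..worker3), the three steps above can be sketched as a single script. The remote scp/ssh commands are printed rather than executed, so the sketch is safe to run as-is; remove the leading `echo` to deploy for real:

```shell
#!/bin/sh
# Sketch only: path, version, and hostnames are assumptions, not from the docs.
SPARK_HOME=./spark-home-demo           # on a real cluster: the shared path, e.g. /opt/spark
TARBALL=spark-3.5.3-bin-hadoop3.tgz    # substitute the version you downloaded
WORKERS="worker1 worker2 worker3"      # hypothetical worker hostnames

mkdir -p "$SPARK_HOME/conf"

# 1. Software placement: identical SPARK_HOME on every node.
#    (echo prints the commands; remove it to actually deploy)
for host in $WORKERS; do
  echo scp "$TARBALL" "$host:/tmp/"
  echo ssh "$host" "mkdir -p $SPARK_HOME && tar -xzf /tmp/$TARBALL -C $SPARK_HOME --strip-components=1"
done

# 2. Inventory: one hostname per line (unquoted $WORKERS splits into words).
printf '%s\n' $WORKERS > "$SPARK_HOME/conf/workers"

# 3. SSH setup: passwordless login from the master to each worker.
for host in $WORKERS; do
  echo ssh-copy-id "$host"
done
```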
Usage
Use after downloading or building a Spark distribution, before starting cluster daemons. This is the foundational step that must be completed before any other cluster configuration or startup operation.
Code Reference
Source: docs/spark-standalone.md (L33-95). This is a deployment pattern, not a single script.
Key files:
| File | Purpose |
|---|---|
| conf/workers | One hostname per line, enumerating all worker nodes |
| conf/spark-env.sh | Environment variable overrides for the cluster |
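A minimal conf/spark-env.sh illustrating the kind of overrides set here. The variable names are the standard ones documented for standalone mode; the values are placeholders, not recommendations:

```shell
# conf/spark-env.sh -- sourced by the daemon launch scripts on each node.
# Values below are illustrative placeholders.

SPARK_MASTER_HOST=master.example.com   # hostname the master binds to
SPARK_WORKER_CORES=4                   # cores each worker offers to applications
SPARK_WORKER_MEMORY=8g                 # memory each worker offers to applications
```

Like the rest of the distribution, this file should be copied to every node so the whole cluster sees the same settings.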
I/O
| Direction | Description |
|---|---|
| Inputs | Spark distribution (built or downloaded), conf/workers file, SSH keys |
| Outputs | Identical SPARK_HOME tree on every node; populated conf/workers; passwordless SSH from master to workers |
Examples
1. Download and extract the Spark distribution on all nodes:

```shell
tar -xzf spark-<version>-bin-hadoop3.tgz
```

2. Configure the workers inventory file (one hostname per line):

```shell
echo "worker1
worker2
worker3" > conf/workers
```

3. Set up passwordless SSH from the master to all workers (run ssh-keygen first if the master has no key pair yet):

```shell
ssh-copy-id worker1
ssh-copy-id worker2
ssh-copy-id worker3
```
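With the three steps above complete, the daemons are started from the master using the launch scripts shipped in the distribution's sbin/ directory. The sketch below only iterates a demo inventory and prints what a pre-start check would do; the SSH probe and launch commands are left as comments, since the hostnames here are hypothetical:

```shell
#!/bin/sh
# Demo inventory as produced by step 2 above (hypothetical hostnames).
mkdir -p conf
printf 'worker1\nworker2\nworker3\n' > conf/workers

# Pre-start sanity pass over the inventory, one host per line:
while read -r host; do
  # A real check would probe each node over SSH, e.g.:
  #   ssh "$host" "test -d /opt/spark" || echo "$host: SPARK_HOME missing" >&2
  echo "would verify SPARK_HOME on $host"
done < conf/workers

# Then launch from the master (scripts shipped with the distribution):
#   ./sbin/start-master.sh     # master only
#   ./sbin/start-workers.sh    # every host in conf/workers, over SSH
#   ./sbin/start-all.sh        # both in one step
```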