Environment:Heibaiying_BigData_Notes_Spark_2.4_Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Batch_Processing, Stream_Processing |
| Last Updated | 2026-02-10 10:00 GMT |
Overview
Apache Spark 2.4.x environment with Scala 2.11/2.12, supporting Spark SQL, Structured API, and Spark Streaming for batch and stream data processing.
Description
This environment provides Apache Spark 2.4.x (versions 2.4.0 and 2.4.3 used across modules) with both Scala 2.11 and 2.12 binary variants. It includes Spark SQL for structured data analysis, the Structured API for DataFrame/Dataset operations, and Spark Streaming for real-time processing. Integration modules support Kafka (`spark-streaming-kafka-0-10_2.12:2.4.3`) and Flume (`spark-streaming-flume_2.11:2.4.3`).
Usage
Use this environment for Spark SQL Data Analysis and Spark Streaming operations. It is the mandatory prerequisite for the Spark SQL Data Analysis workflow and all Spark Streaming examples (basic, Kafka, Flume integrations).
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (CentOS 7.6 recommended) | Any Linux with JDK 8 |
| Java | JDK 1.8 | Required by Spark 2.4 |
| Scala | 2.11 or 2.12 | Depending on module variant |
| Hardware | Minimum 2GB RAM per Worker | Default allocates "all memory - 1GB" |
| Disk | 20GB+ | For shuffle data and temp files |
Dependencies
System Packages
- `spark` = 2.4.3 (pre-built for Hadoop 2.6)
- `java-1.8.0-openjdk-devel`
- `scala` = 2.12.8 (optional, for Scala shell)
Java/Scala Packages (Maven)
- `org.apache.spark:spark-streaming_2.12` = 2.4.3
- `org.apache.spark:spark-streaming-kafka-0-10_2.12` = 2.4.3 (for Kafka integration)
- `org.apache.spark:spark-streaming-flume_2.11` = 2.4.3 (for Flume integration)
- `redis.clients:jedis` = 2.9.0 (for Redis output)
- `com.thoughtworks.paranamer:paranamer` = 2.8
Environment Variables
- `SPARK_HOME` = Spark installation directory
- `PATH` includes `$SPARK_HOME/bin` and `$SPARK_HOME/sbin`
Credentials
No API credentials required for Spark itself. For Hadoop integration:
- `HADOOP_HOME` and `HADOOP_CONF_DIR` must be set for YARN deployment mode.
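As a sketch, assuming Hadoop 2.6 is unpacked under `/opt` (a hypothetical path; adjust to your cluster layout), the YARN-related variables can be exported alongside `SPARK_HOME`:

```shell
# Hypothetical Hadoop install path; change to match your cluster
export HADOOP_HOME=/opt/hadoop-2.6.0
# Points Spark at the YARN/HDFS client configuration files
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
```

With these set, `spark-submit --master yarn` can locate the ResourceManager and HDFS without further flags.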
Quick Install
```shell
# Download Spark 2.4.3 pre-built for Hadoop 2.6
wget https://archive.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.6.tgz
tar -xzf spark-2.4.3-bin-hadoop2.6.tgz -C /opt/

# Configure environment
export SPARK_HOME=/opt/spark-2.4.3-bin-hadoop2.6
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
```
Code Evidence
Spark Streaming dependency from `spark-streaming-basis/pom.xml`:
```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.12</artifactId>
    <version>2.4.3</version>
</dependency>
```
Spark Kafka integration from `spark-streaming-kafka/pom.xml`:
```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
    <version>2.4.3</version>
</dependency>
```
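The Flume integration listed under Dependencies (`spark-streaming-flume_2.11:2.4.3`) follows the same pattern; note the Scala 2.11 suffix, unlike the 2.12 artifacts above. A sketch of the corresponding `pom.xml` entry:

```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-flume_2.11</artifactId>
    <version>2.4.3</version>
</dependency>
```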
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `Task not serializable` | Non-serializable object in closure | Use static lazy initialization for connection pools (e.g., Jedis) |
| No streaming output with `local[1]` | The single thread is consumed by the receiver, leaving none for processing | Use `local[2]` or higher for Spark Streaming (one thread for the receiver, the rest for processing) |
| `Application JAR not found` | JAR not accessible on cluster | Ensure JAR is on HDFS or same local path on all nodes |
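The `Task not serializable` fix above can be sketched as a per-JVM lazy holder: the connection pool lives in a static field on each executor and is never captured by a task closure. This is a minimal Java sketch (the same pattern applies in Scala with a `lazy val` in an `object`); `PoolHolder` and `createPool` are illustrative stand-ins, not Jedis API.

```java
// Sketch: lazy per-JVM initialization that avoids "Task not serializable".
// The pool is created on first access inside the executor process,
// never serialized from the driver.
public class PoolHolder {
    private static int initCount = 0;

    // Initialization-on-demand holder idiom: Holder.POOL is built on
    // first call to getPool(), exactly once per JVM (i.e. per executor).
    private static class Holder {
        static final String POOL = createPool();
    }

    private static String createPool() {
        initCount++;
        return "pool"; // real code would construct a JedisPool here
    }

    public static String getPool() { return Holder.POOL; }
    public static int getInitCount() { return initCount; }
}
```

Inside `foreachPartition`, call `PoolHolder.getPool()` instead of referencing a driver-side pool object; the task closure then contains no connection state to serialize.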
Compatibility Notes
- YARN deployment: Both YARN and HDFS must be running; Spark uses HDFS for temporary files.
- Spark Streaming exclusion: Exclude `spark-streaming` from uber-JAR when deploying to cluster (already in Spark's `jars/` directory).
- Spark Standalone HA: Include all master addresses in URL format: `spark://HOST1:PORT1,HOST2:PORT2`.
- Worker memory: Default is "all available memory - 1GB"; configure explicitly for production.
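One common way to realize the uber-JAR exclusion noted above is to mark the Spark dependency as `provided`, so the shade/assembly plugin leaves it out of the fat JAR while it remains available at compile time. A sketch:

```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.12</artifactId>
    <version>2.4.3</version>
    <!-- provided: available at compile time, supplied by the
         cluster's jars/ directory at runtime -->
    <scope>provided</scope>
</dependency>
```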
Related Pages
- Implementation:Heibaiying_BigData_Notes_SparkSession_Builder
- Implementation:Heibaiying_BigData_Notes_Spark_Read_External_Data
- Implementation:Heibaiying_BigData_Notes_DataFrame_Transformation_API
- Implementation:Heibaiying_BigData_Notes_Spark_SQL_View_Registration
- Implementation:Heibaiying_BigData_Notes_Spark_Agg_and_Join_API
- Implementation:Heibaiying_BigData_Notes_Spark_Write_External_Data