
Environment: Heibaiying BigData Notes Spark 2.4 Environment

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Batch_Processing, Stream_Processing
Last Updated: 2026-02-10 10:00 GMT

Overview

Apache Spark 2.4.x environment with Scala 2.11/2.12, supporting Spark SQL, Structured API, and Spark Streaming for batch and stream data processing.

Description

This environment provides Apache Spark 2.4.x (versions 2.4.0 and 2.4.3 are used across modules) with both Scala 2.11 and 2.12 binary variants. It includes Spark SQL for structured data analysis, the Structured API for DataFrame/Dataset operations, and Spark Streaming for near-real-time (micro-batch) stream processing. Integration modules support Kafka (`spark-streaming-kafka-0-10_2.12:2.4.3`) and Flume (`spark-streaming-flume_2.11:2.4.3`).
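As a quick illustration of the Structured API in this environment, here is a minimal Scala sketch; the input path, column names, and app name are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

// Spark 2.4.x entry point for the Structured API
val spark = SparkSession.builder()
  .appName("structured-api-sketch")
  .master("local[*]")          // local mode; use a cluster URL in production
  .getOrCreate()
import spark.implicits._

// Hypothetical input file and columns
val df = spark.read.json("/tmp/events.json")
df.filter($"status" === "ok")
  .groupBy("user")
  .count()
  .show()
```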

Usage

Use this environment for Spark SQL Data Analysis and Spark Streaming operations. It is the mandatory prerequisite for the Spark SQL Data Analysis workflow and all Spark Streaming examples (basic, Kafka, Flume integrations).
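For the streaming side, a minimal Scala sketch of a socket word count; the host and port are placeholders. Note `local[2]`: with a receiver-based source, one thread is consumed by the receiver, so at least one more is needed for processing:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("streaming-sketch")
  .setMaster("local[2]")  // >= 2 threads: one for the receiver, one for processing

val ssc = new StreamingContext(conf, Seconds(5))

// Hypothetical source: text lines arriving on a socket
val lines = ssc.socketTextStream("localhost", 9999)
lines.flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
  .print()

ssc.start()
ssc.awaitTermination()
```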

System Requirements

| Category | Requirement | Notes |
|----------|-------------|-------|
| OS | Linux (CentOS 7.6 recommended) | Any Linux with JDK 8 |
| Java | JDK 1.8 | Required by Spark 2.4 |
| Scala | 2.11 or 2.12 | Depending on module variant |
| Hardware | Minimum 2GB RAM per Worker | Default allocates "all memory - 1GB" |
| Disk | 20GB+ | For shuffle data and temp files |

Dependencies

System Packages

  • `spark` = 2.4.3 (pre-built for Hadoop 2.6)
  • `java-1.8.0-openjdk-devel`
  • `scala` = 2.12.8 (optional, for Scala shell)

Java/Scala Packages (Maven)

  • `org.apache.spark:spark-streaming_2.12` = 2.4.3
  • `org.apache.spark:spark-streaming-kafka-0-10_2.12` = 2.4.3 (for Kafka integration)
  • `org.apache.spark:spark-streaming-flume_2.11` = 2.4.3 (for Flume integration)
  • `redis.clients:jedis` = 2.9.0 (for Redis output)
  • `com.thoughtworks.paranamer:paranamer` = 2.8

Environment Variables

  • `SPARK_HOME` = Spark installation directory
  • `PATH` includes `$SPARK_HOME/bin` and `$SPARK_HOME/sbin`

Credentials

No API credentials required for Spark itself. For Hadoop integration:

  • `HADOOP_HOME` and `HADOOP_CONF_DIR` must be set for YARN deployment mode.
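A sketch of the exports for YARN deployment; the `/opt/hadoop` path is an assumption, so point it at your actual Hadoop installation:

```shell
# Hypothetical install location -- adjust to your cluster
export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
```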

Quick Install

```shell
# Download Spark 2.4.3 pre-built for Hadoop 2.6
wget https://archive.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.6.tgz
tar -xzf spark-2.4.3-bin-hadoop2.6.tgz -C /opt/

# Configure environment
export SPARK_HOME=/opt/spark-2.4.3-bin-hadoop2.6
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
```
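To verify the install, a quick sanity check from `spark-shell`, which pre-binds the `spark` SparkSession:

```scala
// Run inside spark-shell
val df = spark.range(0, 10).toDF("n")
println(df.count())    // 10
println(spark.version) // "2.4.3" for this tarball
```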

Code Evidence

Spark Streaming dependency from `spark-streaming-basis/pom.xml`:

```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.12</artifactId>
    <version>2.4.3</version>
</dependency>
```

Spark Kafka integration from `spark-streaming-kafka/pom.xml`:

```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
    <version>2.4.3</version>
</dependency>
```

Common Errors

| Error Message | Cause | Solution |
|---------------|-------|----------|
| `Task not serializable` | Non-serializable object captured in a closure | Use static lazy initialization for connection pools (e.g., Jedis) |
| `local-1234 is not enough` | Only one thread available with `local[1]` | Use `local[2]` or higher for Spark Streaming (one thread for the receiver plus N for processing) |
| `Application JAR not found` | JAR not accessible on the cluster | Ensure the JAR is on HDFS or at the same local path on all nodes |
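The lazy-initialization fix for `Task not serializable` can be sketched as follows. The Redis host, key name, and the `dstream` variable (any DStream of records) are assumptions for illustration:

```scala
import redis.clients.jedis.JedisPool

// Held in a singleton object: the pool is created lazily, once per
// executor JVM, instead of being serialized into the task closure.
object RedisConnection {
  lazy val pool = new JedisPool("localhost", 6379)
}

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val jedis = RedisConnection.pool.getResource
    try records.foreach(r => jedis.lpush("results", r.toString))
    finally jedis.close() // returns the connection to the pool
  }
}
```

Opening one connection per partition (rather than per record) is the usual `foreachRDD` pattern, since the connection object itself can never be shipped from driver to executors.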

Compatibility Notes

  • YARN deployment: Both YARN and HDFS must be running; Spark uses HDFS for temporary files.
  • Spark Streaming exclusion: Exclude `spark-streaming` from uber-JAR when deploying to cluster (already in Spark's `jars/` directory).
  • Spark Standalone HA: Include all master addresses in URL format: `spark://HOST1:PORT1,HOST2:PORT2`.
  • Worker memory: Default is "all available memory - 1GB"; configure explicitly for production.
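The `spark-streaming` exclusion mentioned above is typically achieved with Maven's `provided` scope, so the uber-JAR does not bundle classes already present in the cluster's `jars/` directory:

```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.12</artifactId>
    <version>2.4.3</version>
    <scope>provided</scope>
</dependency>
```

Integration artifacts such as `spark-streaming-kafka-0-10` are not shipped with Spark and must still be bundled at the default `compile` scope.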
