
Environment: Heibaiying BigData Notes Spark 2.4 Environment

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Batch_Processing, Stream_Processing
Last Updated: 2026-02-10 10:00 GMT

Overview

Apache Spark 2.4.x environment with Scala 2.11/2.12, supporting Spark SQL, Structured API, and Spark Streaming for batch and stream data processing.

Description

This environment provides Apache Spark 2.4.x (versions 2.4.0 and 2.4.3 are used across modules) with both Scala 2.11 and 2.12 binary variants. It includes Spark SQL for structured data analysis, the Structured API for DataFrame/Dataset operations, and Spark Streaming for near-real-time (micro-batch) stream processing. Integration modules support Kafka (`spark-streaming-kafka-0-10_2.12:2.4.3`) and Flume (`spark-streaming-flume_2.11:2.4.3`).
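As a quick illustration of the Structured API in this environment, here is a minimal Scala sketch; the input path, column names, and app name are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

// Spark 2.4.x entry point for the Structured API
val spark = SparkSession.builder()
  .appName("structured-api-sketch")
  .master("local[*]")          // local mode; use a cluster URL in production
  .getOrCreate()
import spark.implicits._

// Hypothetical input file and columns
val df = spark.read.json("/tmp/events.json")
df.filter($"status" === "ok")
  .groupBy("user")
  .count()
  .show()
```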

Usage

Use this environment for Spark SQL Data Analysis and Spark Streaming operations. It is the mandatory prerequisite for the Spark SQL Data Analysis workflow and all Spark Streaming examples (basic, Kafka, Flume integrations).
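For the streaming side, a minimal Scala sketch of a socket word count; the host and port are placeholders. Note `local[2]`: with a receiver-based source, one thread is consumed by the receiver, so at least one more is needed for processing:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("streaming-sketch")
  .setMaster("local[2]")  // >= 2 threads: one for the receiver, one for processing

val ssc = new StreamingContext(conf, Seconds(5))

// Hypothetical source: text lines arriving on a socket
val lines = ssc.socketTextStream("localhost", 9999)
lines.flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
  .print()

ssc.start()
ssc.awaitTermination()
```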

System Requirements

| Category | Requirement | Notes |
|----------|-------------|-------|
| OS | Linux (CentOS 7.6 recommended) | Any Linux with JDK 8 |
| Java | JDK 1.8 | Required by Spark 2.4 |
| Scala | 2.11 or 2.12 | Depending on module variant |
| Hardware | Minimum 2GB RAM per Worker | Default allocates "all memory - 1GB" |
| Disk | 20GB+ | For shuffle data and temp files |

Dependencies

System Packages

  • `spark` = 2.4.3 (pre-built for Hadoop 2.6)
  • `java-1.8.0-openjdk-devel`
  • `scala` = 2.12.8 (optional, for Scala shell)

Java/Scala Packages (Maven)

  • `org.apache.spark:spark-streaming_2.12` = 2.4.3
  • `org.apache.spark:spark-streaming-kafka-0-10_2.12` = 2.4.3 (for Kafka integration)
  • `org.apache.spark:spark-streaming-flume_2.11` = 2.4.3 (for Flume integration)
  • `redis.clients:jedis` = 2.9.0 (for Redis output)
  • `com.thoughtworks.paranamer:paranamer` = 2.8

Environment Variables

  • `SPARK_HOME` = Spark installation directory
  • `PATH` includes `$SPARK_HOME/bin` and `$SPARK_HOME/sbin`

Credentials

No API credentials required for Spark itself. For Hadoop integration:

  • `HADOOP_HOME` and `HADOOP_CONF_DIR` must be set for YARN deployment mode.
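A sketch of the exports for YARN deployment; the `/opt/hadoop` path is an assumption, so point it at your actual Hadoop installation:

```shell
# Hypothetical install location -- adjust to your cluster
export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
```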

Quick Install

```shell
# Download Spark 2.4.3 pre-built for Hadoop 2.6
wget https://archive.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.6.tgz
tar -xzf spark-2.4.3-bin-hadoop2.6.tgz -C /opt/

# Configure environment
export SPARK_HOME=/opt/spark-2.4.3-bin-hadoop2.6
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
```
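To verify the install, a quick sanity check from `spark-shell`, which pre-binds the `spark` SparkSession:

```scala
// Run inside spark-shell
val df = spark.range(0, 10).toDF("n")
println(df.count())    // 10
println(spark.version) // "2.4.3" for this tarball
```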

Code Evidence

Spark Streaming dependency from `spark-streaming-basis/pom.xml`:

```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.12</artifactId>
    <version>2.4.3</version>
</dependency>
```

Spark Kafka integration from `spark-streaming-kafka/pom.xml`:

```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
    <version>2.4.3</version>
</dependency>
```

Common Errors

| Error Message | Cause | Solution |
|---------------|-------|----------|
| `Task not serializable` | Non-serializable object captured in a closure | Use static lazy initialization for connection pools (e.g., Jedis) |
| `local-1234 is not enough` | Only one thread available with `local[1]` | Use `local[2]` or higher for Spark Streaming (one thread for the receiver plus N for processing) |
| `Application JAR not found` | JAR not accessible on the cluster | Ensure the JAR is on HDFS or at the same local path on all nodes |
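The lazy-initialization fix for `Task not serializable` can be sketched as follows. The Redis host, key name, and the `dstream` variable (any DStream of records) are assumptions for illustration:

```scala
import redis.clients.jedis.JedisPool

// Held in a singleton object: the pool is created lazily, once per
// executor JVM, instead of being serialized into the task closure.
object RedisConnection {
  lazy val pool = new JedisPool("localhost", 6379)
}

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val jedis = RedisConnection.pool.getResource
    try records.foreach(r => jedis.lpush("results", r.toString))
    finally jedis.close() // returns the connection to the pool
  }
}
```

Opening one connection per partition (rather than per record) is the usual `foreachRDD` pattern, since the connection object itself can never be shipped from driver to executors.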

Compatibility Notes

  • YARN deployment: Both YARN and HDFS must be running; Spark uses HDFS for temporary files.
  • Spark Streaming exclusion: Exclude `spark-streaming` from uber-JAR when deploying to cluster (already in Spark's `jars/` directory).
  • Spark Standalone HA: Include all master addresses in URL format: `spark://HOST1:PORT1,HOST2:PORT2`.
  • Worker memory: Default is "all available memory - 1GB"; configure explicitly for production.
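The `spark-streaming` exclusion mentioned above is typically achieved with Maven's `provided` scope, so the uber-JAR does not bundle classes already present in the cluster's `jars/` directory:

```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.12</artifactId>
    <version>2.4.3</version>
    <scope>provided</scope>
</dependency>
```

Integration artifacts such as `spark-streaming-kafka-0-10` are not shipped with Spark and must still be bundled at the default `compile` scope.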
