Heuristic:Apache Spark Serialization Optimization

From Leeroopedia



Knowledge Sources
Domains Optimization, Serialization
Last Updated 2026-02-08 22:00 GMT

Overview

Use Kryo serialization instead of the default Java serialization for up to a 10x speed improvement and a more compact serialized representation in network-intensive Spark applications.

Description

Spark supports two serialization frameworks: Java serialization (default for most operations) and Kryo serialization. Kryo is significantly faster (up to 10x) and produces more compact output, making it critical for applications that shuffle large amounts of data over the network. Since Spark 2.0, Kryo is the default for shuffling simple types (primitives, strings, arrays), but the full Kryo serializer is not the global default because it requires explicit class registration for best performance. Unregistered classes still work but waste bytes storing full class names.
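The switch described above amounts to two configuration calls. A minimal sketch (the `UserEvent` and `SessionSummary` classes are hypothetical stand-ins for your own domain types):

```scala
import org.apache.spark.SparkConf

// Hypothetical domain classes that will cross the wire during shuffles.
case class UserEvent(userId: Long, action: String)
case class SessionSummary(sessionId: String, events: Int)

val conf = new SparkConf()
  .setAppName("kryo-example")
  // Replace the default JavaSerializer globally.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registration maps each class to a small integer ID; unregistered
  // classes still serialize, but carry their fully qualified class name.
  .registerKryoClasses(Array(classOf[UserEvent], classOf[SessionSummary]))
```

Registration is worth the small amount of boilerplate: without it, every serialized instance pays for its full class name, eroding the size advantage.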

Usage

Use this heuristic when your Spark application is network-bound (e.g., large shuffle operations in joins, `groupBy`, or `reduceByKey`), when serialized task size is large (check the driver logs for task sizes over about 20 KiB), or when you want to reduce the memory footprint of cached RDDs stored in serialized form.

The Insight (Rule of Thumb)

  • Action: Set `spark.serializer` to `org.apache.spark.serializer.KryoSerializer` and register your custom classes.
  • Value: `conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")` plus `conf.registerKryoClasses(Array(classOf[MyClass]))`.
  • Trade-off: Requires upfront class registration for maximum benefit. Unregistered classes work but include full class names, reducing the size advantage.
  • Buffer Sizing: If serialization fails with a buffer overflow error, increase `spark.kryoserializer.buffer` (default 64 KiB). The buffer grows on demand up to `spark.kryoserializer.buffer.max` (default 64 MiB), which must exceed the largest single object you serialize.
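Putting the rules of thumb above into one configuration sketch (the buffer values are illustrative, not recommendations):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Initial per-core buffer; grows on demand up to the ceiling below.
  .set("spark.kryoserializer.buffer", "512k")
  // Hard ceiling; must exceed the largest single object you serialize,
  // or Kryo fails with a "buffer overflow" exception.
  .set("spark.kryoserializer.buffer.max", "128m")
```

Raising the ceiling costs nothing until a large object actually needs it, so a generous `buffer.max` is a cheap safeguard.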

Reasoning

Java serialization is flexible (any `Serializable` class works automatically) but produces verbose output that includes full class metadata, field descriptors, and reference tracking. Kryo strips this overhead by using compact integer class IDs (when registered) and optimized encoding for common types. The 10x speed difference comes from both smaller serialized sizes (less I/O) and faster encode/decode paths.
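The metadata overhead described above is visible with plain JDK serialization, no Spark required. A sketch (exact byte counts vary by JVM and class shape):

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// A small case class; Java serialization embeds its full class
// descriptor (name, serialVersionUID, field metadata) in the output.
case class Point(x: Double, y: Double)

object SerializedSizeDemo {
  // Serialize an object with java.io.ObjectOutputStream and
  // return the number of bytes produced.
  def javaSerializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.size()
  }

  def main(args: Array[String]): Unit = {
    // The payload is two doubles = 16 bytes, but the serialized form
    // is several times larger because of the embedded class descriptor.
    val size = javaSerializedSize(Point(1.0, 2.0))
    println(s"payload bytes: 16, java-serialized bytes: $size")
  }
}
```

Kryo with a registered class replaces that descriptor with a small integer ID, which is where much of the size saving comes from.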

For broadcast variables, the benefit compounds: a large lookup table sent to all executors is serialized once by the driver and deserialized on each executor. With Kryo, both the network transfer and deserialization are dramatically faster.

Tasks larger than 20KB in serialized form are a signal that the closure captures large objects. This should be addressed with broadcast variables regardless of serializer choice, but Kryo reduces the baseline overhead.
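The closure-capture fix reads as follows, assuming a running `SparkSession` (the lookup table and sizes are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("broadcast-example").getOrCreate()
val sc = spark.sparkContext

// A large lookup table built on the driver.
val lookup: Map[Long, String] =
  (1L to 100000L).map(i => i -> s"label-$i").toMap

val rdd = sc.parallelize(1L to 1000L)

// BAD: referencing `lookup` directly in the closure ships a copy
// of the whole map inside every serialized task.
// val labeled = rdd.map(id => lookup.getOrElse(id, "unknown"))

// GOOD: broadcast once; each executor deserializes a single copy
// that all its tasks share.
val lookupBc = sc.broadcast(lookup)
val labeled = rdd.map(id => lookupBc.value.getOrElse(id, "unknown"))
```

With Kryo enabled, the one-time broadcast transfer itself also shrinks, which is the compounding benefit mentioned above.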

Code Evidence

From `docs/tuning.md:52-84`:

Kryo Serialization:
- 10x faster and more compact than Java serialization
- Requires class registration for best performance
- Default for shuffling simple types since Spark 2.0.0
- Not default overall due to registration requirement

Configuration:
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))

spark.kryoserializer.buffer must hold largest serialized object

Broadcast variable threshold from `docs/tuning.md:287-295`:

Use SparkContext.broadcast() for large objects used in tasks
- Tasks > 20 KiB probably worth optimizing with broadcast
- Check serialized task size in master logs
- Reduces serialized task size and job launch cost
