Heuristic: Apache Spark Serialization Optimization
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Serialization |
| Last Updated | 2026-02-08 22:00 GMT |
Overview
Use Kryo serialization instead of Java serialization for up to a 10x speed improvement and a more compact representation in network-intensive Spark applications.
Description
Spark supports two serialization frameworks: Java serialization (default for most operations) and Kryo serialization. Kryo is significantly faster (up to 10x) and produces more compact output, making it critical for applications that shuffle large amounts of data over the network. Since Spark 2.0, Kryo is the default for shuffling simple types (primitives, strings, arrays), but the full Kryo serializer is not the global default because it requires explicit class registration for best performance. Unregistered classes still work but waste bytes storing full class names.
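A minimal sketch of the opt-in described above, assuming a plain `SparkConf`; `UserProfile` and `PageView` are hypothetical placeholders for your application's classes:

```scala
import org.apache.spark.SparkConf

// Sketch: enable Kryo globally and register application classes.
// `UserProfile` and `PageView` are hypothetical application classes.
val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registered classes serialize with compact integer IDs instead of
  // full class-name strings.
  .registerKryoClasses(Array(classOf[UserProfile], classOf[PageView]))

// Optional: fail fast if any class is used without registration,
// so the size advantage is never silently lost.
conf.set("spark.kryo.registrationRequired", "true")
```

Setting `spark.kryo.registrationRequired` turns the "unregistered classes still work but waste bytes" failure mode into a hard error, which is useful while auditing what actually crosses the wire.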
Usage
Use this heuristic when your Spark application is network-bound (e.g., large shuffle operations in joins, groupBy, or reduceByKey), when serialized task sizes are large (check the driver logs for tasks over 20 KiB), or when you want to reduce the memory footprint of cached RDDs stored in serialized form.
The Insight (Rule of Thumb)
- Action: Set `spark.serializer` to `org.apache.spark.serializer.KryoSerializer` and register your custom classes.
- Value: `conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")` plus `conf.registerKryoClasses(Array(classOf[MyClass]))`.
- Trade-off: Requires upfront class registration for maximum benefit. Unregistered classes work but include full class names, reducing the size advantage.
- Buffer Sizing: If serialization fails with a buffer-overflow error, increase `spark.kryoserializer.buffer` (default 64 KiB); the ceiling is `spark.kryoserializer.buffer.max` (default 64 MiB).
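The buffer knobs above can be sketched as a config fragment; the values are illustrative, not recommendations, and `conf` is assumed to be an existing `SparkConf`:

```scala
// Sketch: widen the Kryo buffer when large objects overflow it.
// 256k / 128m are illustrative; size spark.kryoserializer.buffer.max
// so it can hold the largest single object you serialize.
conf.set("spark.kryoserializer.buffer", "256k")
conf.set("spark.kryoserializer.buffer.max", "128m")
```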
Reasoning
Java serialization is flexible (any `Serializable` class works automatically) but produces verbose output that includes full class metadata, field descriptors, and reference tracking. Kryo strips this overhead by using compact integer class IDs (when registered) and optimized encoding for common types. The 10x speed difference comes from both smaller serialized sizes (less I/O) and faster encode/decode paths.
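The metadata overhead is easy to observe with plain JDK serialization, no Spark required; a small sketch where `Point` is a throwaway class:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Sketch: measure Java-serialization overhead for a tiny payload.
// The stream carries a stream header, the full class name, and field
// descriptors, so a small object costs far more than its raw fields.
case class Point(x: Int, y: Int) extends Serializable

def javaSerializedSize(obj: AnyRef): Int = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(obj)
  out.close()
  bytes.size()
}

// 8 bytes of actual payload (two Ints) balloon to dozens of bytes.
println(javaSerializedSize(Point(1, 2)))
```

Kryo avoids most of this fixed overhead by replacing the class-name string with an integer ID for registered classes.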
For broadcast variables, the benefit compounds: a large lookup table sent to all executors is serialized once by the driver and deserialized on each executor. With Kryo, both the network transfer and deserialization are dramatically faster.
Tasks larger than 20 KiB in serialized form signal that the closure captures large objects. Address this with broadcast variables regardless of serializer choice; Kryo merely reduces the baseline overhead.
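A sketch of the broadcast fix, assuming an existing `SparkContext` `sc` and an RDD `events` of string keys; `loadLookupTable()` is a hypothetical loader:

```scala
// Sketch: ship the lookup table to executors once via broadcast,
// instead of capturing it in every task's serialized closure.
val lookup: Map[String, Int] = loadLookupTable()  // hypothetical loader
val lookupBc = sc.broadcast(lookup)

val enriched = events.map { key =>
  // .value is deserialized once per executor, not once per task.
  (key, lookupBc.value.getOrElse(key, -1))
}
```

With Kryo enabled, both the one-time broadcast transfer and the per-executor deserialization shrink, which is the compounding benefit described above.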
Code Evidence
From `docs/tuning.md:52-84`:
Kryo Serialization:
- 10x faster and more compact than Java serialization
- Requires class registration for best performance
- Default for shuffling simple types since Spark 2.0.0
- Not default overall due to registration requirement
Configuration:
```scala
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
```
`spark.kryoserializer.buffer` must hold the largest serialized object.
Broadcast variable threshold from `docs/tuning.md:287-295`:
Use SparkContext.broadcast() for large objects used in tasks
- Tasks > 20 KiB probably worth optimizing with broadcast
- Check serialized task size in master logs
- Reduces serialized task size and job launch cost