Principle:Apache Flink Locality Aware Split Assignment
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, Performance_Optimization |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A data-locality-aware scheduling strategy that preferentially assigns file splits to reader tasks co-located with the data blocks on the same physical host.
Description
Locality Aware Split Assignment improves read performance by minimizing network data transfer. When a reader requests a split, the assigner first checks whether any unassigned split has data blocks hosted on the requesting reader's machine. If such a local split exists, it is assigned first. If no local split is available, a remote split is assigned using a round-robin strategy to keep the work distribution balanced.
This principle is particularly important for HDFS-based deployments where data locality can significantly reduce network I/O and improve throughput.
Usage
Use this principle when reading from distributed filesystems (HDFS, S3 with locality hints) where minimizing network transfer matters. For local filesystems or cloud storage without locality information, no split ever matches the requesting host, so every request takes the remote path and locality-aware assignment degrades gracefully to round-robin.
Theoretical Basis
// Abstract algorithm
function assignSplit(requestingHost):
    if requestingHost is null:
        return getRemoteSplit()              // round-robin fallback
    localSplits = findSplitsOnHost(requestingHost)
    if localSplits is not empty:
        return localSplits.getMinLocalCount() // least-assigned local split
    return getRemoteSplit()                  // fallback to remote
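
The abstract algorithm can be sketched in plain Java as follows. This is a minimal illustration, not Flink's actual assigner: the class `LocalityAwareAssignerSketch`, its nested `Split` type, and the method names are hypothetical. Two simplifying assumptions are made: the "min local count" choice is approximated by picking the local split stored on the fewest hosts, and the remote fallback hands out pending splits in arrival order as a simple stand-in for the round-robin strategy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Hypothetical sketch of locality-aware split assignment (not Flink's implementation). */
final class LocalityAwareAssignerSketch {

    /** A file split together with the hosts that store its data blocks. */
    static final class Split {
        final String path;
        final Set<String> hosts;

        Split(String path, String... hosts) {
            this.path = path;
            this.hosts = new HashSet<>(Arrays.asList(hosts));
        }
    }

    // Splits that have not been handed to any reader yet, in arrival order.
    private final List<Split> unassigned;

    LocalityAwareAssignerSketch(List<Split> splits) {
        this.unassigned = new ArrayList<>(splits);
    }

    /** Returns the next split for the given host, or empty when all splits are assigned. */
    Optional<Split> assignSplit(String requestingHost) {
        if (unassigned.isEmpty()) {
            return Optional.empty();
        }
        Split chosen = null;
        if (requestingHost != null) {
            // Assumption: among local splits, prefer the one stored on the fewest
            // hosts, since few other readers could read it without network transfer.
            for (Split s : unassigned) {
                boolean local = s.hosts.contains(requestingHost);
                if (local && (chosen == null || s.hosts.size() < chosen.hosts.size())) {
                    chosen = s;
                }
            }
        }
        if (chosen == null) {
            // Remote fallback (also used for an unknown host): oldest pending split.
            chosen = unassigned.get(0);
        }
        unassigned.remove(chosen);
        return Optional.of(chosen);
    }
}
```

With splits `part-0` on `hostA` and `hostB`, `part-1` on `hostB` only, and `part-2` on `hostC`, a request from `hostB` receives `part-1` under these assumptions, because it is the local split with the fewest hosting machines. A later request from a host that stores none of the remaining data falls through to the remote path and receives `part-0`.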