

Heuristic:FlagOpen FlagEmbedding Length Sorted Batching

From Leeroopedia





Knowledge Sources
Domains: Optimization, Inference
Last Updated: 2026-02-09 21:00 GMT

Overview

Inference optimization that sorts input sequences by descending length before batching to minimize padding overhead.

Description

Before encoding, all tokenized inputs are sorted by sequence length in descending order and then split into consecutive batches, so each batch holds sequences of similar length. Because every sequence in a batch is padded with pad tokens up to the length of the longest one, grouping similar-length sequences reduces the total number of padding tokens. After encoding, the embeddings are re-ordered to match the original input order using the inverse permutation.
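The full sort, batch, encode, restore cycle described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `encode_length_sorted`, `toy_encode_batch`, and the `batch_size` default are hypothetical names and values introduced here, and the "model" simply returns each sequence's length so the ordering is easy to check.

```python
import numpy as np


def encode_length_sorted(all_inputs, encode_batch, batch_size=2):
    """Sort by descending length, encode in batches, restore input order.

    `encode_batch` is a hypothetical callable that maps a list of tokenized
    inputs to a 2-D array with one embedding row per input.
    """
    # Negating the lengths makes argsort return the longest inputs first.
    length_sorted_idx = np.argsort([-len(x["input_ids"]) for x in all_inputs])
    all_inputs_sorted = [all_inputs[i] for i in length_sorted_idx]

    chunks = []
    for start in range(0, len(all_inputs_sorted), batch_size):
        batch = all_inputs_sorted[start:start + batch_size]
        chunks.append(encode_batch(batch))
    all_embeddings = np.concatenate(chunks, axis=0)

    # argsort of a permutation is its inverse, so this undoes the sort.
    return all_embeddings[np.argsort(length_sorted_idx)]


def toy_encode_batch(batch):
    # Stand-in for a model: the "embedding" is just the sequence length.
    return np.array([[float(len(x["input_ids"]))] for x in batch])


inputs = [{"input_ids": [0] * n} for n in (10, 100, 15, 95)]
embeddings = encode_length_sorted(inputs, toy_encode_batch)
print(embeddings.ravel().tolist())  # prints [10.0, 100.0, 15.0, 95.0]
```

The output rows line up with the original input order even though the batches were formed from the sorted list.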

Usage

This heuristic is automatically applied during all inference calls (`encode()` and `compute_score()` methods). It activates whenever batch processing is used and provides the most benefit when:

  • Input sequences have high variance in length (e.g., mix of short queries and long passages)
  • Batch sizes are large
  • Models are sensitive to padding (encoder-only models with mean pooling)

The Insight (Rule of Thumb)

  • Action: Sort all tokenized inputs by `len(input_ids)` descending before creating batches.
  • Value: Reduces wasted computation on padding tokens. For heterogeneous-length inputs, this can reduce total tokens processed by roughly 30-50%; the actual saving depends on the length distribution and the batch size.
  • Trade-off: Requires O(n log n) sorting overhead and storing the permutation index for re-ordering. Both are negligible compared to model forward pass cost.

Reasoning

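Consider four sequences of length [10, 100, 15, 95] encoded with a batch size of 2. Without sorting, the batches are [10, 100] and [15, 95], padded to lengths 100 and 95, so the model processes 2×100 + 2×95 = 390 tokens, of which only 220 are real. With sorting, the order becomes [100, 95, 15, 10] and the batches are [100, 95] and [15, 10], padded to 100 and 15, for 2×100 + 2×15 = 230 tokens. Sorting places similar lengths in the same batch, which is what removes most of the padding. (If all four sequences fit in a single batch, sorting changes nothing: the batch is padded to 100 either way.)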
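To make the arithmetic concrete, the sketch below counts the tokens processed for the example lengths, with a batch size of 2 chosen for illustration. `padded_tokens` is a hypothetical helper written for this page; it assumes each batch is padded to its own longest sequence.

```python
def padded_tokens(lengths, batch_size):
    """Total tokens processed when each batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total


lengths = [10, 100, 15, 95]
unsorted_total = padded_tokens(lengths, batch_size=2)
sorted_total = padded_tokens(sorted(lengths, reverse=True), batch_size=2)
real_total = sum(lengths)

print(unsorted_total, sorted_total, real_total)  # prints 390 230 220
```

With `batch_size=4` both totals are 400, which shows that the gain comes from how sequences are distributed across batches, not from the ordering inside one batch.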

The embedder performs the sort as shown below. Rerankers apply the same sort to the combined query+passage length, since both are concatenated before encoding:

# From FlagEmbedding/inference/embedder/encoder_only/base.py:222-224
# sort by length for less padding
length_sorted_idx = np.argsort([-len(x['input_ids']) for x in all_inputs])
all_inputs_sorted = [all_inputs[i] for i in length_sorted_idx]
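For the reranker case, the same pattern can be sketched on query/passage pairs. This is an illustration under stated assumptions rather than the reranker's actual code: the `pairs` data is made up, and the combined input is modeled as a plain concatenation of token ids with special tokens omitted.

```python
import numpy as np

# Hypothetical pre-tokenized (query_ids, passage_ids) pairs.
pairs = [
    ([1, 2, 3], [4] * 50),
    ([1, 2], [4] * 5),
    ([1] * 8, [4] * 120),
]

# Assumption: the model input is the concatenation of query and passage ids
# (special tokens omitted for brevity), so its length is the sum of both.
combined = [{"input_ids": q + p} for q, p in pairs]

# Same descending-length sort as the embedder, applied to the combined input.
length_sorted_idx = np.argsort([-len(x["input_ids"]) for x in combined])
combined_sorted = [combined[i] for i in length_sorted_idx]

print(length_sorted_idx.tolist())                      # prints [2, 0, 1]
print([len(x["input_ids"]) for x in combined_sorted])  # prints [128, 53, 7]
```

Sorting on the combined length matters because a short query paired with a long passage still produces a long model input.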

After encoding, the original order is restored:

# From FlagEmbedding/inference/embedder/encoder_only/base.py:271
# adjust the order of embeddings
all_embeddings = all_embeddings[np.argsort(length_sorted_idx)]
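The restore step relies on the fact that `np.argsort` applied to a permutation returns its inverse. The sketch below checks this on a small, made-up index array and also shows an equivalent scatter-based construction that avoids the second sort; the variable names other than `length_sorted_idx` are introduced here for illustration.

```python
import numpy as np

# Permutation produced by sorting lengths [10, 100, 15, 95] in descending order.
length_sorted_idx = np.array([1, 3, 2, 0])
inverse_idx = np.argsort(length_sorted_idx)

# Alternative O(n) construction of the same inverse via scatter assignment.
scatter_idx = np.empty_like(length_sorted_idx)
scatter_idx[length_sorted_idx] = np.arange(len(length_sorted_idx))

original = np.array(["a", "b", "c", "d"])
restored = original[length_sorted_idx][inverse_idx]

print(inverse_idx.tolist())  # prints [3, 0, 2, 1]
print(restored.tolist())     # prints ['a', 'b', 'c', 'd']
```

Either form works; the `argsort` version is shorter, and its extra O(n log n) cost is small next to the model forward pass.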
