Principle: Microsoft LoRA Beam Search Generation
Overview
Beam Search Generation is the principle of using beam search decoding to generate high-quality text from a fine-tuned GPT-2 LoRA model. Beam search maintains multiple candidate sequences (beams) at each decoding step, expanding and pruning them based on cumulative log-probabilities, to find sequences that are more likely than those produced by greedy decoding.
Description
Beam Search Algorithm
Beam search is a breadth-limited search over the space of possible token sequences. At each decoding step:
- The model computes a probability distribution over the vocabulary for each active beam.
- Each beam is expanded by considering the top-scoring next tokens.
- The resulting beam_size * vocab_size candidates are scored and the top beam_size are retained.
- Beams that generate an EOS token are added to a completed hypotheses pool.
- The search terminates when the maximum generation length is reached or all beams have completed.
The final output is the highest-scoring completed hypothesis, selected by comparing length-normalized log-probability scores.
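The loop above can be sketched in a few dozen lines. This is a minimal illustration over a toy next-token model, not the microsoft/LoRA implementation; the names `toy_log_probs`, `beam_search`, and the four-token vocabulary are all assumptions made for the example.

```python
import math

VOCAB = ["<eos>", "a", "b", "c"]
EOS = 0

def toy_log_probs(seq):
    """Deterministic toy distribution: favors the token after the last one."""
    last = seq[-1] if seq else 1
    scores = [0.1] * len(VOCAB)
    scores[(last + 1) % len(VOCAB)] = 1.0
    total = sum(math.exp(s) for s in scores)  # log-softmax normalizer
    return [s - math.log(total) for s in scores]

def beam_search(beam_size=2, max_len=5, length_penalty=1.0):
    beams = [([], 0.0)]          # (token sequence, cumulative log-prob)
    completed = []
    for _ in range(max_len):
        # Expand every active beam with every vocabulary token.
        candidates = []
        for seq, score in beams:
            for tok, lp in enumerate(toy_log_probs(seq)):
                candidates.append((seq + [tok], score + lp))
        # Score all beam_size * vocab_size candidates; keep the best.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates:
            if seq[-1] == EOS:
                # Finished hypotheses move to the completed pool,
                # scored by length-normalized log-probability.
                completed.append((seq, score / len(seq) ** length_penalty))
            elif len(beams) < beam_size:
                beams.append((seq, score))
        if not beams:
            break
    # Fall back to unfinished beams if nothing reached EOS.
    completed += [(s, sc / len(s) ** length_penalty) for s, sc in beams]
    return max(completed, key=lambda c: c[1])[0]
```

Note that completed hypotheses are compared by length-normalized score, so a short finished sequence does not automatically beat a longer, higher-quality one.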
Length Penalty
Raw beam search is biased toward shorter sequences because log-probabilities are always negative and longer sequences accumulate more negative values. The length penalty corrects this bias by dividing the beam score by sequence_length ^ length_penalty:
score = sum(log_probs) / length ^ length_penalty
- length_penalty = 1.0 normalizes by length (default).
- length_penalty > 1.0 favors longer sequences.
- length_penalty < 1.0 favors shorter sequences.
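The normalization is a one-line formula; the helper name `normalized_score` below is illustrative, not part of any library API:

```python
def normalized_score(log_probs, length_penalty=1.0):
    # score = sum(log_probs) / length ** length_penalty
    return sum(log_probs) / (len(log_probs) ** length_penalty)

# With length_penalty = 1.0, a longer hypothesis with the same
# per-token quality is no longer penalized for its length:
short = [-1.0, -1.0]        # sum = -2.0, normalized = -1.0
long = [-1.0, -1.0, -1.0]   # sum = -3.0, normalized = -1.0
assert normalized_score(short) == normalized_score(long) == -1.0
```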
Repetition Penalty
Repetition penalty (from the CTRL paper, Keskar et al., 2019) discourages the model from repeating tokens that have already appeared in the generated sequence. For each previously generated token:
- If its log-probability is negative, it is multiplied by the repetition penalty factor (making it more negative, hence less likely).
- If its log-probability is positive, it is divided by the repetition penalty factor.
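The two cases can be sketched as a single pass over the log-probability vector; the function name `apply_repetition_penalty` is an assumption for this example, not the repository's API:

```python
def apply_repetition_penalty(log_probs, generated_tokens, penalty=1.2):
    """CTRL-style repetition penalty over a log-prob vector."""
    out = list(log_probs)
    for tok in set(generated_tokens):
        if out[tok] < 0:
            out[tok] *= penalty   # more negative => less likely
        else:
            out[tok] /= penalty   # less positive => less likely
    return out
```

Both branches push the score of previously generated tokens downward, so the penalty is monotonic regardless of the sign of the log-probability.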
No-Repeat N-gram Blocking
A stricter form of repetition control, no-repeat n-gram blocking completely bans the generation of any n-gram that has already appeared in the output. At each step, the algorithm:
- Extracts all n-grams of size no_repeat_ngram_size from the generated history.
- For each beam, identifies the (n-1)-gram ending at the current position.
- Sets the probability of any token that would complete a previously seen n-gram to -infinity.
This is particularly effective for NLG tasks such as table-to-text or triple-to-text generation, which can otherwise produce degenerate repetitive outputs.
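The blocking step above reduces to computing, for the current (n-1)-gram prefix, the set of tokens that would complete an already-seen n-gram. This sketch is illustrative; `banned_tokens` is an assumed helper name, not the microsoft/LoRA API:

```python
import math

def banned_tokens(generated, n):
    """Tokens that would complete an n-gram already present in `generated`."""
    if len(generated) < n - 1:
        return set()
    # Map each seen (n-1)-gram prefix to the tokens that followed it.
    seen = {}
    for i in range(len(generated) - n + 1):
        prefix = tuple(generated[i:i + n - 1])
        seen.setdefault(prefix, set()).add(generated[i + n - 1])
    # The prefix ending at the current position decides what is banned.
    current_prefix = tuple(generated[len(generated) - n + 1:])
    return seen.get(current_prefix, set())

def block_repeated_ngrams(log_probs, generated, n):
    """Set banned tokens' scores to -infinity before beam expansion."""
    out = list(log_probs)
    for tok in banned_tokens(generated, n):
        out[tok] = -math.inf
    return out
```

For example, with history `[1, 2, 3, 1, 2]` and `n = 3`, the trigram `(1, 2, 3)` has already occurred, so token `3` is banned at the next step.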
Minimum Length Enforcement
To prevent premature termination, the minimum length parameter sets the EOS token probability to -infinity for all steps before the minimum length is reached, forcing the model to generate at least that many tokens.
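This enforcement is a simple mask on the EOS score; the helper name `enforce_min_length` is an assumption for this sketch:

```python
import math

def enforce_min_length(log_probs, cur_len, min_length, eos_token_id):
    """Ban EOS until at least `min_length` tokens have been generated."""
    out = list(log_probs)
    if cur_len < min_length:
        out[eos_token_id] = -math.inf
    return out
```

Because -infinity can never win the top-k selection, no beam can terminate before the minimum length is reached.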
Theoretical Basis
Beam search approximates the MAP (Maximum A Posteriori) decoding objective:
y* = argmax_y P(y | x) = argmax_y sum_t log P(y_t | y_{<t}, x)
Exact MAP decoding is intractable because the search space grows exponentially with sequence length (|V|^T for vocabulary size |V| and length T). Beam search with beam size B reduces this to O(B * |V| * T) by keeping only the top-B candidates at each step. While beam search does not guarantee finding the global optimum, it consistently produces higher-quality outputs than greedy decoding (B=1) for structured NLG tasks.
Metadata
| Field | Value |
|---|---|
| Source | microsoft/LoRA |
| Domains | Generation, NLG |
| Type | External Tool Doc |
| Last Updated | 2026-02-10 |