Principle: Microsoft LoRA Beam Search Generation
Overview
Beam Search Generation is the principle of using beam search decoding to generate high-quality text from a fine-tuned GPT-2 LoRA model. Beam search maintains multiple candidate sequences (beams) at each decoding step, expanding and pruning them based on cumulative log-probabilities, to find sequences that are more likely than those produced by greedy decoding.
Description
Beam Search Algorithm
Beam search is a breadth-limited search over the space of possible token sequences. At each decoding step:
- The model computes a probability distribution over the vocabulary for each active beam.
- Each beam is expanded by considering the top-scoring next tokens.
- The resulting beam_size * vocab_size candidates are scored and the top beam_size are retained.
- Beams that generate an EOS token are added to a completed hypotheses pool.
- The search terminates when the maximum generation length is reached or all beams have completed.
The final output is the highest-scoring completed hypothesis, selected by comparing length-normalized log-probability scores.
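The loop above can be sketched in a few dozen lines. This is a minimal illustration over a toy next-token model, not the microsoft/LoRA implementation; the names `toy_log_probs`, `beam_search`, and the four-token vocabulary are all assumptions made for the example.

```python
import math

VOCAB = ["<eos>", "a", "b", "c"]
EOS = 0

def toy_log_probs(seq):
    """Deterministic toy distribution: favors the token after the last one."""
    last = seq[-1] if seq else 1
    scores = [0.1] * len(VOCAB)
    scores[(last + 1) % len(VOCAB)] = 1.0
    total = sum(math.exp(s) for s in scores)  # log-softmax normalizer
    return [s - math.log(total) for s in scores]

def beam_search(beam_size=2, max_len=5, length_penalty=1.0):
    beams = [([], 0.0)]          # (token sequence, cumulative log-prob)
    completed = []
    for _ in range(max_len):
        # Expand every active beam with every vocabulary token.
        candidates = []
        for seq, score in beams:
            for tok, lp in enumerate(toy_log_probs(seq)):
                candidates.append((seq + [tok], score + lp))
        # Score all beam_size * vocab_size candidates; keep the best.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates:
            if seq[-1] == EOS:
                # Finished hypotheses move to the completed pool,
                # scored by length-normalized log-probability.
                completed.append((seq, score / len(seq) ** length_penalty))
            elif len(beams) < beam_size:
                beams.append((seq, score))
        if not beams:
            break
    # Fall back to unfinished beams if nothing reached EOS.
    completed += [(s, sc / len(s) ** length_penalty) for s, sc in beams]
    return max(completed, key=lambda c: c[1])[0]
```

Note that completed hypotheses are compared by length-normalized score, so a short finished sequence does not automatically beat a longer, higher-quality one.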
Length Penalty
Raw beam search is biased toward shorter sequences because log-probabilities are always negative and longer sequences accumulate more negative values. The length penalty corrects this bias by dividing the beam score by sequence_length ^ length_penalty:
score = sum(log_probs) / length ^ length_penalty
- length_penalty = 1.0 normalizes by length (default).
- length_penalty > 1.0 favors longer sequences.
- length_penalty < 1.0 favors shorter sequences.
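The normalization is a one-line formula; the helper name `normalized_score` below is illustrative, not part of any library API:

```python
def normalized_score(log_probs, length_penalty=1.0):
    # score = sum(log_probs) / length ** length_penalty
    return sum(log_probs) / (len(log_probs) ** length_penalty)

# With length_penalty = 1.0, a longer hypothesis with the same
# per-token quality is no longer penalized for its length:
short = [-1.0, -1.0]        # sum = -2.0, normalized = -1.0
long = [-1.0, -1.0, -1.0]   # sum = -3.0, normalized = -1.0
assert normalized_score(short) == normalized_score(long) == -1.0
```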
Repetition Penalty
Repetition penalty (from the CTRL paper, Keskar et al., 2019) discourages the model from repeating tokens that have already appeared in the generated sequence. For each previously generated token:
- If its log-probability is negative, it is multiplied by the repetition penalty factor (making it more negative, hence less likely).
- If its log-probability is positive, it is divided by the repetition penalty factor.
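The two cases can be sketched as a single pass over the log-probability vector; the function name `apply_repetition_penalty` is an assumption for this example, not the repository's API:

```python
def apply_repetition_penalty(log_probs, generated_tokens, penalty=1.2):
    """CTRL-style repetition penalty over a log-prob vector."""
    out = list(log_probs)
    for tok in set(generated_tokens):
        if out[tok] < 0:
            out[tok] *= penalty   # more negative => less likely
        else:
            out[tok] /= penalty   # less positive => less likely
    return out
```

Both branches push the score of previously generated tokens downward, so the penalty is monotonic regardless of the sign of the log-probability.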
No-Repeat N-gram Blocking
A stricter form of repetition control, no-repeat n-gram blocking completely bans the generation of any n-gram that has already appeared in the output. At each step, the algorithm:
- Extracts all n-grams of size no_repeat_ngram_size from the generated history.
- For each beam, identifies the (n-1)-gram ending at the current position.
- Sets the probability of any token that would complete a previously seen n-gram to -infinity.
This is particularly effective for NLG tasks such as table-to-text or triple-to-text generation, which can otherwise produce degenerate repetitive outputs.
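The blocking step above reduces to computing, for the current (n-1)-gram prefix, the set of tokens that would complete an already-seen n-gram. This sketch is illustrative; `banned_tokens` is an assumed helper name, not the microsoft/LoRA API:

```python
import math

def banned_tokens(generated, n):
    """Tokens that would complete an n-gram already present in `generated`."""
    if len(generated) < n - 1:
        return set()
    # Map each seen (n-1)-gram prefix to the tokens that followed it.
    seen = {}
    for i in range(len(generated) - n + 1):
        prefix = tuple(generated[i:i + n - 1])
        seen.setdefault(prefix, set()).add(generated[i + n - 1])
    # The prefix ending at the current position decides what is banned.
    current_prefix = tuple(generated[len(generated) - n + 1:])
    return seen.get(current_prefix, set())

def block_repeated_ngrams(log_probs, generated, n):
    """Set banned tokens' scores to -infinity before beam expansion."""
    out = list(log_probs)
    for tok in banned_tokens(generated, n):
        out[tok] = -math.inf
    return out
```

For example, with history `[1, 2, 3, 1, 2]` and `n = 3`, the trigram `(1, 2, 3)` has already occurred, so token `3` is banned at the next step.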
Minimum Length Enforcement
To prevent premature termination, the minimum length parameter sets the EOS token probability to -infinity for all steps before the minimum length is reached, forcing the model to generate at least that many tokens.
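This enforcement is a simple mask on the EOS score; the helper name `enforce_min_length` is an assumption for this sketch:

```python
import math

def enforce_min_length(log_probs, cur_len, min_length, eos_token_id):
    """Ban EOS until at least `min_length` tokens have been generated."""
    out = list(log_probs)
    if cur_len < min_length:
        out[eos_token_id] = -math.inf
    return out
```

Because -infinity can never win the top-k selection, no beam can terminate before the minimum length is reached.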
Theoretical Basis
Beam search approximates the MAP (Maximum A Posteriori) decoding objective:
y* = argmax_y P(y | x) = argmax_y sum_t log P(y_t | y_{<t}, x)
Exact MAP decoding is intractable because the search space grows exponentially with sequence length (|V|^T for vocabulary size |V| and length T). Beam search with beam size B reduces this to O(B * |V| * T) by keeping only the top-B candidates at each step. While beam search does not guarantee finding the global optimum, it consistently produces higher-quality outputs than greedy decoding (B=1) for structured NLG tasks.
Metadata
| Field | Value |
|---|---|
| Source | microsoft/LoRA |
| Domains | Generation, NLG |
| Type | External Tool Doc |
| Last Updated | 2026-02-10 |