Heuristic:Princeton nlp SimPO BOS Token Handling

From Leeroopedia




Knowledge Sources
Domains NLP, Tokenization, Debugging
Last Updated 2026-02-08 05:00 GMT

Overview

Ensure that every prompt carries exactly one BOS token during SimPO training and evaluation. A duplicated BOS token can silently degrade model performance and evaluation results.

Description

Llama-3's updated tokenizer (post-PR on HuggingFace), when combined with vLLM, occasionally places two BOS tokens at the start of a prompt, which can silently degrade evaluation results. The SimPO codebase handles BOS deduplication explicitly in two places: where the chat template is applied and in the trainer's tokenize_row method. When the chat template is applied, a leading BOS token is stripped from the chosen and rejected responses so that concatenating prompt and response does not duplicate it. During tokenization, BOS is prepended to the prompt only if the prompt is empty or does not already start with it.

Usage

Use this heuristic when applying chat templates for SimPO training data or when evaluating trained models on benchmarks like AlpacaEval 2 and Arena-Hard. Check BOS handling whenever switching tokenizer versions or models.
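
A quick sanity check for this situation can be sketched as follows. The helper names `count_leading_bos` and `has_single_bos` are hypothetical (they do not come from the SimPO codebase), and the BOS id used in the demo is only an illustrative value; in practice the id would be read from the tokenizer in use.

```python
from typing import Sequence


def count_leading_bos(input_ids: Sequence[int], bos_token_id: int) -> int:
    """Return how many consecutive BOS ids open the sequence."""
    count = 0
    for token_id in input_ids:
        if token_id != bos_token_id:
            break
        count += 1
    return count


def has_single_bos(input_ids: Sequence[int], bos_token_id: int) -> bool:
    """True only when the sequence starts with exactly one BOS id."""
    return count_leading_bos(input_ids, bos_token_id) == 1


# Illustrative BOS id; read the real one from tokenizer.bos_token_id.
BOS = 128000
print(has_single_bos([BOS, 15, 27], BOS))       # True
print(has_single_bos([BOS, BOS, 15, 27], BOS))  # False (duplicated BOS)
print(has_single_bos([15, 27], BOS))            # False (missing BOS)
```

Running a check like this on a handful of tokenized prompts after changing tokenizer version or model is a cheap way to catch both the duplicated and the missing case.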

The Insight (Rule of Thumb)

  • Action: Strip the BOS token from the start of chosen/rejected responses after applying the chat template, before concatenation with the prompt.
  • Action: In tokenize_row, prepend BOS to the prompt only if the prompt is empty or does not already start with it: check `prompt_len_input_ids == 0 or bos_token_id != prompt_tokens["prompt_input_ids"][0]`.
  • Action: For Llama-3 evaluation, use the pre-update tokenizer (before the HuggingFace PR) to avoid double-BOS issues with vLLM.
  • Value: Exactly 1 BOS token at the start of every sequence.
  • Trade-off: None. This is a correctness requirement, not an optimization.

Reasoning

Tokenizer chat templates often prepend BOS automatically. When the training code also prepends BOS, sequences end up with two BOS tokens. This creates a mismatch between training and inference, or between different evaluation setups. The SimPO authors explicitly warn about this: "the updated Llama3 tokenizer with vLLM occasionally introduces two BOS tokens, which can affect evaluation results."
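
The mechanism described above can be illustrated with a toy model. Everything in this sketch is a stand-in: `toy_chat_template` and `toy_tokenize` are hypothetical functions written for illustration, not a real tokenizer API, and `<s>` is an arbitrary BOS string. The only point is to show how a BOS emitted by the template plus a BOS added at tokenization yields two.

```python
from typing import List

BOS = "<s>"  # arbitrary BOS string for this toy example


def toy_chat_template(user_message: str) -> str:
    """Stand-in for a chat template that already emits BOS in its output."""
    return f"{BOS}[USER] {user_message} [ASSISTANT]"


def toy_tokenize(text: str, add_special_tokens: bool = True) -> List[str]:
    """Stand-in tokenizer: BOS strings become their own tokens, the rest is
    split on whitespace. With add_special_tokens=True it prepends BOS without
    checking whether the text already starts with one."""
    tokens: List[str] = []
    while text.startswith(BOS):
        tokens.append(BOS)
        text = text[len(BOS):]
    tokens.extend(text.split())
    if add_special_tokens:
        tokens = [BOS] + tokens
    return tokens


rendered = toy_chat_template("hi")
doubled = toy_tokenize(rendered)                           # template BOS + tokenizer BOS
single = toy_tokenize(rendered, add_special_tokens=False)  # template BOS only

print(doubled)  # ['<s>', '<s>', '[USER]', 'hi', '[ASSISTANT]']
print(single)   # ['<s>', '[USER]', 'hi', '[ASSISTANT]']
```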

Code evidence from `scripts/run_simpo.py:107-111`:

example["text_chosen"] = tokenizer.apply_chat_template(chosen_messages, tokenize=False)
if example["text_chosen"].startswith(tokenizer.bos_token):
    example["text_chosen"] = example["text_chosen"][len(tokenizer.bos_token):]
example["text_rejected"] = tokenizer.apply_chat_template(rejected_messages, tokenize=False)
if example["text_rejected"].startswith(tokenizer.bos_token):
    example["text_rejected"] = example["text_rejected"][len(tokenizer.bos_token):]
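
The repeated strip pattern in the excerpt above could be factored into a small helper. This is a sketch, not code from the SimPO repository: `strip_leading_bos` is a hypothetical name, and the guard for a missing `bos_token` is an added assumption (a tokenizer without a BOS token would otherwise make `startswith` raise a `TypeError`).

```python
from typing import Optional


def strip_leading_bos(text: str, bos_token: Optional[str]) -> str:
    """Remove one leading BOS string from text, if present.

    Only a single occurrence is removed, matching the behavior of the
    excerpt above; text without a leading BOS is returned unchanged.
    """
    if bos_token and text.startswith(bos_token):
        return text[len(bos_token):]
    return text


# "<s>" is an arbitrary BOS string used for illustration.
print(strip_leading_bos("<s>Hello there", "<s>"))  # Hello there
print(strip_leading_bos("Hello there", "<s>"))     # Hello there
print(strip_leading_bos("Hello there", None))      # Hello there
```

With such a helper, each of the two branches would reduce to one call on the result of `tokenizer.apply_chat_template(..., tokenize=False)`.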

Code evidence from `scripts/simpo_trainer.py:406-415`:

# add BOS token to head of prompt. Avoid adding if it's already there
bos_token_id = self.tokenizer.bos_token_id
if prompt_len_input_ids == 0 or bos_token_id != prompt_tokens["prompt_input_ids"][0]:
    prompt_tokens["prompt_input_ids"] = [bos_token_id] + prompt_tokens["prompt_input_ids"]
    prompt_tokens["prompt_attention_mask"] = [1] + prompt_tokens["prompt_attention_mask"]
if chosen_prompt_len_input_ids == 0 or bos_token_id != chosen_tokens["prompt_input_ids"][0]:
    chosen_tokens["prompt_input_ids"] = [bos_token_id] + chosen_tokens["prompt_input_ids"]
    chosen_tokens["prompt_attention_mask"] = [1] + chosen_tokens["prompt_attention_mask"]
if rejected_prompt_len_input_ids == 0 or bos_token_id != rejected_tokens["prompt_input_ids"][0]:
    rejected_tokens["prompt_input_ids"] = [bos_token_id] + rejected_tokens["prompt_input_ids"]
    rejected_tokens["prompt_attention_mask"] = [1] + rejected_tokens["prompt_attention_mask"]
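
The three branches above apply the same rule to the prompt, chosen, and rejected token dictionaries. A minimal sketch of that rule as one function is shown below; `ensure_bos` is a hypothetical helper name, and the dictionary keys are assumed to match those in the excerpt.

```python
from typing import Dict, List


def ensure_bos(tokens: Dict[str, List[int]], bos_token_id: int) -> Dict[str, List[int]]:
    """Prepend BOS (and a matching attention-mask entry) unless the prompt
    ids already start with it. An empty prompt always receives BOS."""
    ids = tokens["prompt_input_ids"]
    if len(ids) == 0 or ids[0] != bos_token_id:
        tokens["prompt_input_ids"] = [bos_token_id] + ids
        tokens["prompt_attention_mask"] = [1] + tokens["prompt_attention_mask"]
    return tokens


BOS_ID = 1  # illustrative id; read the real one from tokenizer.bos_token_id

missing = {"prompt_input_ids": [5, 6], "prompt_attention_mask": [1, 1]}
present = {"prompt_input_ids": [BOS_ID, 5, 6], "prompt_attention_mask": [1, 1, 1]}
empty = {"prompt_input_ids": [], "prompt_attention_mask": []}

print(ensure_bos(missing, BOS_ID)["prompt_input_ids"])  # [1, 5, 6]
print(ensure_bos(present, BOS_ID)["prompt_input_ids"])  # [1, 5, 6]
print(ensure_bos(empty, BOS_ID)["prompt_input_ids"])    # [1]
```

Note that this rule only adds a missing BOS; it does not collapse a prompt that already starts with two, which is why the string-level strip earlier in the pipeline is still needed.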
