# Heuristic: LMSYS FastChat Conversation Splitting Token Buffer
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Processing |
| Last Updated | 2026-02-07 04:00 GMT |
## Overview
Data processing heuristic that adds a +6 token buffer per conversation turn when estimating token lengths, and enforces even-turn alignment when splitting long conversations.
## Description
When splitting long ShareGPT conversations to fit within the model's context window (default 2048 tokens), FastChat adds 6 extra tokens to each turn's tokenized length as a safety buffer. This accounts for conversation template tokens (role markers, separators, special tokens) that are added during training preprocessing but not present in the raw conversation text. Additionally, conversations are truncated to an even number of turns before splitting to ensure every split starts with a human message and ends with an assistant response.
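The per-turn length estimate described above can be sketched as follows. This is a minimal illustration, not FastChat's implementation: `toy_tokenize` is a hypothetical whitespace stand-in for a real tokenizer call such as `tokenizer(text).input_ids`.

```python
BUFFER = 6  # safety margin for template tokens added during preprocessing

def toy_tokenize(text):
    # Hypothetical stand-in for a real subword tokenizer.
    return text.split()

def estimated_turn_length(turn):
    """Raw token count plus the +6 template-token buffer."""
    return len(toy_tokenize(turn["value"])) + BUFFER

turn = {"from": "human", "value": "What is the capital of France?"}
print(estimated_turn_length(turn))  # 6 raw tokens + 6 buffer = 12
```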
## Usage
Use this heuristic when processing conversation data for training or when debugging why training data has unexpected token lengths. The +6 buffer and even-turn enforcement prevent truncation of assistant responses and misaligned training pairs.
## The Insight (Rule of Thumb)
- Action: Add 6 tokens to each turn's raw token count when estimating if a conversation fits within `max_length`.
- Value: +6 tokens per turn; truncate to an even number of turns before processing.
- Trade-off: The buffer is conservative — it may cause some conversations to be split unnecessarily (losing a turn that could have fit). But this prevents the more harmful case of a conversation exceeding `max_length` after template tokens are added, which would cause truncation of the assistant's response.
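The trade-off can be made concrete with hypothetical numbers: a human/assistant pair whose raw token counts fit under `max_length` may no longer fit once the buffer is applied, so the conservative estimate triggers a split.

```python
# Hypothetical raw token counts for one human/gpt pair.
max_length = 30
raw_lens = [12, 14]
buffered_lens = [n + 6 for n in raw_lens]  # the +6 buffer per turn

fits_raw = sum(raw_lens) <= max_length            # 26 <= 30 -> True
fits_buffered = sum(buffered_lens) <= max_length  # 38 <= 30 -> False
print(fits_raw, fits_buffered)  # True False
```

The pair is split even though its raw counts fit, which is the "unnecessary split" cost the buffer accepts in exchange for never exceeding `max_length` after template rendering.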
## Reasoning
The conversation template adds tokens that are not present in the raw message text:
- Role markers (e.g., `"USER:"`, `"ASSISTANT:"`) — typically 2-3 tokens each
- Separators between turns (e.g., ``, `\n`) — 1-2 tokens
- BOS/EOS tokens — 1-2 tokens
The +6 buffer covers the worst case across different conversation templates. The Vicuna template, for example, adds `"USER: "` and `" ASSISTANT: "` markers plus `""` separators.
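A rough sketch of why template rendering inflates token counts, using Vicuna-style role markers as mentioned above. The whitespace "tokenizer" is a hypothetical stand-in; a real BPE tokenizer would typically report a larger overhead because markers like `"ASSISTANT:"` split into multiple subword tokens.

```python
def toy_tokens(text):
    # Hypothetical whitespace stand-in for a real tokenizer.
    return text.split()

raw = "Hello there"
templated = "USER: Hello there ASSISTANT:"  # Vicuna-style markers added
overhead = len(toy_tokens(templated)) - len(toy_tokens(raw))
print(overhead)  # 2 whitespace tokens; real subword overhead is larger
```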
The even-turn truncation ensures that every split conversation is a valid human-assistant exchange:
- `conversations = conversations[: len(conversations) // 2 * 2]` drops any trailing human message without a response
- `assert (end_idx - start_idx) % 2 == 0` in `make_sample()` enforces this invariant at split boundaries
- The `filter_invalid_roles()` function additionally validates that turns strictly alternate `human` → `gpt` → `human` → `gpt`
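The truncation and alternation checks above can be demonstrated on a toy conversation list; the all-in-one `valid` expression below condenses the loop from `filter_invalid_roles()` into a sketch.

```python
conversations = [
    {"from": "human", "value": "hi"},
    {"from": "gpt", "value": "hello"},
    {"from": "human", "value": "dangling question"},  # no assistant reply
]

# Drop any trailing human message without a response:
conversations = conversations[: len(conversations) // 2 * 2]
print(len(conversations))  # 2

# Strict human -> gpt alternation, as filter_invalid_roles() checks:
roles = ["human", "gpt"]
valid = all(s["from"] == roles[j % 2] for j, s in enumerate(conversations))
print(valid)  # True
```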
## Code Evidence
Token buffer from `fastchat/data/split_long_conversation.py:35`:
```python
for c in conversations:
    length = len(tokenizer(c["value"]).input_ids) + 6
    tokenized_lens.append(length)
```
Even-turn truncation from `fastchat/data/split_long_conversation.py:33`:
```python
conversations = conversations[: len(conversations) // 2 * 2]
```
Even-turn assertion from `fastchat/data/split_long_conversation.py:19`:
```python
def make_sample(sample, start_idx, end_idx):
    assert (end_idx - start_idx) % 2 == 0
```
Role alternation validation from `fastchat/data/split_long_conversation.py:89-97`:
```python
def filter_invalid_roles(content):
    new_content = []
    for i, c in enumerate(content):
        roles = ["human", "gpt"]
        ...
        for j, s in enumerate(c["conversations"]):
            if s["from"] != roles[j % 2]:
                valid = False
                break
```
Pair-wise splitting from `fastchat/data/split_long_conversation.py:45-54`:
```python
for i in range(0, len(conversations), 2):
    tmp_len = tokenized_lens[i] + tokenized_lens[i + 1]
    if cur_len + tmp_len > max_length:
        new_samples.append(make_sample(sample, start_idx, i))
        start_idx = i
        cur_len = 0
    cur_len += tmp_len
```
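Putting the pieces together, an end-to-end sketch of the splitting loop quoted above might look like this. The function name `split_conversation`, the whitespace tokenizer, and the `i > start_idx` guard against empty splits are all additions for illustration, not part of the quoted FastChat code.

```python
def split_conversation(conversations, max_length, buffer=6):
    """Split an even-length human/gpt turn list into (start, end) index
    ranges whose buffered token totals fit within max_length."""
    # Whitespace split is a hypothetical stand-in for a real tokenizer.
    tokenized_lens = [len(c["value"].split()) + buffer for c in conversations]
    splits = []
    start_idx = 0
    cur_len = 0
    for i in range(0, len(conversations), 2):  # step by human/gpt pairs
        tmp_len = tokenized_lens[i] + tokenized_lens[i + 1]
        if cur_len + tmp_len > max_length and i > start_idx:
            splits.append((start_idx, i))
            start_idx = i
            cur_len = 0
        cur_len += tmp_len
    splits.append((start_idx, len(conversations)))
    return splits

turns = [
    {"from": "human", "value": "a b c"},
    {"from": "gpt", "value": "d e"},
    {"from": "human", "value": "f"},
    {"from": "gpt", "value": "g h i j"},
]
# Buffered pair lengths are 17 and 17; 34 > 20 forces a split.
print(split_conversation(turns, max_length=20))  # [(0, 2), (2, 4)]
```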