# Heuristic: LMSYS FastChat Conversation Splitting Token Buffer
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Processing |
| Last Updated | 2026-02-07 04:00 GMT |
## Overview
Data processing heuristic that adds a +6 token buffer per conversation turn when estimating token lengths, and enforces even-turn alignment when splitting long conversations.
## Description
When splitting long ShareGPT conversations to fit within the model's context window (default 2048 tokens), FastChat adds 6 extra tokens to each turn's tokenized length as a safety buffer. This accounts for conversation template tokens (role markers, separators, special tokens) that are added during training preprocessing but not present in the raw conversation text. Additionally, conversations are truncated to an even number of turns before splitting to ensure every split starts with a human message and ends with an assistant response.
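The per-turn length estimate described above can be sketched as follows. This is a minimal illustration, not FastChat's implementation: `toy_tokenize` is a hypothetical whitespace stand-in for a real tokenizer call such as `tokenizer(text).input_ids`.

```python
BUFFER = 6  # safety margin for template tokens added during preprocessing

def toy_tokenize(text):
    # Hypothetical stand-in for a real subword tokenizer.
    return text.split()

def estimated_turn_length(turn):
    """Raw token count plus the +6 template-token buffer."""
    return len(toy_tokenize(turn["value"])) + BUFFER

turn = {"from": "human", "value": "What is the capital of France?"}
print(estimated_turn_length(turn))  # 6 raw tokens + 6 buffer = 12
```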
## Usage
Use this heuristic when processing conversation data for training or when debugging why training data has unexpected token lengths. The +6 buffer and even-turn enforcement prevent truncation of assistant responses and misaligned training pairs.
## The Insight (Rule of Thumb)
- Action: Add 6 tokens to each turn's raw token count when estimating if a conversation fits within `max_length`.
- Value: +6 tokens per turn; truncate to an even number of turns before processing.
- Trade-off: The buffer is conservative — it may cause some conversations to be split unnecessarily (losing a turn that could have fit). But this prevents the more harmful case of a conversation exceeding `max_length` after template tokens are added, which would cause truncation of the assistant's response.
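The trade-off can be made concrete with hypothetical numbers: a human/assistant pair whose raw token counts fit under `max_length` may no longer fit once the buffer is applied, so the conservative estimate triggers a split.

```python
# Hypothetical raw token counts for one human/gpt pair.
max_length = 30
raw_lens = [12, 14]
buffered_lens = [n + 6 for n in raw_lens]  # the +6 buffer per turn

fits_raw = sum(raw_lens) <= max_length            # 26 <= 30 -> True
fits_buffered = sum(buffered_lens) <= max_length  # 38 <= 30 -> False
print(fits_raw, fits_buffered)  # True False
```

The pair is split even though its raw counts fit, which is the "unnecessary split" cost the buffer accepts in exchange for never exceeding `max_length` after template rendering.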
## Reasoning
The conversation template adds tokens that are not present in the raw message text:
- Role markers (e.g., `"USER:"`, `"ASSISTANT:"`) — typically 2-3 tokens each
- Separators between turns (e.g., ``, `\n`) — 1-2 tokens
- BOS/EOS tokens — 1-2 tokens
The +6 buffer covers the worst case across different conversation templates. The Vicuna template, for example, adds `"USER: "` and `" ASSISTANT: "` markers plus `""` separators.
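A rough sketch of why template rendering inflates token counts, using Vicuna-style role markers as mentioned above. The whitespace "tokenizer" is a hypothetical stand-in; a real BPE tokenizer would typically report a larger overhead because markers like `"ASSISTANT:"` split into multiple subword tokens.

```python
def toy_tokens(text):
    # Hypothetical whitespace stand-in for a real tokenizer.
    return text.split()

raw = "Hello there"
templated = "USER: Hello there ASSISTANT:"  # Vicuna-style markers added
overhead = len(toy_tokens(templated)) - len(toy_tokens(raw))
print(overhead)  # 2 whitespace tokens; real subword overhead is larger
```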
The even-turn truncation ensures that every split conversation is a valid human-assistant exchange:
- `conversations = conversations[: len(conversations) // 2 * 2]` drops any trailing human message without a response
- `assert (end_idx - start_idx) % 2 == 0` in `make_sample()` enforces this invariant at split boundaries
- The `filter_invalid_roles()` function additionally validates that turns strictly alternate `human` → `gpt` → `human` → `gpt`
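The truncation and alternation checks above can be demonstrated on a toy conversation list; the all-in-one `valid` expression below condenses the loop from `filter_invalid_roles()` into a sketch.

```python
conversations = [
    {"from": "human", "value": "hi"},
    {"from": "gpt", "value": "hello"},
    {"from": "human", "value": "dangling question"},  # no assistant reply
]

# Drop any trailing human message without a response:
conversations = conversations[: len(conversations) // 2 * 2]
print(len(conversations))  # 2

# Strict human -> gpt alternation, as filter_invalid_roles() checks:
roles = ["human", "gpt"]
valid = all(s["from"] == roles[j % 2] for j, s in enumerate(conversations))
print(valid)  # True
```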
## Code Evidence
Token buffer from `fastchat/data/split_long_conversation.py:35`:
```python
for c in conversations:
    length = len(tokenizer(c["value"]).input_ids) + 6
    tokenized_lens.append(length)
```
Even-turn truncation from `fastchat/data/split_long_conversation.py:33`:
```python
conversations = conversations[: len(conversations) // 2 * 2]
```
Even-turn assertion from `fastchat/data/split_long_conversation.py:19`:
```python
def make_sample(sample, start_idx, end_idx):
    assert (end_idx - start_idx) % 2 == 0
```
Role alternation validation from `fastchat/data/split_long_conversation.py:89-97`:
```python
def filter_invalid_roles(content):
    new_content = []
    for i, c in enumerate(content):
        roles = ["human", "gpt"]
        ...
        for j, s in enumerate(c["conversations"]):
            if s["from"] != roles[j % 2]:
                valid = False
                break
```
Pair-wise splitting from `fastchat/data/split_long_conversation.py:45-54`:
```python
for i in range(0, len(conversations), 2):
    tmp_len = tokenized_lens[i] + tokenized_lens[i + 1]
    if cur_len + tmp_len > max_length:
        new_samples.append(make_sample(sample, start_idx, i))
        start_idx = i
        cur_len = 0
    cur_len += tmp_len
```
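Putting the pieces together, an end-to-end sketch of the splitting loop quoted above might look like this. The function name `split_conversation`, the whitespace tokenizer, and the `i > start_idx` guard against empty splits are all additions for illustration, not part of the quoted FastChat code.

```python
def split_conversation(conversations, max_length, buffer=6):
    """Split an even-length human/gpt turn list into (start, end) index
    ranges whose buffered token totals fit within max_length."""
    # Whitespace split is a hypothetical stand-in for a real tokenizer.
    tokenized_lens = [len(c["value"].split()) + buffer for c in conversations]
    splits = []
    start_idx = 0
    cur_len = 0
    for i in range(0, len(conversations), 2):  # step by human/gpt pairs
        tmp_len = tokenized_lens[i] + tokenized_lens[i + 1]
        if cur_len + tmp_len > max_length and i > start_idx:
            splits.append((start_idx, i))
            start_idx = i
            cur_len = 0
        cur_len += tmp_len
    splits.append((start_idx, len(conversations)))
    return splits

turns = [
    {"from": "human", "value": "a b c"},
    {"from": "gpt", "value": "d e"},
    {"from": "human", "value": "f"},
    {"from": "gpt", "value": "g h i j"},
]
# Buffered pair lengths are 17 and 17; 34 > 20 forces a split.
print(split_conversation(turns, max_length=20))  # [(0, 2), (2, 4)]
```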