Principle:Lm_sys_FastChat_Long_Conversation_Splitting
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | Long Conversation Splitting |
| Repository | lm-sys/FastChat |
| Knowledge Sources | Source Code Analysis, API Documentation |
| Domains | Data Preprocessing, NLP Pipeline, Tokenization, Context Window Management |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Long Conversation Splitting is a data preprocessing principle in the FastChat ShareGPT Data Pipeline that addresses the fundamental constraint of model context windows. Transformer-based language models have a fixed maximum sequence length (e.g., 2048 or 4096 tokens), and training conversations that exceed this limit must be split into smaller sub-conversations. This principle governs how long conversations are divided at turn boundaries while preserving conversational coherence and maintaining valid role alternation.
Description
Token-Based Length Calculation
Rather than splitting by character count or word count, this principle uses token-based length measurement. Each conversation turn is tokenized using the target model's tokenizer (e.g., LLaMA tokenizer), and a small overhead of 6 tokens per turn is added to account for special tokens and formatting. The cumulative token count across turns determines when a split is necessary.
This approach is critical because different tokenizers produce different token counts for the same text. A conversation that fits within 2048 tokens for one model may exceed the limit for another, depending on vocabulary size and tokenization strategy.
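The per-turn measurement described above can be sketched as follows. This is a minimal illustration, not FastChat's exact code: the helper names are hypothetical, and a toy whitespace tokenizer stands in for the real model tokenizer (in practice this would be loaded via `transformers.AutoTokenizer`) so the example is self-contained.

```python
# Hypothetical sketch of token-based turn length measurement.
# A real pipeline would use the target model's subword tokenizer;
# a toy whitespace tokenizer stands in here.

TURN_OVERHEAD = 6  # extra tokens per turn for special tokens / formatting


def toy_tokenize(text: str) -> list[str]:
    """Stand-in for a real subword tokenizer."""
    return text.split()


def turn_lengths(conversation: list[dict]) -> list[int]:
    """Token count of each turn, including the fixed per-turn overhead."""
    return [len(toy_tokenize(t["value"])) + TURN_OVERHEAD for t in conversation]


conv = [
    {"from": "human", "value": "What is the capital of France?"},
    {"from": "gpt", "value": "The capital of France is Paris."},
]
print(turn_lengths(conv))  # [12, 12]
```

The cumulative sum of these per-turn lengths is what the splitting loop compares against the maximum sequence length.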
Splitting at Turn Boundaries
Conversations are split exclusively at turn boundaries, specifically between human/gpt turn pairs. The principle never splits in the middle of a single message. When the cumulative token count of turns exceeds the maximum length:
- A new sub-conversation is created from the accumulated turns up to (but not including) the current pair.
- The current pair begins a new sub-conversation.
- This continues until all turns are assigned to a sub-conversation.
This ensures that each sub-conversation begins with a human turn and ends with a gpt turn, maintaining the conversational structure required for chat fine-tuning.
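The loop described above can be sketched as a pair-wise scan over the turns. This is an illustrative reconstruction under the stated rules (split only between human/gpt pairs, cut before the pair that would overflow), not FastChat's exact implementation; the function name and signature are assumptions.

```python
# Hypothetical sketch of turn-boundary splitting.
# `turns` is the list of messages; `lengths` holds the precomputed
# token count of each turn (overhead included).

def split_conversation(turns, lengths, max_length):
    """Return (start_index, sub_turns) for each sub-conversation."""
    subs = []
    start, cur_len = 0, 0
    for i in range(0, len(turns), 2):  # step over human/gpt pairs
        pair_len = sum(lengths[i:i + 2])
        if cur_len + pair_len > max_length and cur_len > 0:
            subs.append((start, turns[start:i]))  # cut before current pair
            start, cur_len = i, 0
        cur_len += pair_len
    if turns[start:]:
        subs.append((start, turns[start:]))
    return subs


print(split_conversation(["h1", "g1", "h2", "g2"], [500, 600, 700, 800], 2048))
# [(0, ['h1', 'g1']), (2, ['h2', 'g2'])]
```

Note the `cur_len > 0` guard: a single pair that by itself exceeds the limit still forms its own sub-conversation rather than producing an empty one.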
Sub-Conversation ID Generation
Each sub-conversation receives a unique identifier derived from the original conversation ID plus a start index suffix. For example, if conversation "abc123" is split at turn indices 0, 4, and 8, the resulting sub-conversations would have IDs "abc123_0", "abc123_4", and "abc123_8". This provides traceability back to the source conversation.
Invalid Role Filtering
After splitting, a secondary validation pass ensures that all resulting sub-conversations maintain strict role alternation (human, gpt, human, gpt, ...). Sub-conversations that fail this check (for example, because the source conversation itself contained consecutive turns from the same role) are discarded. This safety net guarantees that only well-formed training examples reach downstream stages.
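A minimal sketch of this validation predicate, under the assumption that a valid sub-conversation starts with a human turn, ends with a gpt turn, and alternates strictly (function name is illustrative):

```python
# Hypothetical sketch of the post-split role-alternation check.
def roles_alternate(turns: list[dict]) -> bool:
    """True if turns go human, gpt, human, gpt, ... and end on gpt."""
    expected = ["human", "gpt"]
    return len(turns) % 2 == 0 and all(
        t["from"] == expected[i % 2] for i, t in enumerate(turns)
    )


good = [{"from": "human", "value": "hi"}, {"from": "gpt", "value": "hello"}]
bad = [{"from": "gpt", "value": "hello"}, {"from": "human", "value": "hi"}]
print(roles_alternate(good), roles_alternate(bad))  # True False
```

The even-length requirement also rejects a trailing unanswered human turn, matching the turn-pair integrity requirement discussed under Theoretical Basis.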
Parallel Processing
To handle large datasets efficiently, the splitting operation is parallelized using ProcessPoolExecutor. The input is divided into chunks of 1000 conversations each, and each chunk is processed independently by a worker. This chunked approach balances parallelism overhead with throughput.
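The chunked fan-out can be sketched as below. The worker body is a stand-in (a real worker would run the splitting logic on each conversation), and the helper names are hypothetical; only the `ProcessPoolExecutor` usage and the chunk size of 1000 come from the description above.

```python
# Hypothetical sketch of chunked parallel processing.
from concurrent.futures import ProcessPoolExecutor

CHUNK_SIZE = 1000  # conversations per worker task


def make_chunks(items, size=CHUNK_SIZE):
    """Divide the input into fixed-size chunks (last chunk may be short)."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_chunk(chunk):
    """Stand-in worker; a real worker would split each conversation."""
    return [c.upper() for c in chunk]


def parallel_split(conversations):
    results = []
    with ProcessPoolExecutor() as pool:
        for out in pool.map(process_chunk, make_chunks(conversations)):
            results.extend(out)
    return results
```

Chunking amortizes per-task serialization overhead: dispatching one task per conversation would dominate runtime with inter-process communication, while one task per 1000 conversations keeps workers busy with useful work.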
Usage
In the standard FastChat pipeline, long conversation splitting is the third step, applied after language filtering:
```shell
python3 -m fastchat.data.split_long_conversation \
    --in sharegpt_clean_lang.json \
    --out sharegpt_clean_lang_split.json \
    --model-name-or-path meta-llama/Llama-2-7b-chat-hf \
    --max-length 4096
```
The `--model-name-or-path` parameter is required because the tokenizer determines how text maps to tokens, which directly affects where splits occur.
Theoretical Basis
The need for conversation splitting arises from the fixed context window constraint in transformer architectures:
- Positional encoding limits: Transformers use positional encodings (learned or sinusoidal) with a fixed maximum length. Training on longer sequences either fails outright (learned embeddings have no entries for out-of-range positions) or degrades quality (sinusoidal encodings extrapolate poorly beyond the lengths seen in training).
- Memory scaling: Self-attention has O(n^2) memory complexity with respect to sequence length. Even if positional encodings allowed longer sequences, memory constraints impose practical limits.
- Training signal preservation: By splitting at turn boundaries rather than arbitrary positions, the model still learns complete question-answer exchanges. Mid-turn splits would create partial messages that confuse the training objective.
- Turn-pair integrity: Maintaining human/gpt pairs is essential for the chat fine-tuning loss function, which typically masks the human turns and computes loss only on the gpt turns. An incomplete pair would either waste compute (a human turn with no response) or create an orphaned response.
- Token-level accuracy: Using the actual model tokenizer for length calculation avoids the approximation errors inherent in character-based or word-based estimates. This is especially important for non-English text and code, where the relationship between characters and tokens varies widely.
Related Pages
- Implementation:Lm_sys_FastChat_Split_Long_Conversation -- The implementation that realizes this principle
- Principle:Lm_sys_FastChat_Language_Based_Filtering -- Previous pipeline stage: language filtering
- Principle:Lm_sys_FastChat_Conversation_Format_Validation -- Next pipeline stage: format validation