| Field | Value |
|---|---|
| Page Type | Implementation |
| Title | Split Long Conversation |
| Repository | lm-sys/FastChat |
| Knowledge Sources | Source Code Analysis, API Documentation |
| Domains | Data Preprocessing, NLP Pipeline, Tokenization, Context Window Management |
| Last Updated | 2026-02-07 14:00 GMT |
## Overview
Split Long Conversation is the implementation module that splits conversations exceeding a model's maximum token length into smaller sub-conversations at turn boundaries. It uses a model-specific tokenizer for accurate token counting and processes conversations in parallel via ProcessPoolExecutor with 1000-item chunks.
## Description
This module reads a cleaned and language-filtered JSON file of ShareGPT conversations, tokenizes each turn to measure length, and splits any conversation that exceeds the configured maximum token length. Splits always occur between human/gpt turn pairs to maintain conversational integrity. After splitting, an additional pass filters out any sub-conversations with invalid role alternation.
The module uses module-level global variables (tokenizer and max_length) to share the tokenizer instance across parallel workers: the split_all function sets them before spawning the worker processes, so each forked worker inherits them.
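This pattern can be illustrated with a minimal, runnable sketch. The names mirror the module's but the worker body is a placeholder, and the "fork" start method is pinned explicitly (POSIX only) so the global-inheritance behavior the sketch relies on actually holds; this is an illustration, not the module's source.

```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor

# Module-level globals shared with the workers. With the "fork" start
# method, worker processes inherit whatever values the parent set
# before the pool spawned them.
tokenizer = None
max_length = None


def split_one_sample_sketch(sample):
    # Workers read the globals instead of receiving the (potentially
    # large) tokenizer as a per-task argument.
    assert tokenizer is not None and max_length is not None
    return [sample]  # placeholder: the real code splits at turn boundaries


def split_all_sketch(content, tokenizer_, max_length_):
    global tokenizer, max_length
    # Set the globals before the pool forks its workers.
    tokenizer, max_length = tokenizer_, max_length_
    new_content = []
    ctx = multiprocessing.get_context("fork")
    with ProcessPoolExecutor(mp_context=ctx) as executor:
        # chunksize=1000 mirrors the 1000-item chunks mentioned above.
        for result in executor.map(split_one_sample_sketch, content, chunksize=1000):
            new_content.extend(result)
    return new_content
```

Passing the tokenizer through globals rather than as a task argument avoids pickling it for every submitted item.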
## Usage

### CLI Invocation

```shell
python3 -m fastchat.data.split_long_conversation \
    --in-file sharegpt_clean_lang.json \
    --out-file sharegpt_clean_lang_split.json \
    --model-name-or-path meta-llama/Llama-2-7b-chat-hf \
    --max-length 2048
```
### CLI Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| --in-file | str | Yes | -- | Path to input JSON file (language-filtered conversations) |
| --out-file | str | No | sharegpt_split.json | Path to output JSON file |
| --begin | int | No | None | Start index for slicing input content |
| --end | int | No | None | End index for slicing input content |
| --model-name-or-path | str | Yes | -- | HuggingFace model name or local path (used to load the tokenizer) |
| --max-length | int | No | 2048 | Maximum token length per sub-conversation |
### Programmatic Import

```python
from fastchat.data.split_long_conversation import split_all, split_one_sample, filter_invalid_roles
```
## Code Reference

### Source Location

| Item | Location |
|---|---|
| Module | fastchat/data/split_long_conversation.py |
| Core Functions | Lines 30-102 |
| make_sample | Lines 18-24 |
| Repository | github.com/lm-sys/FastChat |
### Function Signatures

```python
def split_all(content, begin, end, tokenizer_, max_length_) -> list[dict]:
    """
    Keep the maximum number of conversation rounds within the max token length constraint.

    Processes conversations in parallel using ProcessPoolExecutor with 1000-item chunks.
    Sets the global tokenizer and max_length, then dispatches to worker functions.

    Args:
        content: list of conversation dicts
        begin: start index for slicing (can be None)
        end: end index for slicing (can be None)
        tokenizer_: HuggingFace tokenizer instance
        max_length_: maximum token length per sub-conversation

    Returns:
        List of split conversation dicts.
    """
```
```python
def split_one_sample(sample) -> list[dict]:
    """
    Split one conversation at turn boundaries based on token count.

    Uses the global tokenizer to measure each turn's length (+6 token overhead).
    Iterates through human/gpt pairs, accumulating token counts, and creates
    a new sub-conversation when the accumulated length exceeds max_length.

    Args:
        sample: dict with "id", "conversations", and optional "model" keys.

    Returns:
        List of sub-conversation dicts. Empty list if the input is invalid
        (odd number of turns or fewer than 2 turns).
    """
```
```python
def filter_invalid_roles(content) -> list[dict]:
    """
    Filter out conversations that don't have strictly alternating human/gpt roles.

    Ensures each conversation starts with 'human' and alternates correctly.

    Args:
        content: list of conversation dicts.

    Returns:
        Filtered list containing only valid conversations.
    """
```
```python
def make_sample(sample, start_idx, end_idx) -> dict:
    """
    Create a sub-conversation dict from a slice of the original conversation.

    Generates a new ID by appending '_<start_idx>' to the original ID.

    Args:
        sample: original conversation dict
        start_idx: start turn index (inclusive)
        end_idx: end turn index (exclusive)

    Returns:
        New conversation dict with sliced conversations and derived ID.
    """
```
### Import

```python
from fastchat.data.split_long_conversation import split_all
from fastchat.data.split_long_conversation import split_one_sample
from fastchat.data.split_long_conversation import filter_invalid_roles
from fastchat.data.split_long_conversation import make_sample
```
## I/O Contract

### Inputs

| Input | Type | Description |
|---|---|---|
| in_file | JSON file | Language-filtered ShareGPT JSON: a list of dicts with "id", "conversations", and optional "model" fields. Conversations should already be in Markdown format. |
| tokenizer model | HuggingFace model | The model name or path used to instantiate transformers.AutoTokenizer. This determines how text is tokenized for length calculation. |
### Outputs

| Output | Type | Description |
|---|---|---|
| out_file | JSON file | Split JSON where long conversations have been divided into multiple sub-conversations. Each sub-conversation fits within max_length tokens and has a unique ID (original ID + start index suffix). |

Example: a conversation with ID "abc123" that spans 6000 tokens with max_length=2048 might produce:

- "abc123_0" -- turns 0-5 (~1800 tokens)
- "abc123_6" -- turns 6-11 (~2000 tokens)
- "abc123_12" -- turns 12-15 (~1500 tokens)
## Dependencies

| Package | Purpose |
|---|---|
| transformers | AutoTokenizer for token-based length measurement |
| tqdm | Progress bar for parallel processing |
## Usage Examples

### Pipeline Usage (from prepare_all.py)

```shell
python3 -m fastchat.data.split_long_conversation \
    --in ~/datasets/sharegpt_20230521_4k_clean_lang.json \
    --out ~/datasets/sharegpt_20230521_4k_clean_lang_split.json \
    --model-name meta-llama/Llama-2-7b-chat-hf \
    --max-length 4096
```

Note that --in, --out, and --model-name are unambiguous prefixes of --in-file, --out-file, and --model-name-or-path, which argparse accepts by default.
### Programmatic Usage

```python
import json

import transformers

from fastchat.data.split_long_conversation import split_all, filter_invalid_roles

tokenizer = transformers.AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    model_max_length=2048,
    padding_side="right",
    use_fast=False,
)

with open("sharegpt_clean_lang.json", "r") as f:
    content = json.load(f)

split_content = split_all(
    content, begin=0, end=len(content), tokenizer_=tokenizer, max_length_=2048
)
valid_content = filter_invalid_roles(split_content)
print(f"#in: {len(content)}, #out: {len(valid_content)}")

with open("sharegpt_split.json", "w") as f:
    json.dump(valid_content, f, indent=2, ensure_ascii=False)
```
## Related Pages