| Field | Value |
|---|---|
| Page Type | Implementation |
| Title | Split Long Conversation |
| Repository | lm-sys/FastChat |
| Knowledge Sources | Source Code Analysis, API Documentation |
| Domains | Data Preprocessing, NLP Pipeline, Tokenization, Context Window Management |
| Last Updated | 2026-02-07 14:00 GMT |
## Overview
Split Long Conversation is the implementation module that splits conversations exceeding a model's maximum token length into smaller sub-conversations at turn boundaries. It uses a model-specific tokenizer for accurate token counting and processes conversations in parallel via ProcessPoolExecutor with 1000-item chunks.
## Description
This module reads a cleaned and language-filtered JSON file of ShareGPT conversations, tokenizes each turn to measure length, and splits any conversation that exceeds the configured maximum token length. Splits always occur between human/gpt turn pairs to maintain conversational integrity. After splitting, an additional pass filters out any sub-conversations with invalid role alternation.
The module uses module-level global variables (tokenizer and max_length) to share the tokenizer instance across parallel workers: the split_all function sets them before spawning the worker processes, so each forked worker inherits them.
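This pattern can be illustrated with a minimal, runnable sketch. The names mirror the module's but the worker body is a placeholder, and the "fork" start method is pinned explicitly (POSIX only) so the global-inheritance behavior the sketch relies on actually holds; this is an illustration, not the module's source.

```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor

# Module-level globals shared with the workers. With the "fork" start
# method, worker processes inherit whatever values the parent set
# before the pool spawned them.
tokenizer = None
max_length = None


def split_one_sample_sketch(sample):
    # Workers read the globals instead of receiving the (potentially
    # large) tokenizer as a per-task argument.
    assert tokenizer is not None and max_length is not None
    return [sample]  # placeholder: the real code splits at turn boundaries


def split_all_sketch(content, tokenizer_, max_length_):
    global tokenizer, max_length
    # Set the globals before the pool forks its workers.
    tokenizer, max_length = tokenizer_, max_length_
    new_content = []
    ctx = multiprocessing.get_context("fork")
    with ProcessPoolExecutor(mp_context=ctx) as executor:
        # chunksize=1000 mirrors the 1000-item chunks mentioned above.
        for result in executor.map(split_one_sample_sketch, content, chunksize=1000):
            new_content.extend(result)
    return new_content
```

Passing the tokenizer through globals rather than as a task argument avoids pickling it for every submitted item.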
## Usage

### CLI Invocation

```shell
python3 -m fastchat.data.split_long_conversation \
    --in-file sharegpt_clean_lang.json \
    --out-file sharegpt_clean_lang_split.json \
    --model-name-or-path meta-llama/Llama-2-7b-chat-hf \
    --max-length 2048
```
### CLI Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| --in-file | str | Yes | -- | Path to input JSON file (language-filtered conversations) |
| --out-file | str | No | sharegpt_split.json | Path to output JSON file |
| --begin | int | No | None | Start index for slicing input content |
| --end | int | No | None | End index for slicing input content |
| --model-name-or-path | str | Yes | -- | HuggingFace model name or local path (used to load the tokenizer) |
| --max-length | int | No | 2048 | Maximum token length per sub-conversation |
### Programmatic Import

```python
from fastchat.data.split_long_conversation import split_all, split_one_sample, filter_invalid_roles
```
## Code Reference

### Source Location

| Item | Location |
|---|---|
| Module | fastchat/data/split_long_conversation.py |
| Core Functions | Lines 30-102 |
| make_sample | Lines 18-24 |
| Repository | github.com/lm-sys/FastChat |
### Function Signatures

```python
def split_all(content, begin, end, tokenizer_, max_length_) -> list[dict]:
    """
    Keep the maximum number of conversation rounds within the max token length constraint.

    Processes conversations in parallel using ProcessPoolExecutor with 1000-item chunks.
    Sets the global tokenizer and max_length, then dispatches to worker functions.

    Args:
        content: list of conversation dicts
        begin: start index for slicing (can be None)
        end: end index for slicing (can be None)
        tokenizer_: HuggingFace tokenizer instance
        max_length_: maximum token length per sub-conversation

    Returns:
        List of split conversation dicts.
    """
```
```python
def split_one_sample(sample) -> list[dict]:
    """
    Split one conversation at turn boundaries based on token count.

    Uses the global tokenizer to measure each turn's length (+6 token overhead).
    Iterates through human/gpt pairs, accumulating token counts, and creates
    a new sub-conversation when the accumulated length exceeds max_length.

    Args:
        sample: dict with "id", "conversations", and optional "model" keys.

    Returns:
        List of sub-conversation dicts. Empty list if the input is invalid
        (odd number of turns or fewer than 2 turns).
    """
```
```python
def filter_invalid_roles(content) -> list[dict]:
    """
    Filter out conversations that don't have strictly alternating human/gpt roles.

    Ensures each conversation starts with 'human' and alternates correctly.

    Args:
        content: list of conversation dicts.

    Returns:
        Filtered list containing only valid conversations.
    """
```
```python
def make_sample(sample, start_idx, end_idx) -> dict:
    """
    Create a sub-conversation dict from a slice of the original conversation.

    Generates a new ID by appending '_<start_idx>' to the original ID.

    Args:
        sample: original conversation dict
        start_idx: start turn index (inclusive)
        end_idx: end turn index (exclusive)

    Returns:
        New conversation dict with sliced conversations and derived ID.
    """
```
### Import

```python
from fastchat.data.split_long_conversation import split_all
from fastchat.data.split_long_conversation import split_one_sample
from fastchat.data.split_long_conversation import filter_invalid_roles
from fastchat.data.split_long_conversation import make_sample
```
## I/O Contract

### Inputs

| Input | Type | Description |
|---|---|---|
| in_file | JSON file | Language-filtered ShareGPT JSON: a list of dicts with "id", "conversations", and optional "model" fields. Conversations should already be in Markdown format. |
| tokenizer model | HuggingFace model | The model name or path used to instantiate transformers.AutoTokenizer. This determines how text is tokenized for length calculation. |
### Outputs

| Output | Type | Description |
|---|---|---|
| out_file | JSON file | Split JSON where long conversations have been divided into multiple sub-conversations. Each sub-conversation fits within max_length tokens and has a unique ID (original ID + start index suffix). |

Example: a conversation with ID "abc123" that spans 6000 tokens with max_length=2048 might produce:

- "abc123_0" -- turns 0-5 (~1800 tokens)
- "abc123_6" -- turns 6-11 (~2000 tokens)
- "abc123_12" -- turns 12-15 (~1500 tokens)
## Dependencies

| Package | Purpose |
|---|---|
| transformers | AutoTokenizer for token-based length measurement |
| tqdm | Progress bar for parallel processing |
## Usage Examples

### Pipeline Usage (from prepare_all.py)

```shell
python3 -m fastchat.data.split_long_conversation \
    --in ~/datasets/sharegpt_20230521_4k_clean_lang.json \
    --out ~/datasets/sharegpt_20230521_4k_clean_lang_split.json \
    --model-name meta-llama/Llama-2-7b-chat-hf \
    --max-length 4096
```

Note that --in, --out, and --model-name are unambiguous prefixes of --in-file, --out-file, and --model-name-or-path, which argparse accepts by default.
### Programmatic Usage

```python
import json

import transformers

from fastchat.data.split_long_conversation import split_all, filter_invalid_roles

tokenizer = transformers.AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    model_max_length=2048,
    padding_side="right",
    use_fast=False,
)

with open("sharegpt_clean_lang.json", "r") as f:
    content = json.load(f)

split_content = split_all(
    content, begin=0, end=len(content), tokenizer_=tokenizer, max_length_=2048
)
valid_content = filter_invalid_roles(split_content)
print(f"#in: {len(content)}, #out: {len(valid_content)}")

with open("sharegpt_split.json", "w") as f:
    json.dump(valid_content, f, indent=2, ensure_ascii=False)
```
## Related Pages