
Implementation:Lm sys FastChat Split Long Conversation

From Leeroopedia


Field Value
Page Type Implementation
Title Split Long Conversation
Repository lm-sys/FastChat
Knowledge Sources Source Code Analysis, API Documentation
Domains Data Preprocessing, NLP Pipeline, Tokenization, Context Window Management
Last Updated 2026-02-07 14:00 GMT

Overview

Split Long Conversation is the implementation module that splits conversations exceeding a model's maximum token length into smaller sub-conversations at turn boundaries. It uses a model-specific tokenizer for accurate token counting and processes conversations in parallel via ProcessPoolExecutor with 1000-item chunks.

Description

This module reads a cleaned and language-filtered JSON file of ShareGPT conversations, tokenizes each turn to measure length, and splits any conversation that exceeds the configured maximum token length. Splits always occur between human/gpt turn pairs to maintain conversational integrity. After splitting, an additional pass filters out any sub-conversations with invalid role alternation.
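That final role-validation pass can be sketched as follows. This is an illustrative re-implementation, not the module's exact source; the function name carries a `_sketch` suffix to make that clear.

```python
def filter_invalid_roles_sketch(content):
    # Keep only conversations whose turns strictly alternate
    # human, gpt, human, gpt, ... starting with "human".
    new_content = []
    for sample in content:
        convs = sample["conversations"]
        valid = len(convs) > 0 and all(
            turn["from"] == ("human" if i % 2 == 0 else "gpt")
            for i, turn in enumerate(convs)
        )
        if valid:
            new_content.append(sample)
    return new_content
```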

The module uses module-level global variables (tokenizer and max_length) to share the tokenizer instance across parallel workers, as these are set by the split_all function before spawning worker processes.

Usage

CLI Invocation

python3 -m fastchat.data.split_long_conversation \
    --in-file sharegpt_clean_lang.json \
    --out-file sharegpt_clean_lang_split.json \
    --model-name-or-path meta-llama/Llama-2-7b-chat-hf \
    --max-length 2048

CLI Parameters

Parameter Type Required Default Description
--in-file str Yes -- Path to input JSON file (language-filtered conversations)
--out-file str No sharegpt_split.json Path to output JSON file
--begin int No None Start index for slicing input content
--end int No None End index for slicing input content
--model-name-or-path str Yes -- HuggingFace model name or local path (used to load tokenizer)
--max-length int No 2048 Maximum token length per sub-conversation

Programmatic Import

from fastchat.data.split_long_conversation import split_all, split_one_sample, filter_invalid_roles

Code Reference

Source Location

Item Location
Module fastchat/data/split_long_conversation.py
Core Functions Lines 30-102
make_sample Lines 18-24
Repository github.com/lm-sys/FastChat

Function Signatures

def split_all(content, begin, end, tokenizer_, max_length_) -> list[dict]:
    """
    Keep the maximum round of conversations within the max token length constraint.
    Processes conversations in parallel using ProcessPoolExecutor with 1000-item chunks.
    Sets global tokenizer and max_length, then dispatches to worker functions.

    Args:
        content: list of conversation dicts
        begin: start index for slicing (can be None)
        end: end index for slicing (can be None)
        tokenizer_: HuggingFace tokenizer instance
        max_length_: maximum token length per sub-conversation

    Returns:
        List of split conversation dicts.
    """

def split_one_sample(sample) -> list[dict]:
    """
    Split one conversation at turn boundaries based on token count.
    Uses the global tokenizer to measure each turn's length (+6 token overhead).
    Iterates through human/gpt pairs, accumulating token counts, and creates
    a new sub-conversation when the accumulated length exceeds max_length.

    Args:
        sample: dict with "id", "conversations", and optional "model" keys.

    Returns:
        List of sub-conversation dicts. Empty list if input is invalid
        (odd number of turns or fewer than 2 turns).
    """

def filter_invalid_roles(content) -> list[dict]:
    """
    Filter out conversations that don't have strictly alternating human/gpt roles.
    Ensures each conversation starts with 'human' and alternates correctly.

    Args:
        content: list of conversation dicts.

    Returns:
        Filtered list containing only valid conversations.
    """

def make_sample(sample, start_idx, end_idx) -> dict:
    """
    Create a sub-conversation dict from a slice of the original conversation.
    Generates a new ID by appending '_<start_idx>' to the original ID.

    Args:
        sample: original conversation dict
        start_idx: start turn index (inclusive)
        end_idx: end turn index (exclusive)

    Returns:
        New conversation dict with sliced conversations and derived ID.
    """
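The splitting logic described by these signatures can be sketched as a simplified re-implementation. This is for illustration only, not the module's exact source: `token_len` stands in for the HuggingFace tokenizer, and the +6 constant models the per-turn overhead mentioned in the split_one_sample docstring.

```python
def make_sample_sketch(sample, start_idx, end_idx):
    # Sub-conversation ID = original ID + "_" + start turn index.
    return {
        "id": f'{sample["id"]}_{start_idx}',
        "model": sample.get("model", ""),
        "conversations": sample["conversations"][start_idx:end_idx],
    }

def split_one_sample_sketch(sample, token_len, max_length):
    convs = sample["conversations"]
    if len(convs) < 2 or len(convs) % 2 != 0:
        return []  # invalid: need at least one complete human/gpt pair
    # Measure each turn once; +6 approximates per-turn template overhead.
    lens = [token_len(c["value"]) + 6 for c in convs]
    out, start_idx, cur_len = [], 0, 0
    for i in range(0, len(convs), 2):
        pair_len = lens[i] + lens[i + 1]
        if cur_len + pair_len > max_length and i > start_idx:
            # Cut before this pair so splits land on turn boundaries.
            out.append(make_sample_sketch(sample, start_idx, i))
            start_idx, cur_len = i, 0
        cur_len += pair_len
    out.append(make_sample_sketch(sample, start_idx, len(convs)))
    return out
```

With `max_length=20` and four two-word turns (8 measured tokens each), the sketch yields two sub-conversations whose IDs carry the start-index suffixes, matching the naming scheme shown in the I/O Contract below.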

Import

from fastchat.data.split_long_conversation import (
    split_all,
    split_one_sample,
    filter_invalid_roles,
    make_sample,
)

I/O Contract

Inputs

Input Type Description
in_file JSON file Language-filtered ShareGPT JSON: a list of dicts with "id", "conversations", and optional "model" fields. Conversations should already be in Markdown format.
tokenizer model HuggingFace model The model name or path used to instantiate transformers.AutoTokenizer. This determines how text is tokenized for length calculation.

Outputs

Output Type Description
out_file JSON file Split JSON where long conversations have been divided into multiple sub-conversations. Each sub-conversation fits within max_length tokens and has a unique ID (original ID + start index suffix).

Example: A conversation with ID "abc123" that spans 6000 tokens with max_length=2048 might produce:

  • "abc123_0" -- turns 0-5 (~1800 tokens)
  • "abc123_6" -- turns 6-11 (~2000 tokens)
  • "abc123_12" -- turns 12-15 (~1500 tokens)

Dependencies

Package Purpose
transformers AutoTokenizer for token-based length measurement
tqdm Progress bar for parallel processing

Usage Examples

Pipeline Usage (from prepare_all.py)

python3 -m fastchat.data.split_long_conversation \
    --in-file ~/datasets/sharegpt_20230521_4k_clean_lang.json \
    --out-file ~/datasets/sharegpt_20230521_4k_clean_lang_split.json \
    --model-name-or-path meta-llama/Llama-2-7b-chat-hf \
    --max-length 4096

Programmatic Usage

import json
import transformers
from fastchat.data.split_long_conversation import split_all, filter_invalid_roles

tokenizer = transformers.AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    model_max_length=2048,
    padding_side="right",
    use_fast=False,
)

with open("sharegpt_clean_lang.json", "r") as f:
    content = json.load(f)
split_content = split_all(content, begin=0, end=len(content), tokenizer_=tokenizer, max_length_=2048)
valid_content = filter_invalid_roles(split_content)

print(f"#in: {len(content)}, #out: {len(valid_content)}")
with open("sharegpt_split.json", "w") as f:
    json.dump(valid_content, f, indent=2, ensure_ascii=False)
