Implementation:Lm sys FastChat Filter Wrong Format

Field	Value
Page Type	Implementation
Title	Filter Wrong Format
Repository	lm-sys/FastChat
Knowledge Sources	Source Code Analysis, API Documentation
Domains	Data Preprocessing, NLP Pipeline, Data Quality
Last Updated	2026-02-07 14:00 GMT

Overview

Filter Wrong Format is the implementation module that detects and removes conversations containing malformed numbered list patterns from the ShareGPT training data. It uses a single compiled regex pattern to identify duplicate "1." list items that indicate broken sequential numbering, and excludes any conversation containing such artifacts.

Description

This module provides a lightweight filtering step that reads a JSON file of ShareGPT conversations, checks each conversation for the wrong-format pattern, and writes out only the conversations that pass the check. The core logic resides in a single function, should_skip, which iterates over all turns in a conversation and tests each value against the compiled regex.

The module is intentionally minimal -- it addresses one specific, well-defined data quality issue. In the default pipeline configuration, it overwrites its input file, acting as an in-place filter.
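The filtering logic described above can be sketched as follows. This is a minimal reconstruction based on the description on this page, not a verbatim copy of the module. The helper `filter_conversations` is a hypothetical name introduced for illustration, and the pattern is written as a raw string, an assumption made here to avoid invalid-escape warnings on newer Python versions.

```python
import re

# Same pattern as documented on this page, written as a raw string (assumption).
wrong_indices_pattern = re.compile(r"\n1\. [^2]*\n1\. ")


def should_skip(conv):
    """Return True if any turn contains the wrong-format pattern."""
    for turn in conv["conversations"]:
        if wrong_indices_pattern.search(turn["value"]) is not None:
            return True
    return False


def filter_conversations(convs):
    """Hypothetical helper: keep only conversations that pass the check."""
    return [conv for conv in convs if not should_skip(conv)]


sample = [
    {"id": "a", "conversations": [{"from": "gpt", "value": "Tips:\n1. One\n1. Two"}]},
    {"id": "b", "conversations": [{"from": "gpt", "value": "Tips:\n1. One\n2. Two"}]},
]
kept = filter_conversations(sample)
print([conv["id"] for conv in kept])  # ['b']
```

A single offending turn is enough to drop the whole conversation; the remaining turns are not inspected once a match is found.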

Usage

CLI Invocation

python3 -m fastchat.data.filter_wrong_format --in-file input.json --out-file output.json

CLI Parameters

Parameter	Type	Required	Default	Description
`--in-file`	str	Yes	--	Path to input JSON file (split conversations)
`--out-file`	str	Yes	--	Path to output JSON file (can be same as input for in-place filtering)
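The two parameters above can be illustrated with a small argparse sketch. It assumes the module builds its parser with `argparse` and the flag names from the table; the file names are placeholders. It also shows that argparse's default prefix matching accepts unambiguous abbreviations such as `--in` and `--out`.

```python
import argparse

# Sketch of a parser matching the parameter table above (assumed, not copied).
parser = argparse.ArgumentParser()
parser.add_argument("--in-file", type=str, required=True)
parser.add_argument("--out-file", type=str, required=True)

# Full flag names.
args = parser.parse_args(["--in-file", "input.json", "--out-file", "output.json"])
print(args.in_file, args.out_file)

# Unambiguous prefixes are resolved by argparse's default allow_abbrev=True.
short = parser.parse_args(["--in", "input.json", "--out", "output.json"])
print(short.in_file, short.out_file)
```

Spelling out the full flag names is the safer choice in scripts, since abbreviations stop working if another option with the same prefix is ever added.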

Programmatic Import

from fastchat.data.filter_wrong_format import should_skip

Code Reference

Source Location

Item	Location
Module	`fastchat/data/filter_wrong_format.py`
should_skip function	Lines 17-25
wrong_indices_pattern	Line 14
Main/CLI	Lines 28-44
Repository	github.com/lm-sys/FastChat

Function Signatures

# Module-level compiled regex pattern
wrong_indices_pattern = re.compile("\n1\. [^2]*\n1\. ")

def should_skip(conv) -> bool:
    """
    Check if a conversation contains wrong list indices.

    Iterates over all turns (both human and gpt) in the conversation
    and searches for the wrong_indices_pattern. This pattern matches
    text where two '1.' list markers appear without a '2' character
    between them, indicating duplicate/broken list numbering.

    Args:
        conv: dict with "conversations" key containing list of
              {"from": str, "value": str} dicts.

    Returns:
        True if conversation contains malformed numbering and should
        be skipped (excluded), False otherwise.
    """

Pattern Details

The compiled regex "\n1\. [^2]*\n1\. " breaks down as:

Component	Meaning
`\n`	Newline character
`1\.`	Literal "1." (escaped dot)
` `	Space after the number
`[^2]*`	Zero or more characters that are NOT "2" (newlines included, so the match can span several lines)
`\n`	Another newline
`1\.`	Another literal "1."
` `	Space after the second number

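This pattern catches cases like \n1. Some text here\n1. More text, where the second item should have been numbered "2." or higher. Because `[^2]*` stops at any "2" character, not only at a "2." list marker, the check is a heuristic rather than a full list parser.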
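The breakdown above can be checked against a few inputs. The following sketch applies the pattern, written here as a raw string, to hand-made strings; the case names and texts are illustrative assumptions, not samples from the ShareGPT data.

```python
import re

pattern = re.compile(r"\n1\. [^2]*\n1\. ")

cases = {
    # Two "1." markers with no "2" in between: flagged.
    "duplicate_one": "Here are tips:\n1. First tip\n1. Second tip",
    # Correct numbering: only one "\n1. " exists, so no match.
    "correct_numbering": "Here are tips:\n1. First tip\n2. Second tip",
    # A digit "2" inside the first item blocks [^2]*, so this is missed.
    "digit_two_in_text": "Steps:\n1. Buy 2 apples\n1. Eat them",
    # No newline before the first "1.", so the pattern cannot start there.
    "list_at_start": "1. First\n1. Second",
    # Two separate one-item lists also match, since [^2]* crosses newlines.
    "two_separate_lists": "A:\n1. x\n\nB:\n1. y",
}

results = {name: pattern.search(text) is not None for name, text in cases.items()}
for name, matched in results.items():
    print(f"{name}: {matched}")
```

Only `duplicate_one` and `two_separate_lists` match. The other three cases show where the heuristic stays silent, including a malformed list that happens to contain the digit "2".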

I/O Contract

Inputs

Input	Type	Description
in_file	JSON file	Split ShareGPT JSON from the conversation splitting step: a list of dicts with `"id"` and `"conversations"` fields.

Outputs

Output	Type	Description
out_file	JSON file	Format-validated JSON: same structure as input, with conversations containing malformed numbered lists removed. In the pipeline, this typically overwrites the input file.
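The record shape in the contract above can be expressed as a small check. This is a sketch under the assumption that each turn is a dict with string `"from"` and `"value"` fields, as described in the `should_skip` docstring; `looks_like_split_record` is a hypothetical helper and is not part of FastChat.

```python
from typing import Any


def looks_like_split_record(record: Any) -> bool:
    """Hypothetical check: does the record have the fields described above?"""
    if not isinstance(record, dict):
        return False
    if "id" not in record or not isinstance(record.get("conversations"), list):
        return False
    # Every turn needs string "from" and "value" fields.
    return all(
        isinstance(turn, dict)
        and isinstance(turn.get("from"), str)
        and isinstance(turn.get("value"), str)
        for turn in record["conversations"]
    )


example = {
    "id": "abc_0",
    "conversations": [
        {"from": "human", "value": "List some tips"},
        {"from": "gpt", "value": "Here are tips:\n1. First tip\n2. Second tip"},
    ],
}
print(looks_like_split_record(example))
```

Running a check like this before filtering surfaces malformed records early, instead of as a `KeyError` inside `should_skip`.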

Dependencies

Package	Purpose
re	Standard library regex module (compiled pattern)
tqdm	Progress bar

tqdm is the only third-party dependency; everything else the module uses comes from the Python standard library.

Usage Examples

Pipeline Usage (from prepare_all.py)

# Note: in the pipeline, input and output are the same file (in-place filtering)
python3 -m fastchat.data.filter_wrong_format \
    --in-file ~/datasets/sharegpt_20230521_4k_clean_lang_split.json \
    --out-file ~/datasets/sharegpt_20230521_4k_clean_lang_split.json

Programmatic Usage

import json
from fastchat.data.filter_wrong_format import should_skip

# Load the split ShareGPT conversations (a list of dicts)
with open("sharegpt_split.json", "r") as f:
    content = json.load(f)

# Keep only the conversations that pass the format check
filtered = [conv for conv in content if not should_skip(conv)]

print(f"#in: {len(content)}, #out: {len(filtered)}")
with open("sharegpt_split_filtered.json", "w") as f:
    json.dump(filtered, f, indent=2, ensure_ascii=False)

Testing the Pattern

from fastchat.data.filter_wrong_format import should_skip

# This should be skipped (duplicate "1." items)
bad_conv = {
    "id": "test_bad",
    "conversations": [
        {"from": "human", "value": "List some tips"},
        {"from": "gpt", "value": "Here are tips:\n1. First tip\n1. Second tip"}
    ]
}
assert should_skip(bad_conv)

# This should NOT be skipped (correct numbering)
good_conv = {
    "id": "test_good",
    "conversations": [
        {"from": "human", "value": "List some tips"},
        {"from": "gpt", "value": "Here are tips:\n1. First tip\n2. Second tip"}
    ]
}
assert not should_skip(good_conv)

Related Pages

Principle:Lm_sys_FastChat_Conversation_Format_Validation -- The principle that this implementation realizes
Implementation:Lm_sys_FastChat_Split_Long_Conversation -- Previous pipeline step: conversation splitting
Implementation:Lm_sys_FastChat_Split_Train_Test -- Next pipeline step: train/test splitting
