Implementation:Lm sys FastChat Filter Wrong Format



Field Value
Page Type Implementation
Title Filter Wrong Format
Repository lm-sys/FastChat
Knowledge Sources Source Code Analysis, API Documentation
Domains Data Preprocessing, NLP Pipeline, Data Quality
Last Updated 2026-02-07 14:00 GMT

Overview

Filter Wrong Format is the implementation module that detects and removes conversations containing malformed numbered list patterns from the ShareGPT training data. It uses a single compiled regex pattern to identify duplicate "1." list items that indicate broken sequential numbering, and excludes any conversation containing such artifacts.

Description

This module provides a lightweight filtering step that reads a JSON file of ShareGPT conversations, checks each conversation for the wrong-format pattern, and writes out only the conversations that pass the check. The core logic resides in a single function, should_skip, which iterates over all turns in a conversation and tests each value against the compiled regex.

The module is intentionally minimal -- it addresses one specific, well-defined data quality issue. In the default pipeline configuration, it overwrites its input file, acting as an in-place filter.
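The check described above can be sketched in a few lines. This is a minimal reconstruction from the description, not a copy of the module: the name should_skip_sketch is hypothetical, and the pattern is written as a raw string, which matches the same text as the pattern listed under Function Signatures.

```python
import re

# Raw-string form of the documented pattern (assumed equivalent in matching behavior)
wrong_indices_pattern = re.compile(r"\n1\. [^2]*\n1\. ")


def should_skip_sketch(conv):
    """Return True if any turn contains two '1.' items with no '2' between them."""
    for turn in conv["conversations"]:
        # Both human and gpt turns are checked; only the "value" text matters
        if wrong_indices_pattern.search(turn["value"]) is not None:
            return True
    return False
```

Returning on the first match keeps the check cheap: one malformed turn is enough to exclude the whole conversation.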

Usage

CLI Invocation

python3 -m fastchat.data.filter_wrong_format --in-file input.json --out-file output.json

CLI Parameters

Parameter Type Required Default Description
--in-file str Yes -- Path to input JSON file (split conversations)
--out-file str Yes -- Path to output JSON file (can be same as input for in-place filtering)
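The pipeline invocation elsewhere on this page passes the shortened flags --in and --out, while the table above lists --in-file and --out-file. Both spellings would work if the CLI is built with argparse and its default prefix matching, which is an assumption here rather than something this page states. The sketch below uses a hypothetical build_parser helper to show the behavior.

```python
import argparse


def build_parser():
    # Hypothetical parser mirroring the two documented parameters
    parser = argparse.ArgumentParser()
    parser.add_argument("--in-file", type=str, required=True)
    parser.add_argument("--out-file", type=str, required=True)
    return parser


# argparse accepts unambiguous prefixes by default (allow_abbrev=True),
# so "--in" resolves to "--in-file" and "--out" to "--out-file"
args = build_parser().parse_args(["--in", "input.json", "--out", "output.json"])
print(args.in_file, args.out_file)
```

Spelling the flags out in full avoids relying on prefix matching, which stops working as soon as a second option with the same prefix is added.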

Programmatic Import

from fastchat.data.filter_wrong_format import should_skip

Code Reference

Source Location

Item Location
Module fastchat/data/filter_wrong_format.py
should_skip function Lines 17-25
wrong_indices_pattern Line 14
Main/CLI Lines 28-44
Repository github.com/lm-sys/FastChat

Function Signatures

# Module-level compiled regex pattern
wrong_indices_pattern = re.compile("\n1\. [^2]*\n1\. ")

def should_skip(conv) -> bool:
    """
    Check if a conversation contains wrong list indices.

    Iterates over all turns (both human and gpt) in the conversation
    and searches for the wrong_indices_pattern. This pattern matches
    text where two '1.' list markers appear without a '2' character
    between them, indicating duplicate/broken list numbering.

    Args:
        conv: dict with "conversations" key containing list of
              {"from": str, "value": str} dicts.

    Returns:
        True if conversation contains malformed numbering and should
        be skipped (excluded), False otherwise.
    """

Pattern Details

The compiled regex "\n1\. [^2]*\n1\. " breaks down as:

Component Meaning
\n Newline character
1\. Literal "1." (escaped dot)
" " A single space after the number
[^2]* Zero or more characters that are not "2" (newlines included, so the match can span several lines)
\n Another newline
1\. Another literal "1."
" " A single space after the second number

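This pattern catches cases like "\n1. Some text here\n1. More text", where the second item should have been numbered "2." or higher. Because [^2]* rejects every "2" character, not only a "2." list marker, any digit 2 between the two items (for example in "2023") prevents a match. The first "1." must also be preceded by a newline, so a list that starts at the very beginning of a turn is not detected.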
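A few probes make the pattern's edges concrete. The sample strings are invented for illustration, and the pattern is written as a raw string that matches the same text as the documented one.

```python
import re

pattern = re.compile(r"\n1\. [^2]*\n1\. ")

# Hypothetical sample texts covering the main cases
samples = {
    "duplicate": "Tips:\n1. First\n1. Second",
    "spans_lines": "Tips:\n1. First\n- note\n1. Second",
    "digit_2_between": "Tips:\n1. Buy 2 apples\n1. Done",
    "no_leading_newline": "1. First\n1. Second",
    "correct": "Tips:\n1. First\n2. Second",
}

results = {name: pattern.search(text) is not None for name, text in samples.items()}

for name, matched in results.items():
    print(f"{name}: {matched}")
# duplicate: True
# spans_lines: True
# digit_2_between: False
# no_leading_newline: False
# correct: False
```

The digit_2_between case is a false negative of the heuristic: the list numbering is broken, but the stray "2" in the first item hides it from the regex.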

Import

from fastchat.data.filter_wrong_format import should_skip

I/O Contract

Inputs

Input Type Description
in_file JSON file Split ShareGPT JSON from the conversation splitting step: a list of dicts with "id" and "conversations" fields.

Outputs

Output Type Description
out_file JSON file Format-validated JSON: same structure as input, with conversations containing malformed numbered lists removed. In the pipeline, this typically overwrites the input file.
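Since the output path may equal the input path, the input has to be read completely before the output is opened for writing. The sketch below illustrates that ordering with a hypothetical helper, filter_json_file, and a stand-in predicate; it is not the module's own main routine.

```python
import json
import os
import tempfile


def filter_json_file(in_file, out_file, should_skip):
    # Load everything first so that out_file may safely equal in_file
    with open(in_file, "r", encoding="utf-8") as f:
        content = json.load(f)
    kept = [conv for conv in content if not should_skip(conv)]
    with open(out_file, "w", encoding="utf-8") as f:
        json.dump(kept, f, indent=2, ensure_ascii=False)
    return len(content), len(kept)


# Demo on a temporary file, filtering in place with a stand-in predicate
records = [
    {"id": "a", "conversations": []},
    {"id": "b", "conversations": []},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "split.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f)
    n_in, n_out = filter_json_file(path, path, lambda conv: conv["id"] == "b")
    with open(path, "r", encoding="utf-8") as f:
        remaining = json.load(f)
```

The output keeps the input structure: each surviving record still carries its "id" and "conversations" fields unchanged.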

Dependencies

Package Purpose
re Standard library regex module (compiled pattern)
tqdm Progress bar

tqdm is the only third-party dependency; everything else comes from the Python standard library.

Usage Examples

Pipeline Usage (from prepare_all.py)

# Note: in the pipeline, input and output are the same file (in-place filtering)
python3 -m fastchat.data.filter_wrong_format \
    --in ~/datasets/sharegpt_20230521_4k_clean_lang_split.json \
    --out ~/datasets/sharegpt_20230521_4k_clean_lang_split.json

Programmatic Usage

import json
from fastchat.data.filter_wrong_format import should_skip

# Load the split ShareGPT conversations
with open("sharegpt_split.json", "r") as f:
    content = json.load(f)

# Keep only conversations without malformed numbering
filtered = [conv for conv in content if not should_skip(conv)]
print(f"#in: {len(content)}, #out: {len(filtered)}")

with open("sharegpt_split_filtered.json", "w") as f:
    json.dump(filtered, f, indent=2, ensure_ascii=False)

Testing the Pattern

from fastchat.data.filter_wrong_format import should_skip

# This should be skipped (duplicate "1." items)
bad_conv = {
    "id": "test_bad",
    "conversations": [
        {"from": "human", "value": "List some tips"},
        {"from": "gpt", "value": "Here are tips:\n1. First tip\n1. Second tip"}
    ]
}
assert should_skip(bad_conv)

# This should NOT be skipped (correct numbering)
good_conv = {
    "id": "test_good",
    "conversations": [
        {"from": "human", "value": "List some tips"},
        {"from": "gpt", "value": "Here are tips:\n1. First tip\n2. Second tip"}
    ]
}
assert not should_skip(good_conv)
