Implementation:Lm_sys_FastChat_Filter_Wrong_Format
| Field | Value |
|---|---|
| Page Type | Implementation |
| Title | Filter Wrong Format |
| Repository | lm-sys/FastChat |
| Knowledge Sources | Source Code Analysis, API Documentation |
| Domains | Data Preprocessing, NLP Pipeline, Data Quality |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Filter Wrong Format is the implementation module that detects and removes conversations containing malformed numbered list patterns from the ShareGPT training data. It uses a single compiled regex pattern to identify duplicate "1." list items that indicate broken sequential numbering, and excludes any conversation containing such artifacts.
Description
This module provides a lightweight filtering step that reads a JSON file of ShareGPT conversations, checks each conversation for the wrong-format pattern, and writes out only the conversations that pass the check. The core logic resides in a single function, should_skip, which iterates over all turns in a conversation and tests each value against the compiled regex.
The module is intentionally minimal -- it addresses one specific, well-defined data quality issue. In the default pipeline configuration, it overwrites its input file, acting as an in-place filter.
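The read-filter-write flow described above can be sketched as a standalone function (a hypothetical reconstruction using the pattern documented later on this page, not the verbatim source; `run_filter` is an illustrative name):

```python
import json
import re

# Pattern from the module: two "1." list markers with no "2" between them
pattern = re.compile("\n1\\. [^2]*\n1\\. ")


def run_filter(in_file: str, out_file: str) -> None:
    # out_file may equal in_file, giving in-place filtering as in the pipeline
    with open(in_file) as f:
        content = json.load(f)

    # Keep only conversations where no turn matches the wrong-format pattern
    kept = [
        conv
        for conv in content
        if not any(pattern.search(turn["value"]) for turn in conv["conversations"])
    ]
    print(f"#in: {len(content)} #out: {len(kept)}")

    with open(out_file, "w") as f:
        json.dump(kept, f, indent=2, ensure_ascii=False)
```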
Usage
CLI Invocation
```shell
python3 -m fastchat.data.filter_wrong_format --in-file input.json --out-file output.json
```
CLI Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `--in-file` | str | Yes | -- | Path to input JSON file (split conversations) |
| `--out-file` | str | Yes | -- | Path to output JSON file (can be same as input for in-place filtering) |
Programmatic Import
```python
from fastchat.data.filter_wrong_format import should_skip
```
Code Reference
Source Location
| Item | Location |
|---|---|
| Module | `fastchat/data/filter_wrong_format.py` |
| `should_skip` function | Lines 17-25 |
| `wrong_indices_pattern` | Line 14 |
| Main/CLI | Lines 28-44 |
| Repository | github.com/lm-sys/FastChat |
Function Signatures
```python
import re

# Module-level compiled regex pattern
wrong_indices_pattern = re.compile("\n1\. [^2]*\n1\. ")


def should_skip(conv) -> bool:
    """
    Check if a conversation contains wrong list indices.

    Iterates over all turns (both human and gpt) in the conversation
    and searches for the wrong_indices_pattern. This pattern matches
    text where two '1.' list markers appear without a '2' character
    between them, indicating duplicate/broken list numbering.

    Args:
        conv: dict with "conversations" key containing list of
            {"from": str, "value": str} dicts.

    Returns:
        True if conversation contains malformed numbering and should
        be skipped (excluded), False otherwise.
    """
```
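A body consistent with this signature and docstring can be sketched as follows (a minimal reconstruction, not the verbatim source):

```python
import re

wrong_indices_pattern = re.compile("\n1\\. [^2]*\n1\\. ")


def should_skip(conv) -> bool:
    # Scan every turn; a single match anywhere excludes the whole conversation
    for turn in conv["conversations"]:
        if wrong_indices_pattern.search(turn["value"]) is not None:
            return True
    return False
```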
Pattern Details
The compiled regex `"\n1\. [^2]*\n1\. "` breaks down as:
| Component | Meaning |
|---|---|
| `\n` | Newline character |
| `1\.` | Literal "1." (escaped dot) |
| ` ` (space) | Space after the number |
| `[^2]*` | Zero or more characters that are NOT "2" |
| `\n` | Another newline |
| `1\.` | Another literal "1." |
| ` ` (space) | Space after the second number |
This pattern catches cases like `\n1. Some text here\n1. More text`, where the second item should have been numbered "2." or higher.
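The breakdown above can be verified directly with `re.search` (sample strings are illustrative):

```python
import re

pattern = re.compile("\n1\\. [^2]*\n1\\. ")

# Duplicate "1." markers with no "2" between them: matches
assert pattern.search("Tips:\n1. First tip\n1. Second tip") is not None

# Correct sequential numbering: the "2" blocks [^2]*, so no match
assert pattern.search("Tips:\n1. First tip\n2. Second tip") is None
```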
I/O Contract
Inputs
| Input | Type | Description |
|---|---|---|
| in_file | JSON file | Split ShareGPT JSON from the conversation splitting step: a list of dicts with "id" and "conversations" fields. |
Outputs
| Output | Type | Description |
|---|---|---|
| out_file | JSON file | Format-validated JSON: same structure as input, with conversations containing malformed numbered lists removed. In the pipeline, this typically overwrites the input file. |
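For reference, a single record matching this contract looks like the following (field values are illustrative, not from the real dataset):

```python
import json

# One record from the split ShareGPT file: an "id" plus a list of turns
record = {
    "id": "abc123_0",
    "conversations": [
        {"from": "human", "value": "Give me two tips"},
        {"from": "gpt", "value": "Sure:\n1. Tip one\n2. Tip two"},
    ],
}

# The file itself is a JSON array of such records
serialized = json.dumps([record], indent=2, ensure_ascii=False)
parsed = json.loads(serialized)
```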
Dependencies
| Package | Purpose |
|---|---|
| re | Standard library regex module (compiled pattern) |
| tqdm | Progress bar |
No external dependencies beyond the Python standard library and tqdm.
Usage Examples
Pipeline Usage (from prepare_all.py)
```shell
# Note: in the pipeline, input and output are the same file (in-place filtering)
python3 -m fastchat.data.filter_wrong_format \
    --in-file ~/datasets/sharegpt_20230521_4k_clean_lang_split.json \
    --out-file ~/datasets/sharegpt_20230521_4k_clean_lang_split.json
```
Programmatic Usage
```python
import json
from fastchat.data.filter_wrong_format import should_skip

with open("sharegpt_split.json", "r") as f:
    content = json.load(f)

filtered = [conv for conv in content if not should_skip(conv)]
print(f"#in: {len(content)}, #out: {len(filtered)}")

with open("sharegpt_split_filtered.json", "w") as f:
    json.dump(filtered, f, indent=2, ensure_ascii=False)
```
Testing the Pattern
```python
from fastchat.data.filter_wrong_format import should_skip

# This should be skipped (duplicate "1." items)
bad_conv = {
    "id": "test_bad",
    "conversations": [
        {"from": "human", "value": "List some tips"},
        {"from": "gpt", "value": "Here are tips:\n1. First tip\n1. Second tip"},
    ],
}
assert should_skip(bad_conv)

# This should NOT be skipped (correct numbering)
good_conv = {
    "id": "test_good",
    "conversations": [
        {"from": "human", "value": "List some tips"},
        {"from": "gpt", "value": "Here are tips:\n1. First tip\n2. Second tip"},
    ],
}
assert not should_skip(good_conv)
```
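One edge case follows directly from the regex itself: because the pattern begins with `\n`, duplicated "1." items at the very start of a turn's value (with no preceding newline) are not flagged. This is a property of the compiled pattern, verifiable in isolation:

```python
import re

pattern = re.compile("\n1\\. [^2]*\n1\\. ")

# Duplicate numbering at the start of the string: only one "\n1. "
# exists, so the pattern cannot match
assert pattern.search("1. First tip\n1. Second tip") is None

# The same text preceded by a newline does match
assert pattern.search("\n1. First tip\n1. Second tip") is not None
```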
Related Pages
- Principle:Lm_sys_FastChat_Conversation_Format_Validation -- The principle that this implementation realizes
- Implementation:Lm_sys_FastChat_Split_Long_Conversation -- Previous pipeline step: conversation splitting
- Implementation:Lm_sys_FastChat_Split_Train_Test -- Next pipeline step: train/test splitting