Principle:Lm sys FastChat Conversation Format Validation

Field	Value
Page Type	Principle
Title	Conversation Format Validation
Repository	lm-sys/FastChat
Knowledge Sources	Source Code Analysis, API Documentation
Domains	Data Preprocessing, NLP Pipeline, Data Quality
Last Updated	2026-02-07 14:00 GMT

Overview

Conversation Format Validation is a data quality principle in the FastChat ShareGPT Data Pipeline that detects and removes conversations containing malformed numbered lists. Specifically, it targets a known artifact where GPT responses contain duplicate "1." list item markers without proper sequential numbering, indicating a corrupted or incorrectly rendered response.

Description

The Problem: Malformed Numbered Lists

A recurring quality issue in ShareGPT-sourced data involves numbered lists where multiple list items are incorrectly numbered as "1." instead of following sequential numbering (1, 2, 3, ...). This typically occurs when:

The HTML-to-Markdown conversion introduces rendering artifacts in ordered lists
The original ChatGPT response contained a formatting error
Browser rendering inconsistencies caused the exported HTML to flatten list numbering

For example, a malformed response might contain:

Here are some tips:
1. First tip about something
1. Second tip that should be numbered 2

Instead of the correct:

Here are some tips:
1. First tip about something
2. Second tip about something else

Regex-Based Pattern Matching

The detection mechanism uses a compiled regular expression pattern: \n1\. [^2]*\n1\.

This pattern matches text where:

A newline followed by "1. " introduces a list item
The content of that item does not contain the character "2" (ensuring we are not looking at a properly numbered list where "2." appears shortly after)
Another newline followed by "1. " appears, indicating a duplicate first-item marker

The regex check [^2]* between the two "1." markers serves as a heuristic: in a correctly formatted list, the text between "1." and the next numbered item would typically contain "2." (the next sequential number). The absence of "2" in this span strongly suggests the numbering is broken.

Conversation-Level Filtering

The filter examines all turns in a conversation (both human and gpt messages). If any turn matches the malformed list pattern, the entire conversation is excluded. This conservative approach ensures that training data does not contain any formatting artifacts that could teach the model to produce similarly malformed outputs.

Usage

In the standard FastChat pipeline, format validation is the fourth step, applied after conversation splitting:

python3 -m fastchat.data.filter_wrong_format \
    --in sharegpt_clean_lang_split.json \
    --out sharegpt_clean_lang_split.json

Note that in the default pipeline configuration (from prepare_all.py), the output file overwrites the input file. This is intentional -- the format filter is a pass-through that only removes problematic entries without changing any data structure.

Theoretical Basis

Format validation in training data is grounded in the principle of garbage in, garbage out:

Output format learning: Language models learn formatting patterns from their training data. If the training set contains malformed numbered lists, the model will reproduce these errors in its outputs, particularly when asked to generate lists.
Consistency over quantity: Removing a small number of malformed conversations has a negligible impact on dataset size but significantly improves the quality of list-formatted outputs. The trade-off strongly favors filtering.
Pattern specificity: The chosen regex pattern is deliberately narrow -- it targets only the specific "duplicate 1." artifact rather than attempting to validate all possible list formats. This minimizes false positives while catching the most common formatting error observed in the ShareGPT dataset.
Heuristic validation: The [^2]* component is a heuristic rather than a complete parser. It works well in practice because correctly numbered lists almost always have "2" appearing between consecutive "1." markers (as part of "2."), while malformed lists repeat "1." without any "2" in between.

Related Pages

Implementation:Lm_sys_FastChat_Filter_Wrong_Format
Implementation:Lm_sys_FastChat_Filter_Wrong_Format -- The implementation that realizes this principle
Principle:Lm_sys_FastChat_Long_Conversation_Splitting -- Previous pipeline stage: conversation splitting
Principle:Lm_sys_FastChat_Train_Test_Data_Splitting -- Next pipeline stage: train/test splitting

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment