Principle:Lm sys FastChat Conversation Format Validation
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | Conversation Format Validation |
| Repository | lm-sys/FastChat |
| Knowledge Sources | Source Code Analysis, API Documentation |
| Domains | Data Preprocessing, NLP Pipeline, Data Quality |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Conversation Format Validation is a data quality principle in the FastChat ShareGPT Data Pipeline that detects and removes conversations containing malformed numbered lists. Specifically, it targets a known artifact where GPT responses contain duplicate "1." list item markers without proper sequential numbering, indicating a corrupted or incorrectly rendered response.
Description
The Problem: Malformed Numbered Lists
A recurring quality issue in ShareGPT-sourced data involves numbered lists where multiple list items are incorrectly numbered as "1." instead of following sequential numbering (1, 2, 3, ...). This typically occurs when:
- The HTML-to-Markdown conversion introduces rendering artifacts in ordered lists
- The original ChatGPT response contained a formatting error
- Browser rendering inconsistencies caused the exported HTML to flatten list numbering
For example, a malformed response might contain:
Here are some tips:
1. First tip about something
1. Second tip that should be numbered 2
Instead of the correct:
Here are some tips:
1. First tip about something
2. Second tip about something else
Regex-Based Pattern Matching
The detection mechanism uses a compiled regular expression pattern: \n1\. [^2]*\n1\.
This pattern matches text where:
- A newline followed by "1. " introduces a list item
- The content of that item does not contain the character "2" (ensuring we are not looking at a properly numbered list where "2." appears shortly after)
- Another newline followed by "1. " appears, indicating a duplicate first-item marker
The regex check [^2]* between the two "1." markers serves as a heuristic: in a correctly formatted list, the text between "1." and the next numbered item would typically contain "2." (the next sequential number). The absence of "2" in this span strongly suggests the numbering is broken.
Conversation-Level Filtering
The filter examines all turns in a conversation (both human and gpt messages). If any turn matches the malformed list pattern, the entire conversation is excluded. This conservative approach ensures that training data does not contain any formatting artifacts that could teach the model to produce similarly malformed outputs.
Usage
In the standard FastChat pipeline, format validation is the fourth step, applied after conversation splitting:
python3 -m fastchat.data.filter_wrong_format \
--in sharegpt_clean_lang_split.json \
--out sharegpt_clean_lang_split.json
Note that in the default pipeline configuration (from prepare_all.py), the output file overwrites the input file. This is intentional -- the format filter is a pass-through that only removes problematic entries without changing any data structure.
Theoretical Basis
Format validation in training data is grounded in the principle of garbage in, garbage out:
- Output format learning: Language models learn formatting patterns from their training data. If the training set contains malformed numbered lists, the model will reproduce these errors in its outputs, particularly when asked to generate lists.
- Consistency over quantity: Removing a small number of malformed conversations has a negligible impact on dataset size but significantly improves the quality of list-formatted outputs. The trade-off strongly favors filtering.
- Pattern specificity: The chosen regex pattern is deliberately narrow -- it targets only the specific "duplicate 1." artifact rather than attempting to validate all possible list formats. This minimizes false positives while catching the most common formatting error observed in the ShareGPT dataset.
- Heuristic validation: The
[^2]*component is a heuristic rather than a complete parser. It works well in practice because correctly numbered lists almost always have "2" appearing between consecutive "1." markers (as part of "2."), while malformed lists repeat "1." without any "2" in between.
Related Pages
- Implementation:Lm_sys_FastChat_Filter_Wrong_Format
- Implementation:Lm_sys_FastChat_Filter_Wrong_Format -- The implementation that realizes this principle
- Principle:Lm_sys_FastChat_Long_Conversation_Splitting -- Previous pipeline stage: conversation splitting
- Principle:Lm_sys_FastChat_Train_Test_Data_Splitting -- Next pipeline stage: train/test splitting