Implementation:Lm sys FastChat Clean ShareGPT
| Field | Value |
|---|---|
| Page Type | Implementation |
| Title | Clean ShareGPT |
| Repository | lm-sys/FastChat |
| Knowledge Sources | Source Code Analysis, API Documentation |
| Domains | Data Preprocessing, NLP Pipeline, HTML Parsing |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Clean ShareGPT is the implementation module that converts raw HTML-formatted ShareGPT conversations into clean Markdown text suitable for language model fine-tuning. It handles HTML-to-Markdown conversion, code block reformatting, filtering of blocked words and blocked responses, deduplication (by ID and by the first human/gpt message pair), role alternation enforcement, and parallel processing via ProcessPoolExecutor.
Description
This module implements the first stage of the FastChat ShareGPT data pipeline. It reads a JSON file containing raw ShareGPT conversations with HTML-encoded content, processes each conversation in parallel to convert HTML to Markdown, filters out invalid or blocked conversations, removes duplicates, and writes the cleaned result to an output JSON file.
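The deduplication step can be sketched as follows. This is a minimal illustration rather than the module's own code: dedup_conversations is a hypothetical helper name, and the sketch assumes every sample has an "id" and at least two conversation turns.

```python
def dedup_conversations(samples):
    """Drop samples whose ID or opening exchange was already seen.

    Hypothetical helper for illustration; assumes every sample has an "id"
    and at least two conversation turns.
    """
    seen_ids = set()
    seen_openings = set()
    kept = []
    for sample in samples:
        sample_id = sample["id"]
        turns = sample["conversations"]
        # The content key is the first human message plus the first gpt reply.
        opening = (turns[0]["value"], turns[1]["value"])
        if sample_id in seen_ids or opening in seen_openings:
            continue
        seen_ids.add(sample_id)
        seen_openings.add(opening)
        kept.append(sample)
    return kept


def make_sample(sample_id, question, answer):
    """Build a two-turn sample in the ShareGPT JSON shape."""
    return {
        "id": sample_id,
        "conversations": [
            {"from": "human", "value": question},
            {"from": "gpt", "value": answer},
        ],
    }


samples = [
    make_sample("a", "Hi", "Hello!"),
    make_sample("a", "Other", "Text"),  # same ID as the first sample
    make_sample("b", "Hi", "Hello!"),   # same opening exchange as the first
]
print([s["id"] for s in dedup_conversations(samples)])
```

Only the first sample survives here: the second repeats its ID and the third repeats its opening exchange.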
The module defines several regex patterns at module scope for detecting and reformatting code blocks, removing ShareGPT UI artifacts (like "Copy code" buttons and regeneration counters), and stripping unnecessary HTML container elements.
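As a rough sketch of what such patterns can look like, the example below strips container tags, a leading regeneration counter, and "Copy code" button text. The pattern names, the expressions, and the strip_ui_artifacts helper are assumptions made for illustration; they are not copies of the module's definitions.

```python
import re

FENCE = "`" * 3  # Markdown code fence, built here to keep this listing readable

# Illustrative patterns only: names and expressions are assumptions.
CONTAINER_TAG = re.compile(r"</?(?:div|span)\b[^>]*>")  # <div ...>, </span>, ...
REGEN_COUNTER = re.compile(r"^\d+ / \d+")                # e.g. "2 / 2" at the start
COPY_CODE = re.compile(FENCE + r"(\w*)Copy code")        # language label fused with button text


def strip_ui_artifacts(text):
    """Remove a leading regeneration counter and 'Copy code' residue."""
    text = REGEN_COUNTER.sub("", text)
    # Keep the language label, drop the button text, start the code on a new line.
    text = COPY_CODE.sub(FENCE + r"\1\n", text)
    return text.strip()


raw = "2 / 2Here is the code:\n" + FENCE + "pythonCopy codeprint('hi')\n" + FENCE
print(strip_ui_artifacts(raw))
print(CONTAINER_TAG.sub("", "<div class='x'><span>text</span></div>"))
```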
Usage
CLI Invocation
python3 -m fastchat.data.clean_sharegpt --in sharegpt_html.json --out sharegpt_clean.json
CLI Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| --in-file | str | Yes | -- | Path to input JSON file with raw HTML conversations |
| --out-file | str | No | sharegpt_clean.json | Path to output cleaned JSON file |
| --begin | int | No | None | Start index for slicing input content |
| --end | int | No | None | End index for slicing input content |
| --debug | flag | No | False | Enable debug mode |
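The table can be mirrored with a small argparse parser. This is a sketch reconstructed from the table, not the module's own parser definition, and build_parser is a hypothetical name. It also shows why a short form such as --in works for --in-file: argparse accepts unambiguous prefixes of long options by default.

```python
import argparse


def build_parser():
    """Argument parser mirroring the parameter table (illustrative sketch)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--in-file", type=str, required=True)
    parser.add_argument("--out-file", type=str, default="sharegpt_clean.json")
    parser.add_argument("--begin", type=int)
    parser.add_argument("--end", type=int)
    parser.add_argument("--debug", action="store_true")
    return parser


# argparse resolves the unambiguous prefix --in to --in-file.
args = build_parser().parse_args(["--in", "sharegpt_html.json", "--end", "100"])
print(args.in_file, args.out_file, args.begin, args.end, args.debug)
```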
Programmatic Import
from fastchat.data.clean_sharegpt import clean_html_all
Code Reference
Source Location
| Item | Location |
|---|---|
| Module | fastchat/data/clean_sharegpt.py |
| Core Functions | Lines 86-215 |
| Repository | github.com/lm-sys/FastChat |
Function Signatures
def clean_html_all(content, begin, end) -> list[dict]:
    """
    Clean the source html files.
    Processes all conversations in parallel using ProcessPoolExecutor,
    then deduplicates by ID and by content (first human + first gpt value).
    Returns a list of cleaned conversation dicts.
    """

def clean_html_one_sample(sample) -> tuple[dict, int]:
    """
    Clean one conversation sample. Converts HTML to markdown,
    filters blocked words/responses, validates role alternation.
    Returns (sample, error_code) where error_code:
        0 = success
        1 = too short / empty
        2 = wrong role format
        3 = blocked words or blocked response
        4 = parser error (bs4/markdownify)
    """

def html_to_markdown(val: str) -> str:
    """
    Convert HTML string to Markdown.
    Strips <div> and <span> tags, applies markdownify, reformats code blocks,
    removes ShareGPT UI artifacts (regeneration counters, 'Copy code' noise).
    """

def reformat_code(val: str) -> str:
    """
    Reformat code blocks from ShareGPT's HTML export format
    into standard Markdown fenced code blocks.
    """

def contain_blocked_words(val: str) -> bool:
    """
    Returns True if val contains 'openai' or 'chatgpt' (case-insensitive).
    """

def contain_blocked_responses(role: str, val: str) -> bool:
    """
    Returns True if role is 'gpt' and val starts with a known
    blocked response (e.g., rate-limit error messages).
    """
Import
from fastchat.data.clean_sharegpt import clean_html_all
from fastchat.data.clean_sharegpt import clean_html_one_sample
from fastchat.data.clean_sharegpt import html_to_markdown
I/O Contract
Inputs
| Input | Type | Description |
|---|---|---|
| in_file | JSON file | Raw ShareGPT JSON: a list of dicts, each with "id" (str) and "conversations" (list of {"from": str, "value": str}). The "value" fields contain raw HTML content. |
Example input structure:
[
  {
    "id": "abc123",
    "conversations": [
      {"from": "human", "value": "<div><p>Hello, how are you?</p></div>"},
      {"from": "gpt", "value": "<div><p>I'm doing well!</p></div>"}
    ]
  }
]
Outputs
| Output | Type | Description |
|---|---|---|
| out_file | JSON file | Cleaned JSON: same structure but with HTML converted to Markdown, duplicates removed, blocked content filtered, and invalid role sequences discarded. |
Example output structure:
[
  {
    "id": "abc123",
    "conversations": [
      {"from": "human", "value": "Hello, how are you?"},
      {"from": "gpt", "value": "I'm doing well!"}
    ]
  }
]
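The filtering behind this output (blocked content removed, invalid role sequences discarded) can be sketched with the error codes documented for clean_html_one_sample. The sketch is deliberately simplified: classify and has_blocked_word are hypothetical helpers, and HTML conversion, blocked responses, and the parser-error case (code 4) are left out.

```python
BLOCKED_WORDS = ("openai", "chatgpt")
ROLES = ("human", "gpt")

# Error codes as listed in the clean_html_one_sample docstring.
OK, TOO_SHORT, WRONG_ROLE, BLOCKED = 0, 1, 2, 3


def has_blocked_word(text):
    """Case-insensitive substring check against the blocked words."""
    lowered = text.lower()
    return any(word in lowered for word in BLOCKED_WORDS)


def classify(turns):
    """Return an error code for a list of {"from", "value"} turns.

    Simplified illustration, not the module's implementation.
    """
    if len(turns) <= 1:
        return TOO_SHORT
    for index, turn in enumerate(turns):
        # Turns must alternate human, gpt, human, gpt, ...
        if turn["from"] != ROLES[index % 2]:
            return WRONG_ROLE
        if has_blocked_word(turn["value"]):
            return BLOCKED
    return OK


good = [{"from": "human", "value": "Hi"}, {"from": "gpt", "value": "Hello!"}]
bad_roles = [{"from": "human", "value": "Hi"}, {"from": "human", "value": "Again"}]
blocked = [{"from": "human", "value": "Hi"}, {"from": "gpt", "value": "As ChatGPT, I cannot."}]
print(classify(good), classify(bad_roles), classify(blocked), classify(good[:1]))
```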
Dependencies
| Package | Version | Purpose |
|---|---|---|
| bs4 (BeautifulSoup) | -- | HTML parsing |
| markdownify | 0.11.6 | HTML-to-Markdown conversion |
| tqdm | -- | Progress bar for parallel processing |
Usage Examples
Pipeline Usage (as part of prepare_all.py)
python3 -m fastchat.data.clean_sharegpt \
    --in ~/datasets/sharegpt_20230521_html.json \
    --out ~/datasets/sharegpt_20230521_4k_clean.json
Programmatic Usage
import json
from fastchat.data.clean_sharegpt import clean_html_all

# Context managers close both files even if cleaning raises.
with open("sharegpt_html.json", "r") as f:
    content = json.load(f)
cleaned = clean_html_all(content, begin=0, end=len(content))
with open("sharegpt_clean.json", "w") as f:
    json.dump(cleaned, f, indent=2, ensure_ascii=False)
Single Sample Cleaning
from fastchat.data.clean_sharegpt import clean_html_one_sample

sample = {
    "id": "test_001",
    "conversations": [
        {"from": "human", "value": "<div>What is Python?</div>"},
        {"from": "gpt", "value": "<div><p>Python is a programming language.</p></div>"}
    ]
}
cleaned_sample, error_code = clean_html_one_sample(sample)
if error_code == 0:
    print("Cleaned successfully:", cleaned_sample)
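When cleaning many samples one at a time, it can help to tally the returned error codes. The sketch below uses stand-in results so that it runs without FastChat installed; summarize and ERROR_LABELS are hypothetical names, and the labels simply restate the error codes documented for clean_html_one_sample.

```python
from collections import Counter

# Labels follow the error codes documented for clean_html_one_sample.
ERROR_LABELS = {
    0: "success",
    1: "too short / empty",
    2: "wrong role format",
    3: "blocked words or blocked response",
    4: "parser error",
}


def summarize(results):
    """Count (sample, error_code) pairs by label. Hypothetical helper."""
    return Counter(ERROR_LABELS[code] for _, code in results)


# Stand-in results; in practice each pair would come from clean_html_one_sample.
results = [({"id": "a"}, 0), ({"id": "b"}, 3), ({"id": "c"}, 0), ({"id": "d"}, 1)]
for label, count in summarize(results).most_common():
    print(f"{label}: {count}")
```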
Related Pages
- Principle:Lm_sys_FastChat_ShareGPT_HTML_Cleaning -- The principle that this implementation realizes
- Implementation:Lm_sys_FastChat_Optional_Clean -- Next pipeline step: language-based filtering
- Implementation:Lm_sys_FastChat_Split_Long_Conversation -- Downstream: conversation splitting