Implementation:Lm sys FastChat Clean ShareGPT

Field	Value
Page Type	Implementation
Title	Clean ShareGPT
Repository	lm-sys/FastChat
Knowledge Sources	Source Code Analysis, API Documentation
Domains	Data Preprocessing, NLP Pipeline, HTML Parsing
Last Updated	2026-02-07 14:00 GMT

Overview

Clean ShareGPT is the implementation module that converts raw HTML-formatted ShareGPT conversations into clean Markdown text suitable for language model fine-tuning. It handles HTML-to-Markdown conversion, code block reformatting, blocked word and response filtering, deduplication (by conversation ID and by the first human and gpt message values), role alternation enforcement, and parallel processing via ProcessPoolExecutor.

Description

This module implements the first stage of the FastChat ShareGPT data pipeline. It reads a JSON file containing raw ShareGPT conversations with HTML-encoded content, processes each conversation in parallel to convert HTML to Markdown, filters out invalid or blocked conversations, removes duplicates, and writes the cleaned result to an output JSON file.
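The deduplication step can be sketched as follows. This is a minimal illustration rather than the module's own code: the helper name `dedupe_conversations` is hypothetical, and it assumes every conversation already has at least two turns, with content identity defined by the first two message values.

```python
def dedupe_conversations(samples):
    """Drop samples whose id, or whose first two message values, were already seen.

    Hypothetical helper for illustration; assumes every sample has >= 2 turns.
    """
    seen_ids = set()
    seen_keys = set()
    kept = []
    for sample in samples:
        convs = sample["conversations"]
        # Content key: (first human value, first gpt value).
        key = (convs[0]["value"], convs[1]["value"])
        if sample["id"] in seen_ids or key in seen_keys:
            continue
        seen_ids.add(sample["id"])
        seen_keys.add(key)
        kept.append(sample)
    return kept
```

Keeping two separate sets means a re-uploaded conversation with a fresh ID is still caught by its content key.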

The module defines several regex patterns at module scope for detecting and reformatting code blocks, removing ShareGPT UI artifacts (like "Copy code" buttons and regeneration counters), and stripping unnecessary HTML container elements.
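To make the regex-based cleanup concrete, here is a small sketch using only the standard library. The pattern names and the helper `tidy_markdown` are hypothetical, and the module's actual regexes may differ; the sketch assumes the converted text carries a fence whose first line reads `<language>Copy code<code>` and may start with a regeneration counter such as `2 / 2`.

```python
import re

FENCE = "`" * 3  # a Markdown code fence, built here to keep this example readable

# Hypothetical patterns for illustration; the module's own regexes may differ.
# Matches a fence whose content starts with "<language>Copy code<code...>".
CODE_BLOCK = re.compile(
    FENCE + r"\s*(.*?)(?:Copy code)+(.+?)\s*?" + FENCE, re.DOTALL
)
# Matches a regeneration counter such as "2 / 3" at the very start of a message.
REGEN_COUNTER = re.compile(r"^\d+ / \d+")


def tidy_markdown(text):
    """Rewrite 'Copy code' fences as standard fences and drop a leading counter."""
    text = CODE_BLOCK.sub(
        lambda m: f"{FENCE}{m.group(1)}\n{m.group(2)}\n{FENCE}", text
    )
    text = REGEN_COUNTER.sub("", text)
    return text.strip()
```

Compiling the patterns once at module scope, as the page describes, avoids recompiling them for every message.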

Usage

CLI Invocation

python3 -m fastchat.data.clean_sharegpt --in-file sharegpt_html.json --out-file sharegpt_clean.json

CLI Parameters

Parameter	Type	Required	Default	Description
`--in-file`	str	Yes	--	Path to input JSON file with raw HTML conversations
`--out-file`	str	No	`sharegpt_clean.json`	Path to output cleaned JSON file
`--begin`	int	No	None	Start index for slicing input content
`--end`	int	No	None	End index for slicing input content
`--debug`	flag	No	False	Enable debug mode
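The parameter table maps naturally onto a standard `argparse` parser. The sketch below assumes that mapping and is not copied from the module; `build_parser` is a hypothetical name. It also shows why a shortened flag such as `--in` can stand in for `--in-file`: argparse accepts unambiguous prefixes of long options by default.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the parameter table above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--in-file", type=str, required=True)
    parser.add_argument("--out-file", type=str, default="sharegpt_clean.json")
    parser.add_argument("--begin", type=int)
    parser.add_argument("--end", type=int)
    parser.add_argument("--debug", action="store_true")
    return parser


# argparse resolves unambiguous prefixes, so "--in" is read as "--in-file".
args = build_parser().parse_args(["--in", "sharegpt_html.json", "--end", "100"])
print(args.in_file, args.out_file, args.begin, args.end, args.debug)
```

Spelling the flags out in full is the safer habit in scripts, since a prefix can become ambiguous if a new option is added later.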

Programmatic Import

from fastchat.data.clean_sharegpt import clean_html_all

Code Reference

Source Location

Item	Location
Module	`fastchat/data/clean_sharegpt.py`
Core Functions	Lines 86-215
Repository	github.com/lm-sys/FastChat

Function Signatures

def clean_html_all(content, begin, end) -> list[dict]:
    """
    Clean the source html files.
    Processes all conversations in parallel using ProcessPoolExecutor,
    then deduplicates by ID and by content (first human + first gpt value).
    Returns a list of cleaned conversation dicts.
    """

def clean_html_one_sample(sample) -> tuple[dict, int]:
    """
    Clean one conversation sample. Converts HTML to markdown,
    filters blocked words/responses, validates role alternation.
    Returns (sample, error_code) where error_code:
      0 = success
      1 = too short / empty
      2 = wrong role format
      3 = blocked words or blocked response
      4 = parser error (bs4/markdownify)
    """

def html_to_markdown(val: str) -> str:
    """
    Convert HTML string to Markdown.
    Strips <div> and <span> tags, applies markdownify, reformats code blocks,
    removes ShareGPT UI artifacts (regeneration counters, 'Copy code' noise).
    """

def reformat_code(val: str) -> str:
    """
    Reformat code blocks from ShareGPT's HTML export format
    into standard Markdown fenced code blocks.
    """

def contain_blocked_words(val: str) -> bool:
    """
    Returns True if val contains 'openai' or 'chatgpt' (case-insensitive).
    """

def contain_blocked_responses(role: str, val: str) -> bool:
    """
    Returns True if role is 'gpt' and val starts with a known
    blocked response (e.g., rate-limit error messages).
    """

Import

from fastchat.data.clean_sharegpt import clean_html_all
from fastchat.data.clean_sharegpt import clean_html_one_sample
from fastchat.data.clean_sharegpt import html_to_markdown
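As a rough illustration of the role-alternation and blocked-word checks described in the function signatures above, the following sketch returns the matching error code for the first problem it finds. The helper `first_problem` is hypothetical; the real function also performs HTML conversion and length checks that are omitted here.

```python
ROLES = ("human", "gpt")
BLOCKED_WORDS = ("openai", "chatgpt")  # taken from the docstring above


def first_problem(conversations):
    """Return an error code for the first problem found, or 0 if none.

    Hypothetical helper; uses the documented codes (2 = wrong role, 3 = blocked).
    """
    for i, turn in enumerate(conversations):
        # Turn 0 must be human, turn 1 gpt, turn 2 human, and so on.
        if turn["from"] != ROLES[i % 2]:
            return 2
        if any(word in turn["value"].lower() for word in BLOCKED_WORDS):
            return 3
    return 0
```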

I/O Contract

Inputs

Input	Type	Description
in_file	JSON file	Raw ShareGPT JSON: a list of dicts, each with `"id"` (str) and `"conversations"` (list of `{"from": str, "value": str}`). The `"value"` fields contain raw HTML content.

Example input structure:

[
  {
    "id": "abc123",
    "conversations": [
      {"from": "human", "value": "<div><p>Hello, how are you?</p></div>"},
      {"from": "gpt", "value": "<div><p>I'm doing well!</p></div>"}
    ]
  }
]

Outputs

Output	Type	Description
out_file	JSON file	Cleaned JSON: same structure but with HTML converted to Markdown, duplicates removed, blocked content filtered, and invalid role sequences discarded.

Example output structure:

[
  {
    "id": "abc123",
    "conversations": [
      {"from": "human", "value": "Hello, how are you?"},
      {"from": "gpt", "value": "I'm doing well!"}
    ]
  }
]

Dependencies

Package	Version	Purpose
bs4 (BeautifulSoup)	--	HTML parsing
markdownify	0.11.6	HTML-to-Markdown conversion
tqdm	--	Progress bar for parallel processing

Usage Examples

Pipeline Usage (as part of prepare_all.py)

python3 -m fastchat.data.clean_sharegpt \
    --in ~/datasets/sharegpt_20230521_html.json \
    --out ~/datasets/sharegpt_20230521_4k_clean.json

Programmatic Usage

import json
from fastchat.data.clean_sharegpt import clean_html_all

# Guard the entry point: clean_html_all starts worker processes.
if __name__ == "__main__":
    with open("sharegpt_html.json", "r") as f:
        content = json.load(f)
    cleaned = clean_html_all(content, begin=0, end=len(content))
    with open("sharegpt_clean.json", "w") as f:
        json.dump(cleaned, f, indent=2, ensure_ascii=False)

Single Sample Cleaning

from fastchat.data.clean_sharegpt import clean_html_one_sample

sample = {
    "id": "test_001",
    "conversations": [
        {"from": "human", "value": "<div>What is Python?</div>"},
        {"from": "gpt", "value": "<div><p>Python is a programming language.</p></div>"}
    ]
}
cleaned_sample, error_code = clean_html_one_sample(sample)
if error_code == 0:
    print("Cleaned successfully:", cleaned_sample)
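When cleaning many samples one by one, it can help to tally the returned error codes. The sketch below is an assumption-laden convenience, not part of the module: `ERROR_LABELS` and `summarize` are hypothetical names, and the labels simply restate the error codes documented for `clean_html_one_sample`.

```python
from collections import Counter

# Labels follow the error codes documented for clean_html_one_sample.
ERROR_LABELS = {
    0: "ok",
    1: "too short / empty",
    2: "wrong role format",
    3: "blocked words or response",
    4: "parser error",
}


def summarize(results):
    """Count outcomes from an iterable of (sample, error_code) tuples."""
    return Counter(ERROR_LABELS.get(code, "unknown") for _, code in results)
```

In practice `results` would be built from calls such as `clean_html_one_sample(sample)` over a list of samples.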

Related Pages

Principle:Lm_sys_FastChat_ShareGPT_HTML_Cleaning -- The principle that this implementation realizes
Implementation:Lm_sys_FastChat_Optional_Clean -- Next pipeline step: language-based filtering
Implementation:Lm_sys_FastChat_Split_Long_Conversation -- Downstream: conversation splitting
