
Implementation:Lm sys FastChat Clean ShareGPT



| Field | Value |
| --- | --- |
| Page Type | Implementation |
| Title | Clean ShareGPT |
| Repository | lm-sys/FastChat |
| Knowledge Sources | Source Code Analysis, API Documentation |
| Domains | Data Preprocessing, NLP Pipeline, HTML Parsing |
| Last Updated | 2026-02-07 14:00 GMT |

Overview

Clean ShareGPT is the FastChat module that converts raw, HTML-formatted ShareGPT conversations into clean Markdown text suitable for language model fine-tuning. It handles HTML-to-Markdown conversion, code block reformatting, filtering of blocked words and blocked responses, deduplication (by conversation ID and by the content of the first human and first gpt turns), enforcement of role alternation, and parallel processing via ProcessPoolExecutor.

Description

This module implements the first stage of the FastChat ShareGPT data pipeline. It reads a JSON file containing raw ShareGPT conversations with HTML-encoded content, processes each conversation in parallel to convert HTML to Markdown, filters out invalid or blocked conversations, removes duplicates, and writes the cleaned result to an output JSON file.
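
The deduplication step can be pictured with a short sketch. It assumes a conversation counts as a duplicate when either its id or the pair of its first two turn values has already been seen; the helper name `dedup_conversations` is hypothetical and is not part of the module, whose bookkeeping may differ.

```python
def dedup_conversations(samples):
    """Drop samples whose id or (first turn, second turn) text was already seen.

    Hypothetical helper for illustration; not the FastChat implementation.
    """
    seen_ids = set()
    seen_content = set()
    kept = []
    for sample in samples:
        convs = sample["conversations"]
        if len(convs) < 2:
            continue  # nothing to key on; assume such samples were filtered earlier
        content_key = (convs[0]["value"], convs[1]["value"])
        if sample["id"] in seen_ids or content_key in seen_content:
            continue  # duplicate by id or by content
        seen_ids.add(sample["id"])
        seen_content.add(content_key)
        kept.append(sample)
    return kept
```

Keeping the first occurrence and dropping later ones preserves the input order, which makes the output reproducible for a given input file.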

The module defines several regex patterns at module scope for detecting and reformatting code blocks, removing ShareGPT UI artifacts (like "Copy code" buttons and regeneration counters), and stripping unnecessary HTML container elements.
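
To make the role of these patterns concrete, here is a minimal sketch using only the standard library. The pattern definitions and the helper names `strip_containers` and `tidy_markdown` are assumptions written for illustration; the patterns in `clean_sharegpt.py` may differ in detail.

```python
import re

# Illustrative patterns (assumptions, not copied from the module).
# Opening or closing <div>/<span> container tags.
CONTAINER_TAG = re.compile(r"</?(?:div|span)\b[^>]*>")
# A fenced block of the form "<language>Copy code<code...>".
CODE_BLOCK = re.compile(r"```\s*(.*?)(?:Copy code)+(.+?)\s*```", re.DOTALL)
# A leading regeneration counter such as "2 / 3".
REGEN_COUNTER = re.compile(r"^\d+ / \d+")


def strip_containers(html: str) -> str:
    """Remove <div> and <span> tags while keeping the text inside them."""
    return CONTAINER_TAG.sub("", html)


def tidy_markdown(text: str) -> str:
    """Rewrite 'Copy code' blocks as fenced blocks and drop a leading counter."""
    text = CODE_BLOCK.sub(r"```\g<1>\n\g<2>\n```", text)
    text = REGEN_COUNTER.sub("", text)
    return text.strip()
```

Compiling the patterns once at module scope avoids recompiling them for every conversation turn, which matters when the same functions run across a large dump.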

Usage

CLI Invocation

python3 -m fastchat.data.clean_sharegpt --in-file sharegpt_html.json --out-file sharegpt_clean.json

CLI Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| --in-file | str | Yes | -- | Path to input JSON file with raw HTML conversations |
| --out-file | str | No | sharegpt_clean.json | Path to output cleaned JSON file |
| --begin | int | No | None | Start index for slicing input content |
| --end | int | No | None | End index for slicing input content |
| --debug | flag | No | False | Enable debug mode |
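
The parameter table maps naturally onto an `argparse` parser. The following is a sketch built only from the table above, not a copy of the module's own argument handling; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the parameter table (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="Clean raw ShareGPT HTML conversations")
    parser.add_argument("--in-file", type=str, required=True)
    parser.add_argument("--out-file", type=str, default="sharegpt_clean.json")
    parser.add_argument("--begin", type=int, default=None)
    parser.add_argument("--end", type=int, default=None)
    parser.add_argument("--debug", action="store_true")
    return parser


args = build_parser().parse_args(
    ["--in-file", "sharegpt_html.json", "--begin", "0", "--end", "100"]
)
# Slicing a list with None bounds keeps the whole list, so leaving out
# --begin and --end means every conversation is processed.
```

With a parser like this one, `argparse` accepts unambiguous prefixes by default, so a shortened flag such as `--in` resolves to `--in-file`.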

Programmatic Import

from fastchat.data.clean_sharegpt import clean_html_all

Code Reference

Source Location

| Item | Location |
| --- | --- |
| Module | fastchat/data/clean_sharegpt.py |
| Core Functions | Lines 86-215 |
| Repository | github.com/lm-sys/FastChat |

Function Signatures

def clean_html_all(content, begin, end) -> list[dict]:
    """
    Clean the source html files.
    Processes all conversations in parallel using ProcessPoolExecutor,
    then deduplicates by ID and by content (first human + first gpt value).
    Returns a list of cleaned conversation dicts.
    """

def clean_html_one_sample(sample) -> tuple[dict, int]:
    """
    Clean one conversation sample. Converts HTML to markdown,
    filters blocked words/responses, validates role alternation.
    Returns (sample, error_code) where error_code:
      0 = success
      1 = too short / empty
      2 = wrong role format
      3 = blocked words or blocked response
      4 = parser error (bs4/markdownify)
    """

def html_to_markdown(val: str) -> str:
    """
    Convert HTML string to Markdown.
    Strips <div> and <span> tags, applies markdownify, reformats code blocks,
    removes ShareGPT UI artifacts (regeneration counters, 'Copy code' noise).
    """

def reformat_code(val: str) -> str:
    """
    Reformat code blocks from ShareGPT's HTML export format
    into standard Markdown fenced code blocks.
    """

def contain_blocked_words(val: str) -> bool:
    """
    Returns True if val contains 'openai' or 'chatgpt' (case-insensitive).
    """

def contain_blocked_responses(role: str, val: str) -> bool:
    """
    Returns True if role is 'gpt' and val starts with a known
    blocked response (e.g., rate-limit error messages).
    """

Import

from fastchat.data.clean_sharegpt import clean_html_all
from fastchat.data.clean_sharegpt import clean_html_one_sample
from fastchat.data.clean_sharegpt import html_to_markdown
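
The filtering rules described in the function signatures can be restated without importing FastChat. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the single entry in `BLOCKED_RESPONSES` is a placeholder for the module's actual list, and only error codes 0 to 3 are modeled because code 4 depends on the HTML parser.

```python
BLOCKED_WORDS = ("openai", "chatgpt")
# Placeholder entry; the module's actual list of blocked responses may differ.
BLOCKED_RESPONSES = ("Too many requests in 1 hour. Try again later.",)
ROLES = ("human", "gpt")


def has_blocked_word(val: str) -> bool:
    """Case-insensitive substring check against the blocked words."""
    lowered = val.lower()
    return any(word in lowered for word in BLOCKED_WORDS)


def is_blocked_response(role: str, val: str) -> bool:
    """True only for gpt turns that start with a blocked response."""
    return role == "gpt" and val.startswith(BLOCKED_RESPONSES)


def check_turns(conversations) -> int:
    """Return an error code in the style documented above (0, 1, 2 or 3)."""
    if len(conversations) <= 1:
        return 1  # too short / empty
    for i, turn in enumerate(conversations):
        if turn["from"] != ROLES[i % 2]:
            return 2  # roles must alternate human, gpt, human, ...
        if has_blocked_word(turn["value"]) or is_blocked_response(turn["from"], turn["value"]):
            return 3  # blocked words or blocked response
    return 0
```

Returning a code instead of raising lets a caller count rejections per category after all samples have been processed.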

I/O Contract

Inputs

| Input | Type | Description |
| --- | --- | --- |
| in_file | JSON file | Raw ShareGPT JSON: a list of dicts, each with "id" (str) and "conversations" (list of {"from": str, "value": str}). The "value" fields contain raw HTML content. |

Example input structure:

[
  {
    "id": "abc123",
    "conversations": [
      {"from": "human", "value": "<div><p>Hello, how are you?</p></div>"},
      {"from": "gpt", "value": "<div><p>I'm doing well!</p></div>"}
    ]
  }
]
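
Before running the cleaner on a large dump, it can help to confirm that each record matches this shape. The check below is a sketch; `is_valid_sample` is a hypothetical helper, not a function provided by FastChat.

```python
def is_valid_sample(sample) -> bool:
    """Check that one record has the shape shown in the example input."""
    if not isinstance(sample, dict) or not isinstance(sample.get("id"), str):
        return False
    conversations = sample.get("conversations")
    if not isinstance(conversations, list):
        return False
    # Every turn needs string "from" and "value" fields.
    return all(
        isinstance(turn, dict)
        and isinstance(turn.get("from"), str)
        and isinstance(turn.get("value"), str)
        for turn in conversations
    )
```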

Outputs

| Output | Type | Description |
| --- | --- | --- |
| out_file | JSON file | Cleaned JSON: same structure but with HTML converted to Markdown, duplicates removed, blocked content filtered, and invalid role sequences discarded. |

Example output structure:

[
  {
    "id": "abc123",
    "conversations": [
      {"from": "human", "value": "Hello, how are you?"},
      {"from": "gpt", "value": "I'm doing well!"}
    ]
  }
]

Dependencies

| Package | Version | Purpose |
| --- | --- | --- |
| bs4 (BeautifulSoup) | -- | HTML parsing |
| markdownify | 0.11.6 | HTML-to-Markdown conversion |
| tqdm | -- | Progress bar for parallel processing |
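
Since only markdownify carries a version in the table, a quick environment check can catch a mismatch before a long run. This sketch uses the standard-library `importlib.metadata`; the distribution name `beautifulsoup4` is the PyPI name for the `bs4` import, and `installed_version` is a hypothetical helper.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Versions taken from the dependency table; None means no version is listed.
for name, listed in {"beautifulsoup4": None, "markdownify": "0.11.6", "tqdm": None}.items():
    found = installed_version(name)
    status = "missing" if found is None else found
    note = f" (table lists {listed})" if listed and found != listed else ""
    print(f"{name}: {status}{note}")
```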

Usage Examples

Pipeline Usage (as part of prepare_all.py)

python3 -m fastchat.data.clean_sharegpt \
    --in-file ~/datasets/sharegpt_20230521_html.json \
    --out-file ~/datasets/sharegpt_20230521_4k_clean.json

Programmatic Usage

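import json
from fastchat.data.clean_sharegpt import clean_html_all

# clean_html_all uses ProcessPoolExecutor, so keep the call under a main guard
if __name__ == "__main__":
    with open("sharegpt_html.json", "r") as f:
        content = json.load(f)
    # begin/end select the slice of the input list to process
    cleaned = clean_html_all(content, begin=0, end=len(content))
    with open("sharegpt_clean.json", "w") as f:
        json.dump(cleaned, f, indent=2, ensure_ascii=False)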

Single Sample Cleaning

from fastchat.data.clean_sharegpt import clean_html_one_sample

sample = {
    "id": "test_001",
    "conversations": [
        {"from": "human", "value": "<div>What is Python?</div>"},
        {"from": "gpt", "value": "<div><p>Python is a programming language.</p></div>"}
    ]
}
cleaned_sample, error_code = clean_html_one_sample(sample)
if error_code == 0:
    print("Cleaned successfully:", cleaned_sample)
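
When many samples are cleaned one at a time, tallying the returned error codes shows why conversations were rejected. This is a sketch that works on any list of `(sample, error_code)` pairs; `ERROR_LABELS` and `summarize` are hypothetical names, with labels taken from the error codes documented for `clean_html_one_sample`.

```python
from collections import Counter

# Labels follow the error codes documented for clean_html_one_sample.
ERROR_LABELS = {
    0: "ok",
    1: "too short / empty",
    2: "wrong role format",
    3: "blocked words or response",
    4: "parser error",
}


def summarize(results):
    """Count (sample, error_code) pairs by human-readable label."""
    return Counter(ERROR_LABELS.get(code, f"unknown ({code})") for _, code in results)
```

For example, `summarize([clean_html_one_sample(s) for s in samples])` would give a per-category count for a list `samples` of raw records.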
