
Principle: lm-sys/FastChat ShareGPT HTML Cleaning

From Leeroopedia


Page Type: Principle
Title: ShareGPT HTML Cleaning
Repository: lm-sys/FastChat
Knowledge Sources: Source Code Analysis, API Documentation
Domains: Data Preprocessing, NLP Pipeline, HTML Parsing
Last Updated: 2026-02-07 14:00 GMT

Overview

ShareGPT HTML Cleaning is the foundational data preprocessing principle in the FastChat ShareGPT Data Pipeline. Raw conversations scraped from ShareGPT contain HTML markup that must be converted to clean markdown before the data can be used for language model fine-tuning. This principle governs the entire first stage of the pipeline: HTML-to-markdown conversion, code block preservation, blocked word filtering, duplicate removal, and role alternation validation.

Description

The ShareGPT dataset consists of user-shared ChatGPT conversations exported as HTML. Before these conversations can serve as training data for models like Vicuna, the HTML content must undergo rigorous cleaning. This principle encompasses several sub-concerns:

HTML-to-Markdown Conversion

Raw conversation values contain HTML container tags such as <div> and <span>. The cleaning process strips these container elements first (since they interfere with code block indentation and underscore rendering), then uses BeautifulSoup (bs4) to parse the remaining HTML and markdownify to convert it into well-formed Markdown text.
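The two-step conversion can be sketched as follows. This is a minimal illustration, not the exact FastChat implementation: the function names are hypothetical, and it assumes the third-party `beautifulsoup4` and `markdownify` packages.

```python
import re

# Container tags are stripped first: left in place, they would
# corrupt code-block indentation and underscore rendering after
# conversion to Markdown.
div_pattern = re.compile(r"</?div[^>]*>")
span_pattern = re.compile(r"</?span[^>]*>")

def strip_containers(val: str) -> str:
    """Remove <div>/<span> open and close tags, keeping their contents."""
    val = div_pattern.sub("", val)
    return span_pattern.sub("", val)

def html_to_markdown(val: str) -> str:
    """Parse the remaining HTML and convert it to Markdown."""
    # Third-party dependencies, imported lazily:
    #   pip install beautifulsoup4 markdownify
    from bs4 import BeautifulSoup
    from markdownify import MarkdownConverter

    soup = BeautifulSoup(strip_containers(val), features="html.parser")
    return MarkdownConverter().convert_soup(soup)
```

Stripping the containers with a regex before parsing keeps the tag removal independent of how BeautifulSoup normalizes the tree.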

Code Block Preservation

ShareGPT's HTML export format encodes code blocks in a non-standard way, with language identifiers concatenated with "Copy code" strings. The cleaning process uses regex-based pattern matching to detect these malformed code blocks and reformat them into standard Markdown fenced code blocks (triple backtick syntax with language annotations). Patterns like ```\s*(.*?)(?:Copy code)+(.+?)\s*?``` are identified and reformatted to preserve both the programming language label and the code content.
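Using the pattern quoted above, the repair step can be sketched as a single regex substitution (a minimal sketch; the function name is illustrative):

```python
import re

# A fence, an optional language label, one or more "Copy code"
# artifacts from the ShareGPT export, then the code itself.
code_lang_pattern = re.compile(
    r"```\s*(.*?)(?:Copy code)+(.+?)\s*?```", re.DOTALL
)
# Rebuild a standard fenced block: ```<lang>\n<code>\n```
code_lang_format = r"```\g<1>\n\g<2>\n```"

def reformat_code(val: str) -> str:
    """Rewrite malformed ShareGPT code blocks as standard fenced blocks."""
    return code_lang_pattern.sub(code_lang_format, val)
```

The `re.DOTALL` flag lets the code group span multiple lines, and the lazy quantifiers keep each substitution confined to one code block.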

Blocked Word Filtering

Conversations containing references to "openai" or "chatgpt" (case-insensitive) are filtered out entirely. This prevents the fine-tuned model from learning to identify itself as ChatGPT or reference OpenAI. Additionally, blocked responses from GPT (such as rate-limit error messages like "Too many requests in 1 hour") are detected and cause the entire conversation to be excluded.
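The word filter itself is a straightforward case-insensitive substring check, sketched here with illustrative names:

```python
# Case-insensitive substring match anywhere in the conversation text.
BLOCKED_WORDS = ["openai", "chatgpt"]

def contains_blocked_words(text: str) -> bool:
    """Return True if the text mentions any blocked word."""
    lowered = text.lower()
    return any(word in lowered for word in BLOCKED_WORDS)
```

A single hit anywhere in a conversation is enough to exclude the entire conversation, not just the offending turn.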

Duplicate Removal

Deduplication operates at two levels:

  • ID-based deduplication: Conversations with the same id field are detected and only the first occurrence is kept.
  • Content-based deduplication: Conversations whose first human message and first GPT response form an identical key pair are treated as value duplicates and removed.

Role Alternation Validation

Valid training conversations must strictly alternate between "human" and "gpt" roles. The cleaning process enforces this by:

  • Trimming a leading non-human turn (if the first message is from "gpt", it is removed).
  • Trimming a trailing human turn (if the last message is from "human", it is removed).
  • Verifying that the remaining turns strictly alternate roles; any conversation that fails this check is discarded.
  • Ensuring the final conversation has an even number of turns (complete human/gpt pairs).

Parallel Processing

The cleaning operation uses Python's ProcessPoolExecutor to parallelize per-conversation HTML cleaning across all available CPU cores. Each conversation is cleaned independently, and the results are aggregated for deduplication in the main process.

Usage

This principle is applied as the first step in the ShareGPT data pipeline, as orchestrated by fastchat/data/prepare_all.py. The pipeline command is:

python3 -m fastchat.data.clean_sharegpt --in sharegpt_html.json --out sharegpt_clean.json

The principle ensures that downstream pipeline stages (language filtering, conversation splitting, format validation) receive consistently structured, clean Markdown conversations free of HTML artifacts, duplicates, and blocked content.

Theoretical Basis

The need for HTML cleaning in NLP training data pipelines arises from the fundamental mismatch between web-scraped data formats and model training requirements. Language models trained on raw HTML learn to reproduce markup artifacts, degrading output quality. The specific design choices in this principle are grounded in:

  • Data quality filtering: Removing blocked words prevents identity confusion in the fine-tuned model (a known issue in instruction-tuned LLMs).
  • Deduplication: Duplicate training examples cause the model to overfit on specific patterns, reducing generalization. Both ID-level and content-level deduplication address different sources of redundancy.
  • Format normalization: Converting HTML to Markdown provides a consistent, human-readable text format that aligns with the tokenization expectations of transformer-based language models.
  • Structural validation: Enforcing strict role alternation ensures that the model learns proper conversational turn-taking, which is critical for chat-based fine-tuning objectives.
