Principle:Lm_sys_FastChat_ShareGPT_HTML_Cleaning
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | ShareGPT HTML Cleaning |
| Repository | lm-sys/FastChat |
| Knowledge Sources | Source Code Analysis, API Documentation |
| Domains | Data Preprocessing, NLP Pipeline, HTML Parsing |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
ShareGPT HTML Cleaning is the foundational data preprocessing principle in the FastChat ShareGPT Data Pipeline. Raw conversations scraped from ShareGPT contain HTML markup that must be converted to clean markdown before the data can be used for language model fine-tuning. This principle governs the entire first stage of the pipeline: HTML-to-markdown conversion, code block preservation, blocked word filtering, duplicate removal, and role alternation validation.
Description
The ShareGPT dataset consists of user-shared ChatGPT conversations exported as HTML. Before these conversations can serve as training data for models like Vicuna, the HTML content must undergo rigorous cleaning. This principle encompasses several sub-concerns:
HTML-to-Markdown Conversion
Raw conversation values contain HTML container tags such as `<div>` and `<span>`. The cleaning process strips these container elements first (since they interfere with code block indentation and underscore rendering), then uses BeautifulSoup (bs4) for HTML parsing and markdownify for converting the remaining HTML structure into well-formed Markdown text.
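As a minimal sketch of the container-stripping step (the `strip_containers` helper name is illustrative; the regex-substitution approach is an assumption consistent with the description above):

```python
import re

# Hypothetical sketch: remove <div>/<span> wrappers before markdownify runs,
# since these container tags interfere with code indentation and underscores.
div_pattern = re.compile(r"<div.*?>")
span_pattern = re.compile(r"<span.*?>")

def strip_containers(html: str) -> str:
    """Strip div/span opening tags by regex and drop the matching close tags."""
    html = div_pattern.sub("", html)
    html = span_pattern.sub("", html)
    return html.replace("</div>", "").replace("</span>", "")
```

The remaining markup would then be handed to bs4/markdownify for the actual HTML-to-Markdown conversion.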
Code Block Preservation
ShareGPT's HTML export format encodes code blocks in a non-standard way, with language identifiers concatenated with "Copy code" strings. The cleaning process uses regex-based pattern matching to detect these malformed code blocks and reformat them into standard Markdown fenced code blocks (triple-backtick syntax with language annotations). Patterns such as `` ```\s*(.*?)(?:Copy code)+(.+?)\s*?``` `` are matched and rewritten to preserve both the programming-language label and the code content.
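A sketch of that reformatting step (the replacement template is an assumption based on the description; the fence string is built programmatically here only to avoid nesting backticks in this page):

```python
import re

FENCE = "`" * 3  # a literal triple-backtick fence

# Hedged sketch: detect "languageCopy codebody" runs and rebuild them as
# standard fenced code blocks with the language label on the opening fence.
code_lang_pattern = re.compile(
    FENCE + r"\s*(.*?)(?:Copy code)+(.+?)\s*?" + FENCE, re.DOTALL
)

def reformat_code(val: str) -> str:
    # Group 1 is the language label, group 2 is the code body.
    return code_lang_pattern.sub(FENCE + r"\g<1>" + "\n" + r"\g<2>" + "\n" + FENCE, val)
```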
Blocked Word Filtering
Conversations containing references to "openai" or "chatgpt" (case-insensitive) are filtered out entirely. This prevents the fine-tuned model from learning to identify itself as ChatGPT or reference OpenAI. Additionally, blocked responses from GPT (such as rate-limit error messages like "Too many requests in 1 hour") are detected and cause the entire conversation to be excluded.
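The word-level check can be sketched as follows (the helper name is illustrative; the word list mirrors the description above):

```python
# Illustrative sketch of case-insensitive blocked-word filtering.
BLOCKED_WORDS = ["openai", "chatgpt"]

def contains_blocked_words(text: str) -> bool:
    # A single hit anywhere in the text excludes the whole conversation.
    lowered = text.lower()
    return any(word in lowered for word in BLOCKED_WORDS)
```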
Duplicate Removal
Deduplication operates at two levels:
- ID-based deduplication: Conversations with the same `id` field are detected and only the first occurrence is kept.
- Content-based deduplication: Conversations whose first human message and first GPT response form an identical key pair are treated as value duplicates and removed.
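Assuming each record is a dict with an `id` field and a `conversations` list of `{"from", "value"}` turns (the record shape and helper name are illustrative, not the implementation's API), the two-level pass could be sketched as:

```python
# Hedged sketch of the two-level deduplication pass.
def dedup(records):
    seen_ids, seen_pairs, kept = set(), set(), []
    for rec in records:
        if rec["id"] in seen_ids:
            continue  # ID-based duplicate: keep only the first occurrence
        turns = rec["conversations"]
        # Key pair: first human message + first GPT response.
        key = (turns[0]["value"], turns[1]["value"]) if len(turns) >= 2 else None
        if key is not None and key in seen_pairs:
            continue  # content-based duplicate
        seen_ids.add(rec["id"])
        if key is not None:
            seen_pairs.add(key)
        kept.append(rec)
    return kept
```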
Role Alternation Validation
Valid training conversations must strictly alternate between "human" and "gpt" roles. The cleaning process enforces this by:
- Trimming a leading non-human turn (if the first message is from "gpt", it is removed).
- Trimming a trailing human turn (if the last message is from "human", it is removed).
- Verifying that the remaining turns strictly alternate roles; any conversation that fails this check is discarded.
- Ensuring the final conversation has an even number of turns (complete human/gpt pairs).
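The four rules above can be sketched as a single validator (the function name and the `None`-on-failure convention are assumptions for illustration):

```python
# Hedged sketch of role-alternation validation over {"from", "value"} turns.
def validate_roles(turns):
    # Trim a leading gpt turn and a trailing human turn.
    if turns and turns[0]["from"] == "gpt":
        turns = turns[1:]
    if turns and turns[-1]["from"] == "human":
        turns = turns[:-1]
    # Require an even number of turns (complete human/gpt pairs).
    if len(turns) % 2 != 0:
        return None
    # Require strict human/gpt alternation; discard on any violation.
    expected = ["human", "gpt"]
    for i, turn in enumerate(turns):
        if turn["from"] != expected[i % 2]:
            return None
    return turns
```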
Parallel Processing
The cleaning operation uses Python's ProcessPoolExecutor to parallelize per-conversation HTML cleaning across all available CPU cores. Each conversation is cleaned independently, and the results are aggregated for deduplication in the main process.
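A schematic of the fan-out (the worker body here is a placeholder, not the real cleaning routine):

```python
from concurrent.futures import ProcessPoolExecutor

def clean_one(conversation: str) -> str:
    # Placeholder for the per-conversation HTML-cleaning steps described above.
    return conversation.strip()

def clean_all(conversations, workers=None):
    # One independent task per conversation; map preserves input order,
    # and deduplication then runs over the results in the main process.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(clean_one, conversations))

if __name__ == "__main__":
    print(clean_all(["  a  ", " b"]))
```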
Usage
This principle is applied as the first step in the ShareGPT data pipeline, as orchestrated by fastchat/data/prepare_all.py. The pipeline command is:
`python3 -m fastchat.data.clean_sharegpt --in sharegpt_html.json --out sharegpt_clean.json`
The principle ensures that downstream pipeline stages (language filtering, conversation splitting, format validation) receive consistently structured, clean Markdown conversations free of HTML artifacts, duplicates, and blocked content.
Theoretical Basis
The need for HTML cleaning in NLP training data pipelines arises from the fundamental mismatch between web-scraped data formats and model training requirements. Language models trained on raw HTML learn to reproduce markup artifacts, degrading output quality. The specific design choices in this principle are grounded in:
- Data quality filtering: Removing blocked words prevents identity confusion in the fine-tuned model (a known issue in instruction-tuned LLMs).
- Deduplication: Duplicate training examples cause the model to overfit on specific patterns, reducing generalization. Both ID-level and content-level deduplication address different sources of redundancy.
- Format normalization: Converting HTML to Markdown provides a consistent, human-readable text format that aligns with the tokenization expectations of transformer-based language models.
- Structural validation: Enforcing strict role alternation ensures that the model learns proper conversational turn-taking, which is critical for chat-based fine-tuning objectives.
Related Pages
- Implementation:Lm_sys_FastChat_Clean_ShareGPT -- The implementation that realizes this principle
- Principle:Lm_sys_FastChat_Language_Based_Filtering -- The next pipeline stage: language-based filtering
- Principle:Lm_sys_FastChat_Conversation_Format_Validation -- Further format validation downstream