Workflow:Lm sys FastChat ShareGPT Data Pipeline

Knowledge Sources	FastChat BeautifulSoup HuggingFace Tokenizers
Domains	Data_Engineering, NLP, LLMs
Last Updated	2026-02-07 04:00 GMT

Overview

End-to-end data cleaning and preparation pipeline for transforming raw ShareGPT conversation data into training-ready datasets for Vicuna-style fine-tuning.

Description

This workflow implements the complete data processing chain used to prepare ShareGPT conversation data for supervised fine-tuning. The pipeline converts HTML-formatted conversations to clean markdown, filters by language, splits long conversations to fit within model context windows, removes malformed entries, creates train/test splits, injects identity-aware hardcoded Q&A pairs, and merges datasets. The orchestrator script (prepare_all.py) runs all steps sequentially. Individual steps can also be run independently for custom pipeline configurations.

Usage

Execute this workflow when you have raw ShareGPT-format conversation data (typically containing HTML markup) and need to produce clean, properly formatted training data for Vicuna or similar chat model fine-tuning. This is a prerequisite step before running the SFT or LoRA fine-tuning workflows with real data.

Execution Steps

Step 1: Install Dependencies

Install the data processing dependencies required for HTML cleaning and language detection. These are additional packages beyond the base FastChat installation.

Key considerations:

Required packages: bs4 (BeautifulSoup), markdownify for HTML-to-markdown conversion
Language detection requires: polyglot, pyicu, pycld2
These can be installed via pip alongside the base FastChat package

Step 2: HTML Cleaning

Convert HTML-formatted conversation content to clean markdown text. The cleaner processes each message in each conversation, stripping HTML tags while preserving semantic structure through markdown formatting. Invalid or empty messages are removed.

Key considerations:

Input: Raw ShareGPT JSON with HTML-formatted messages
The cleaner uses BeautifulSoup for HTML parsing and markdownify for conversion
Conversations with no valid content after cleaning are dropped
Special HTML entities and formatting artifacts are normalized

Step 3: Language Filtering

Filter conversations by detected language. This step uses polyglot/pycld2 for language identification and can either keep or remove conversations in specified languages. This is useful for creating monolingual training sets or excluding low-resource languages.

Key considerations:

Uses pycld2 for language detection on conversation text
The --skip-lang flag specifies language codes to exclude (e.g., "ko" for Korean)
Language detection operates on the concatenated conversation text
Conversations that fail language detection are optionally retained or dropped

Step 4: Long Conversation Splitting

Split conversations that exceed the model's maximum sequence length into multiple shorter conversations. The splitter uses the target model's tokenizer to measure actual token counts and finds optimal split points at conversation turn boundaries.

Key considerations:

Requires a model tokenizer (--model-name) for accurate token counting
The --max-length parameter sets the target sequence length (default matches model context)
Splits occur at natural turn boundaries to preserve conversation coherence
Each split conversation retains the system prompt if present

Step 5: Format Validation

Filter out conversations with malformed structure. This step removes entries where the conversation format does not alternate correctly between human and assistant roles, or where required fields are missing.

Key considerations:

Validates that conversations follow the expected alternating role pattern
Removes entries with empty or null message values
Ensures the conversation list structure is well-formed

Step 6: Train/Test Split

Divide the cleaned dataset into training and test partitions. The default split ratio is 99% training and 1% test, producing two separate output files.

Key considerations:

Default ratio: 0.99 (99% train, 1% test)
The split is randomized
Output files are named with _train and _test suffixes

Step 7: Identity Data Injection and Merge

Generate hardcoded identity Q&A pairs (questions like "What is your name?") and merge them with the training data. This ensures the model correctly identifies itself as Vicuna rather than claiming to be the base model or another AI.

Key considerations:

Hardcoded questions cover identity, capabilities, and limitations
The hardcoded dataset is generated independently then merged with training data
Additional extraction steps can produce GPT-4-only subsets or single-round datasets
The merge step concatenates multiple JSON datasets into a single training file

Execution Diagram

GitHub URL

Workflow Repository