Workflow:LLMBook zh LLMBook zh github io Data Preprocessing Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP, LLM_Ops |
| Last Updated | 2026-02-08 04:30 GMT |
Overview
End-to-end data preprocessing pipeline for preparing raw text corpora for large language model pre-training, covering quality filtering, deduplication, privacy protection, and tokenizer construction.
Description
This workflow outlines the standard data preprocessing steps required before training a large language model. Raw text data collected from the internet or other sources must be cleaned, deduplicated, anonymized, and tokenized before it can be fed into a training pipeline. The workflow implements four sequential stages: language-based quality filtering using fastText classifiers, n-gram Jaccard similarity deduplication to remove near-duplicate lines, PII masking to redact sensitive information such as Chinese national ID card numbers, and Byte Pair Encoding (BPE) tokenizer training from scratch. Each stage progressively improves data quality and prepares it for efficient model consumption.
Usage
Execute this workflow when you have collected a raw text corpus (e.g., web crawl data, document collections) and need to prepare it for LLM pre-training. Preprocessing should be completed before model training begins. The pipeline is particularly relevant when working with multilingual data that requires language identification, or with Chinese-language corpora that may contain PII such as national ID card numbers.
Execution Steps
Step 1: Quality Filtering
Identify the language of each text passage using a pre-trained fastText language identification model. The classifier returns a confidence score for each candidate language. If the highest score falls below a rejection threshold (e.g., 0.5), the passage is labeled as unknown. Passages whose detected language is not in the list of desired languages are discarded entirely.
Key considerations:
- Load the fastText language identification model (lid.176.bin) once and reuse across all passages
- Configure the acceptance language list based on your target corpus language(s)
- Set an appropriate rejection threshold to filter out low-confidence or garbled text
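The filtering logic above can be sketched as follows. This is a minimal illustration, not the workflow's actual code: the function names (`make_fasttext_predictor`, `detect_language`, `filter_by_language`) are hypothetical, and the predictor is passed in as a callable so the thresholding logic can be exercised without the model file. The fastText helper assumes `lid.176.bin` has been downloaded locally and that the `fasttext` Python package is installed.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

# A predictor maps a text to (labels, scores), mirroring fastText's predict() output.
Prediction = Tuple[Sequence[str], Sequence[float]]
Predictor = Callable[[str], Prediction]


def make_fasttext_predictor(model_path: str = "lid.176.bin", k: int = 5) -> Predictor:
    """Load the model once and return a reusable predictor (hypothetical helper)."""
    import fasttext  # imported lazily so the rest of the sketch runs without it

    model = fasttext.load_model(model_path)

    def predict(text: str) -> Prediction:
        # fastText predicts one line at a time, so newlines are replaced first.
        return model.predict(text.replace("\n", " "), k=k)

    return predict


def detect_language(text: str, predict: Predictor, reject_threshold: float = 0.5) -> str:
    """Return the top language code, or 'unknown' if its score is below the threshold."""
    labels, scores = predict(text)
    if len(labels) == 0:
        return "unknown"
    best = max(range(len(labels)), key=lambda i: scores[i])
    if scores[best] < reject_threshold:
        return "unknown"
    # fastText labels look like '__label__en'; strip the prefix.
    return labels[best].replace("__label__", "")


def filter_by_language(
    texts: Iterable[str],
    predict: Predictor,
    accept_langs: Iterable[str],
    reject_threshold: float = 0.5,
) -> List[str]:
    """Keep only passages whose detected language is in the acceptance list."""
    accepted = set(accept_langs)
    return [t for t in texts if detect_language(t, predict, reject_threshold) in accepted]
```

With the real model, usage would look like `filter_by_language(passages, make_fasttext_predictor(), ["zh", "en"])`. Passages labeled unknown are dropped unless `"unknown"` is itself placed in the acceptance list.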
Step 2: Deduplication
Remove near-duplicate lines within each document using n-gram Jaccard similarity. The text is split into lines, and each line is further split into n-grams based on punctuation and whitespace delimiters. Adjacent lines are compared by computing the Jaccard similarity of their n-gram sets. Lines whose similarity to the previous retained line exceeds the threshold (e.g., 0.95) are removed.
Key considerations:
- Use both Chinese and English punctuation as gram delimiters for multilingual support
- The default n-gram size is 5 and the similarity threshold is 0.95
- Each line is compared only with the most recently retained line, so the pass is linear in the number of lines; duplicates separated by other lines are not detected
- Retained lines are reassembled using the original line delimiter
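The line-level deduplication described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the delimiter set in `_DELIMS`, the handling of lines shorter than `n` grams, and the decision to keep blank lines are choices made here, not details confirmed by the source. The function names are hypothetical.

```python
import re
from typing import FrozenSet, List, Optional, Tuple

# Assumed delimiter set: whitespace plus common English and Chinese punctuation.
_DELIMS = re.compile(r"[\s,.!?;:，。！？；：、]+")

NGramSet = FrozenSet[Tuple[str, ...]]


def line_ngrams(line: str, n: int = 5) -> NGramSet:
    """Split a line into grams on the delimiters, then build the set of n-grams."""
    grams = [g for g in _DELIMS.split(line) if g]
    if not grams:
        return frozenset()
    if len(grams) < n:
        # Assumption: a short line is represented by a single shorter gram tuple.
        return frozenset({tuple(grams)})
    return frozenset(tuple(grams[i:i + n]) for i in range(len(grams) - n + 1))


def jaccard(a: NGramSet, b: NGramSet) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets score 0.0 here."""
    union = a | b
    if not union:
        return 0.0  # assumption: blank lines are never treated as duplicates
    return len(a & b) / len(union)


def dedup_lines(text: str, n: int = 5, threshold: float = 0.95, line_delim: str = "\n") -> str:
    """Drop each line that is too similar to the previous retained line."""
    kept: List[str] = []
    prev: Optional[NGramSet] = None
    for line in text.split(line_delim):
        grams = line_ngrams(line, n)
        if prev is not None and jaccard(grams, prev) > threshold:
            continue  # near-duplicate of the last kept line
        kept.append(line)
        prev = grams
    return line_delim.join(kept)
```

Because only the last retained line's n-gram set is held in memory, the pass stays linear in the number of lines. Catching duplicates that are far apart in a document would need a different technique.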
Step 3: Privacy Filtering
Mask personally identifiable information (PII) in the text using regular expression matching. For Chinese-language corpora, this primarily involves detecting and redacting national ID card numbers (18 characters: 17 digits followed by a check character that is either a digit or the letter X). Matched PII strings are replaced with a placeholder token to prevent the model from memorizing sensitive data.
Key considerations:
- The regex pattern targets Chinese national ID card numbers specifically
- The replacement placeholder (e.g., MASKED IDCARD) should be consistent across the corpus
- Additional PII patterns (phone numbers, email addresses) can be added to the regex set as needed
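A minimal sketch of the masking step is shown below. The regular expression is an assumption written for illustration (17 digits plus a digit or `X`, not embedded in a longer digit run); it is not the workflow's actual pattern and does not validate the region code, birth date, or checksum. The names `ID_CARD_PATTERN`, `PII_RULES`, and `mask_pii` are hypothetical, and the placeholder string is taken as written in the text above.

```python
import re
from typing import List, Pattern, Tuple

# Assumed pattern: 17 ASCII digits followed by a digit or X/x check character.
# The lookarounds prevent matching inside a longer run of digits.
ID_CARD_PATTERN = re.compile(r"(?<![0-9])[0-9]{17}[0-9Xx](?![0-9])")

# Each rule pairs a compiled pattern with its placeholder. Further rules
# (phone numbers, email addresses) could be appended to this list.
PII_RULES: List[Tuple[Pattern[str], str]] = [
    (ID_CARD_PATTERN, "MASKED IDCARD"),
]


def mask_pii(text: str, rules: List[Tuple[Pattern[str], str]] = PII_RULES) -> str:
    """Replace every match of every rule with that rule's placeholder."""
    for pattern, placeholder in rules:
        text = pattern.sub(placeholder, text)
    return text
```

Keeping the rules in one list makes it easier to apply the same placeholder consistently across the corpus and to audit which patterns are active.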
Step 4: BPE Tokenizer Training
Build a Byte Pair Encoding (BPE) vocabulary from the cleaned corpus. The algorithm starts with individual characters as initial tokens, then iteratively finds the most frequent pair of adjacent tokens and merges it into a single new token, which is added to the vocabulary. Because earlier merges produce multi-character tokens, later merges can combine these into longer subwords. The process continues for a specified number of merge operations, producing a vocabulary that efficiently encodes common subword patterns.
Key considerations:
- Each text is first split into individual characters with a special end-of-word marker
- The number of merge operations determines the final vocabulary size: each merge adds at most one token to the initial character set
- The algorithm greedily selects the globally most frequent pair at each step
- The resulting vocabulary captures morphological patterns and common substrings
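The training loop described above can be sketched as follows. This is a minimal illustration of classic character-level BPE, not the workflow's actual implementation. It assumes words are obtained by splitting on whitespace (text without spaces, such as Chinese, would need a different pre-segmentation), uses `</w>` as the end-of-word marker, and breaks frequency ties lexicographically so that runs are deterministic. All function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

END = "</w>"  # assumed end-of-word marker
Word = Tuple[str, ...]
Pair = Tuple[str, str]


def build_word_freqs(texts: Iterable[str]) -> Counter:
    """Split each text on whitespace; represent each word as characters plus END."""
    freqs: Counter = Counter()
    for text in texts:
        for word in text.split():
            freqs[tuple(word) + (END,)] += 1
    return freqs


def count_pairs(word_freqs: Counter) -> Counter:
    """Count adjacent token pairs, weighted by word frequency."""
    pairs: Counter = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair: Pair, word_freqs: Counter) -> Counter:
    """Rewrite every word, replacing each occurrence of the pair with one token."""
    a, b = pair
    merged: Counter = Counter()
    for symbols, freq in word_freqs.items():
        out: List[str] = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] += freq
    return merged


def train_bpe(texts: Iterable[str], num_merges: int) -> Tuple[List[Pair], Set[str]]:
    """Greedily merge the globally most frequent pair, num_merges times at most."""
    word_freqs = build_word_freqs(texts)
    vocab: Set[str] = {s for word in word_freqs for s in word}
    merges: List[Pair] = []
    for _ in range(num_merges):
        pairs = count_pairs(word_freqs)
        if not pairs:
            break  # every word is already a single token
        # Most frequent pair; ties broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        word_freqs = merge_pair(best, word_freqs)
        merges.append(best)
        vocab.add(best[0] + best[1])
    return merges, vocab
```

The returned `merges` list preserves the order in which pairs were learned, which is what an encoder would replay when tokenizing new text. Recounting all pairs on every iteration keeps the sketch short but is slow on large corpora; production tokenizer libraries update pair counts incrementally.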