Workflow: LLMBook-zh/LLMBook-zh.github.io Data Preprocessing Pipeline

Knowledge Sources	LLMBook-zh, fastText Language ID, A Survey of Large Language Models
Domains	Data_Engineering, NLP, LLM_Ops
Last Updated	2026-02-08 04:30 GMT

Overview

End-to-end data preprocessing pipeline for preparing raw text corpora for large language model pre-training, covering quality filtering, deduplication, privacy protection, and tokenizer construction.

Description

This workflow outlines the standard data preprocessing steps required before training a large language model. Raw text data collected from the internet or other sources must be cleaned, deduplicated, anonymized, and tokenized before it can be fed into a training pipeline. The workflow implements four sequential stages: language-based quality filtering using fastText classifiers, n-gram Jaccard similarity deduplication to remove near-duplicate lines, PII masking to redact sensitive information such as Chinese national ID card numbers, and Byte Pair Encoding (BPE) tokenizer training from scratch. Each stage progressively improves data quality and prepares it for efficient model consumption.
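
As a rough illustration of how the per-passage stages could be chained, the sketch below treats each stage as a callable that returns the processed text, or None to drop the passage. The names Stage and run_pipeline are hypothetical and are not taken from the LLMBook-zh code. Tokenizer training is not a per-passage stage; it would consume the output of this chain as a whole.

```python
from typing import Callable, Iterable, Iterator, Optional, Sequence

# Hypothetical convention: a stage returns the (possibly modified) passage,
# or None when the passage should be dropped from the corpus.
Stage = Callable[[str], Optional[str]]


def run_pipeline(passages: Iterable[str], stages: Sequence[Stage]) -> Iterator[str]:
    """Apply the stages in order and yield only the passages that survive all of them."""
    for text in passages:
        for stage in stages:
            text = stage(text)
            if text is None:
                break  # an earlier stage rejected the passage; skip later stages
        else:
            # Reached only when no stage returned None.
            yield text
```

With this shape, quality filtering would return None for rejected passages, while deduplication and PII masking would return rewritten text.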

Usage

Execute this workflow when you have collected a raw text corpus (e.g., web crawl data, document collections) and need to prepare it for LLM pre-training. This is the mandatory first step before any model training can begin. The pipeline is particularly relevant when working with multilingual data that requires language identification, or with Chinese-language corpora that may contain PII such as national ID card numbers.

Execution Steps

Step 1: Quality Filtering

Evaluate each text passage using a pre-trained fastText language identification model. The classifier returns a confidence score for each candidate language. Passages whose top-scoring language has a confidence below a rejection threshold (e.g., 0.5) are labeled as unknown. Passages whose detected language is not in the accepted language list are discarded entirely.

Key considerations:

Load the fastText language identification model (lid.176.bin) once and reuse across all passages
Configure the acceptance language list based on your target corpus language(s)
Set an appropriate rejection threshold to filter out low-confidence or garbled text
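
The filtering rule described above can be sketched roughly as follows. This is a minimal sketch, not the workflow's actual code: the function names, the default accepted languages, and the model path are assumptions. It relies on the fastText Python package, whose predict call returns labels prefixed with `__label__` together with their scores.

```python
from typing import Iterable, List, Sequence


def pick_language(labels: Sequence[str], scores: Sequence[float],
                  reject_threshold: float = 0.5) -> str:
    """Return the top language code, or "unknown" when confidence is too low."""
    if len(labels) == 0:
        return "unknown"
    best_label, best_score = max(zip(labels, scores), key=lambda pair: pair[1])
    if best_score < reject_threshold:
        return "unknown"
    return best_label.replace("__label__", "")


def keep_passage(lang: str, accepted: Iterable[str]) -> bool:
    """A passage is kept only if its detected language is in the accepted list."""
    return lang in set(accepted)


def filter_passages(passages: Iterable[str], model_path: str = "lid.176.bin",
                    accepted: Sequence[str] = ("zh", "en"),
                    reject_threshold: float = 0.5) -> List[str]:
    """Hypothetical driver: model_path and accepted are illustrative defaults."""
    import fasttext  # imported lazily so the helpers above work without it

    model = fasttext.load_model(model_path)  # load once, reuse for every passage
    kept = []
    for text in passages:
        # predict() works on a single line, so newlines are replaced first.
        labels, scores = model.predict(text.replace("\n", " "), k=1)
        if keep_passage(pick_language(labels, scores, reject_threshold), accepted):
            kept.append(text)
    return kept
```

Because "unknown" is not in the accepted list by default, low-confidence passages are dropped along with passages in unwanted languages.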

Step 2: Deduplication

Remove near-duplicate lines within each document using n-gram Jaccard similarity. The text is split into lines, and each line is further split into n-grams based on punctuation and whitespace delimiters. Adjacent lines are compared by computing the Jaccard similarity of their n-gram sets. Lines whose similarity to the previous retained line exceeds the threshold (e.g., 0.95) are removed.

Key considerations:

Use both Chinese and English punctuation as gram delimiters for multilingual support
The default n-gram size is 5 and the similarity threshold is 0.95
Only adjacent lines are compared, making this a linear-time pass through the document
Retained lines are reassembled using the original line delimiter
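
A minimal sketch of this line-level deduplication is shown below. The delimiter set, the function names, and the handling of short or empty lines are assumptions made for illustration, so the original implementation may differ in these details. Each line is compared with the previous retained line, as described above.

```python
import re

# Assumed delimiter set: whitespace plus common English and Chinese punctuation.
DELIMS = re.compile(r"[\s,.!?;:,。!?;:、]+")


def line_ngrams(line: str, n: int = 5) -> set:
    """Split a line on delimiters and return its set of n-grams (as tuples)."""
    tokens = [t for t in DELIMS.split(line) if t]
    if not tokens:
        return set()
    if len(tokens) <= n:
        # Assumption: a short line is represented by one gram holding all its tokens.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dedup_lines(text: str, n: int = 5, threshold: float = 0.95,
                delimiter: str = "\n") -> str:
    """Drop lines that are near-duplicates of the previous retained line."""
    kept, prev = [], None
    for line in text.split(delimiter):
        grams = line_ngrams(line, n)
        if prev is not None and jaccard(grams, prev) > threshold:
            continue  # too similar to the last retained line
        kept.append(line)
        prev = grams
    return delimiter.join(kept)
```

In this sketch blank lines are always retained, because their similarity to any other line is defined as 0.0.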

Step 3: Privacy Filtering

Mask personally identifiable information (PII) in the text using regular expression matching. For Chinese-language corpora, this primarily involves detecting and redacting national ID card numbers, which are 18 characters long: 17 digits followed by a check character that is either a digit or the letter X. Matched PII strings are replaced with a placeholder token to prevent the model from memorizing sensitive data.

Key considerations:

The regex pattern targets Chinese national ID card numbers specifically
The replacement placeholder (e.g., MASKED IDCARD) should be consistent across the corpus
Additional PII patterns (phone numbers, email addresses) can be added to the regex set as needed
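
The masking step might look like the sketch below. The pattern only checks the overall shape of an ID number (17 digits plus a digit or X, not embedded in a longer digit run) and does not validate the region code, birth date, or checksum. The placeholder string and the function name are assumptions chosen for illustration.

```python
import re

# Assumed placeholder; use whatever token the rest of the corpus uses.
PLACEHOLDER = "[MASKED_IDCARD]"

# 17 digits followed by a digit or X/x, not preceded or followed by another digit.
ID_CARD_PATTERN = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")


def mask_id_cards(text: str, placeholder: str = PLACEHOLDER) -> str:
    """Replace every string shaped like a Chinese national ID number."""
    # A function is used as the replacement so the placeholder is inserted literally.
    return ID_CARD_PATTERN.sub(lambda match: placeholder, text)
```

The lookarounds keep the pattern from firing inside longer digit sequences such as order numbers, at the cost of missing ID numbers that are glued to other digits.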

Step 4: BPE Tokenizer Training

Build a Byte Pair Encoding (BPE) vocabulary from the cleaned corpus. The algorithm starts with individual characters as the initial tokens, then repeatedly finds the most frequent pair of adjacent tokens and merges it. Each merge adds one new token to the vocabulary. The process continues for a specified number of merge operations, producing a vocabulary that efficiently encodes common subword patterns.

Key considerations:

Each word in the text is first split into individual characters, with a special end-of-word marker appended
The number of merge operations controls the final vocabulary size (roughly the number of initial characters plus the number of merges)
The algorithm greedily selects the globally most frequent pair at each step
The resulting vocabulary captures morphological patterns and common substrings
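
The training loop can be sketched as below. This is a minimal character-level variant that assumes whitespace-separated words and uses `</w>` as the end-of-word marker. The function names and the tie-breaking behavior (the first pair encountered wins) are assumptions, not details confirmed by the source.

```python
from collections import Counter

END_OF_WORD = "</w>"  # assumed marker string


def build_word_vocab(texts):
    """Map each word, as a tuple of characters plus the marker, to its frequency."""
    vocab = Counter()
    for text in texts:
        for word in text.split():
            vocab[tuple(word) + (END_OF_WORD,)] += 1
    return vocab


def count_pairs(vocab):
    """Count adjacent symbol pairs over all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs


def merge_pair(pair, vocab):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = Counter()
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] += freq
    return merged


def train_bpe(texts, num_merges):
    """Greedily merge the globally most frequent pair, num_merges times at most."""
    vocab = build_word_vocab(texts)
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab
```

For example, `train_bpe(["abx aby abz"], 1)` selects the pair `("a", "b")`, since it occurs three times while every other pair occurs once. The returned merge list, applied in order, is what a tokenizer would later replay to segment new text.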
