Principle:PacktPublishing LLM Engineers Handbook Document Cleaning
| Concept | Text preprocessing / document normalization |
|---|---|
| Workflow | Feature_Engineering |
| Pipeline Stage | Data Preprocessing |
| Repository | PacktPublishing/LLM-Engineers-Handbook |
| Implemented By | Implementation:PacktPublishing_LLM_Engineers_Handbook_CleaningDispatcher_Dispatch |
Overview
Document Cleaning is a critical preprocessing step in NLP and LLM pipelines that transforms raw, noisy text into a clean, normalized form suitable for downstream processing. Raw documents crawled from the web often contain HTML artifacts, inconsistent whitespace, special characters, encoding issues, and other noise that can degrade the quality of embeddings and fine-tuning data.
Theory
Data Cleaning in NLP pipelines involves transforming raw, noisy text into a clean, normalized representation. The goal is to remove irrelevant artifacts while preserving the semantic content of the document. This step directly impacts the quality of all downstream outputs — embeddings generated from noisy text will encode noise, and fine-tuning datasets built from uncleaned documents will teach the model to reproduce that noise.
Dispatcher (Factory) Pattern
The Document Cleaning principle uses the Dispatcher pattern (a variant of the Factory pattern) to route documents of different categories to specialized cleaning handlers. This is necessary because different document types require different cleaning strategies:
- Articles — May contain HTML tags, navigation elements, advertisement text, and formatting artifacts from web scraping
- Social media posts — May contain hashtags, mentions, emoji sequences, URL shorteners, and platform-specific formatting
- Code repositories — May contain build artifacts, binary file references, auto-generated comments, and non-code content
By dispatching to category-specific handlers, each handler can implement cleaning logic tailored to its document type without polluting a single monolithic cleaning function.
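The dispatcher described above can be sketched as follows. This is an illustrative outline, not the repository's actual API: the handler classes and the `dispatch` signature here are hypothetical (the real `CleaningDispatcher.dispatch` accepts a document object, as shown in the Example section).

```python
# Hypothetical sketch of the dispatcher (factory) pattern described above.
# Class names and the dispatch signature are illustrative assumptions.
class ArticleCleaningHandler:
    def clean(self, text: str) -> str:
        # Article-specific cleaning (e.g., stripping HTML artifacts) goes here.
        return text.strip()


class PostCleaningHandler:
    def clean(self, text: str) -> str:
        # Social-media-specific cleaning (e.g., normalizing hashtags) goes here.
        return text.strip()


class CleaningDispatcher:
    # Factory dictionary: maps a document category to its handler class.
    # Supporting a new category only requires registering a new entry here.
    _handlers = {
        "article": ArticleCleaningHandler,
        "post": PostCleaningHandler,
    }

    @classmethod
    def dispatch(cls, category: str, text: str) -> str:
        handler_cls = cls._handlers.get(category)
        if handler_cls is None:
            raise ValueError(f"No cleaning handler registered for {category!r}")
        return handler_cls().clean(text)
```

The factory dictionary is what makes the pattern extensible: each handler stays focused on one document type, and the routing logic never needs to change when a new type is added.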
Common Cleaning Operations
Typical operations in a document cleaning pipeline include:
- HTML stripping — Removing HTML tags and entities while preserving text content
- Whitespace normalization — Collapsing multiple spaces, tabs, and newlines into single spaces or standardized line breaks
- Special character removal — Stripping non-printable characters, control characters, and encoding artifacts
- Unicode normalization — Converting text to a consistent Unicode normalization form (e.g., NFC or NFKC)
- Content extraction — Removing boilerplate content such as headers, footers, navigation menus, and advertisements
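A minimal cleaning pass combining several of the operations listed above might look like this. This is a sketch, not the handbook's implementation: the regex-based HTML stripping is deliberately naive (a real pipeline would use a proper HTML parser for content extraction), and the function name is illustrative.

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    """Illustrative cleaning pass: HTML stripping, Unicode normalization,
    control-character removal, and whitespace normalization."""
    # HTML stripping (naive; a real pipeline would use an HTML parser).
    text = re.sub(r"<[^>]+>", " ", text)
    # Unicode normalization: NFKC folds compatibility characters
    # (e.g., the ligature "ﬁ" becomes "fi").
    text = unicodedata.normalize("NFKC", text)
    # Special character removal: drop control and other non-printable
    # characters (Unicode category "C*"), keeping whitespace for the next step.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C")
    )
    # Whitespace normalization: collapse runs of spaces/tabs/newlines.
    text = re.sub(r"\s+", " ", text)
    return text.strip()
```

Note that the order of operations matters: stripping tags before collapsing whitespace ensures the gaps left by removed markup are normalized away.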
Why Cleaning Matters
The quality of cleaned documents has a cascading effect on the entire feature engineering pipeline:
- Chunking quality — Noisy text produces poor chunk boundaries, leading to semantically incoherent segments
- Embedding quality — Embeddings encode noise alongside signal, reducing retrieval accuracy in RAG systems
- Fine-tuning quality — Training data with artifacts teaches the model to generate noisy outputs
- Token efficiency — Noise consumes tokens in the context window without contributing useful information
Design Considerations
- Immutability — Cleaning produces new document objects (e.g., CleanedArticleDocument) rather than mutating the originals, preserving the raw data for debugging and reprocessing.
- Type transformation — The cleaning step transforms NoSQLBaseDocument instances (raw) into VectorBaseDocument instances (cleaned), marking a transition from the document store domain to the vector store domain.
- Extensibility — New document categories can be supported by adding a new cleaning handler and registering it in the dispatcher's factory dictionary.
- Idempotency — Cleaning operations should be idempotent: applying them multiple times should produce the same result as applying them once.
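The idempotency property is easy to check mechanically: cleaning an already-cleaned document should be a no-op. The helper below is a hypothetical illustration of such a check, using whitespace normalization as the example cleaning step.

```python
import re


def normalize_ws(text: str) -> str:
    # Minimal cleaning step used to illustrate the idempotency property.
    return re.sub(r"\s+", " ", text).strip()


def is_idempotent(clean_fn, sample: str) -> bool:
    """Return True if applying clean_fn twice equals applying it once."""
    once = clean_fn(sample)
    return clean_fn(once) == once
```

A check like this fits naturally into a unit test suite for each cleaning handler, run against a corpus of representative raw documents.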
Usage
Use the Document Cleaning pattern when:
- Raw crawled documents need normalization before chunking, embedding, or dataset generation
- Different document types require different cleaning strategies
- You need to preserve raw documents while creating cleaned derivatives
- Building a preprocessing pipeline where data quality directly impacts model performance
Example
from llm_engineering.application.preprocessing.dispatchers import CleaningDispatcher
from llm_engineering.domain.documents import ArticleDocument
# Retrieve raw documents
raw_articles = ArticleDocument.bulk_find(author_id=author_uuid)
# Clean each document through the dispatcher
cleaned_documents = []
for article in raw_articles:
    cleaned = CleaningDispatcher.dispatch(article)
    cleaned_documents.append(cleaned)