Principle:PacktPublishing LLM Engineers Handbook Document Cleaning
| Concept | Text preprocessing / document normalization |
|---|---|
| Workflow | Feature_Engineering |
| Pipeline Stage | Data Preprocessing |
| Repository | PacktPublishing/LLM-Engineers-Handbook |
| Implemented By | Implementation:PacktPublishing_LLM_Engineers_Handbook_CleaningDispatcher_Dispatch |
Overview
Document Cleaning is a critical preprocessing step in NLP and LLM pipelines that transforms raw, noisy text into a clean, normalized form suitable for downstream processing. Raw documents crawled from the web often contain HTML artifacts, inconsistent whitespace, special characters, encoding issues, and other noise that can degrade the quality of embeddings and fine-tuning data.
Theory
Data Cleaning in NLP pipelines involves transforming raw, noisy text into a clean, normalized representation. The goal is to remove irrelevant artifacts while preserving the semantic content of the document. This step directly impacts the quality of all downstream outputs — embeddings generated from noisy text will encode noise, and fine-tuning datasets built from uncleaned documents will teach the model to reproduce that noise.
Dispatcher (Factory) Pattern
The Document Cleaning principle uses the Dispatcher pattern (a variant of the Factory pattern) to route documents of different categories to specialized cleaning handlers. This is necessary because different document types require different cleaning strategies:
- Articles — May contain HTML tags, navigation elements, advertisement text, and formatting artifacts from web scraping
- Social media posts — May contain hashtags, mentions, emoji sequences, URL shorteners, and platform-specific formatting
- Code repositories — May contain build artifacts, binary file references, auto-generated comments, and non-code content
By dispatching to category-specific handlers, each handler can implement cleaning logic tailored to its document type without polluting a single monolithic cleaning function.
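The dispatcher described above can be sketched as follows. This is an illustrative outline, not the repository's actual API: the handler classes and the `dispatch` signature here are hypothetical (the real `CleaningDispatcher.dispatch` accepts a document object, as shown in the Example section).

```python
# Hypothetical sketch of the dispatcher (factory) pattern described above.
# Class names and the dispatch signature are illustrative assumptions.
class ArticleCleaningHandler:
    def clean(self, text: str) -> str:
        # Article-specific cleaning (e.g., stripping HTML artifacts) goes here.
        return text.strip()


class PostCleaningHandler:
    def clean(self, text: str) -> str:
        # Social-media-specific cleaning (e.g., normalizing hashtags) goes here.
        return text.strip()


class CleaningDispatcher:
    # Factory dictionary: maps a document category to its handler class.
    # Supporting a new category only requires registering a new entry here.
    _handlers = {
        "article": ArticleCleaningHandler,
        "post": PostCleaningHandler,
    }

    @classmethod
    def dispatch(cls, category: str, text: str) -> str:
        handler_cls = cls._handlers.get(category)
        if handler_cls is None:
            raise ValueError(f"No cleaning handler registered for {category!r}")
        return handler_cls().clean(text)
```

The factory dictionary is what makes the pattern extensible: each handler stays focused on one document type, and the routing logic never needs to change when a new type is added.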
Common Cleaning Operations
Typical operations in a document cleaning pipeline include:
- HTML stripping — Removing HTML tags and entities while preserving text content
- Whitespace normalization — Collapsing multiple spaces, tabs, and newlines into single spaces or standardized line breaks
- Special character removal — Stripping non-printable characters, control characters, and encoding artifacts
- Unicode normalization — Converting text to a consistent Unicode normalization form (e.g., NFC or NFKC)
- Content extraction — Removing boilerplate content such as headers, footers, navigation menus, and advertisements
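A minimal cleaning pass combining several of the operations listed above might look like this. This is a sketch, not the handbook's implementation: the regex-based HTML stripping is deliberately naive (a real pipeline would use a proper HTML parser for content extraction), and the function name is illustrative.

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    """Illustrative cleaning pass: HTML stripping, Unicode normalization,
    control-character removal, and whitespace normalization."""
    # HTML stripping (naive; a real pipeline would use an HTML parser).
    text = re.sub(r"<[^>]+>", " ", text)
    # Unicode normalization: NFKC folds compatibility characters
    # (e.g., the ligature "ﬁ" becomes "fi").
    text = unicodedata.normalize("NFKC", text)
    # Special character removal: drop control and other non-printable
    # characters (Unicode category "C*"), keeping whitespace for the next step.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C")
    )
    # Whitespace normalization: collapse runs of spaces/tabs/newlines.
    text = re.sub(r"\s+", " ", text)
    return text.strip()
```

Note that the order of operations matters: stripping tags before collapsing whitespace ensures the gaps left by removed markup are normalized away.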
Why Cleaning Matters
The quality of cleaned documents has a cascading effect on the entire feature engineering pipeline:
- Chunking quality — Noisy text produces poor chunk boundaries, leading to semantically incoherent segments
- Embedding quality — Embeddings encode noise alongside signal, reducing retrieval accuracy in RAG systems
- Fine-tuning quality — Training data with artifacts teaches the model to generate noisy outputs
- Token efficiency — Noise consumes tokens in the context window without contributing useful information
Design Considerations
- Immutability — Cleaning produces new document objects (e.g., CleanedArticleDocument) rather than mutating the originals, preserving the raw data for debugging and reprocessing.
- Type transformation — The cleaning step transforms NoSQLBaseDocument instances (raw) into VectorBaseDocument instances (cleaned), marking a transition from the document store domain to the vector store domain.
- Extensibility — New document categories can be supported by adding a new cleaning handler and registering it in the dispatcher's factory dictionary.
- Idempotency — Cleaning operations should be idempotent: applying them multiple times should produce the same result as applying them once.
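The idempotency property is easy to check mechanically: cleaning an already-cleaned document should be a no-op. The helper below is a hypothetical illustration of such a check, using whitespace normalization as the example cleaning step.

```python
import re


def normalize_ws(text: str) -> str:
    # Minimal cleaning step used to illustrate the idempotency property.
    return re.sub(r"\s+", " ", text).strip()


def is_idempotent(clean_fn, sample: str) -> bool:
    """Return True if applying clean_fn twice equals applying it once."""
    once = clean_fn(sample)
    return clean_fn(once) == once
```

A check like this fits naturally into a unit test suite for each cleaning handler, run against a corpus of representative raw documents.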
Usage
Use the Document Cleaning pattern when:
- Raw crawled documents need normalization before chunking, embedding, or dataset generation
- Different document types require different cleaning strategies
- You need to preserve raw documents while creating cleaned derivatives
- Building a preprocessing pipeline where data quality directly impacts model performance
Example
from llm_engineering.application.preprocessing.dispatchers import CleaningDispatcher
from llm_engineering.domain.documents import ArticleDocument
# Retrieve raw documents
raw_articles = ArticleDocument.bulk_find(author_id=author_uuid)
# Clean each document through the dispatcher
cleaned_documents = []
for article in raw_articles:
    cleaned = CleaningDispatcher.dispatch(article)
    cleaned_documents.append(cleaned)