
Principle:PacktPublishing LLM Engineers Handbook Document Cleaning

From Leeroopedia


Concept: Text preprocessing / document normalization
Workflow: Feature_Engineering
Pipeline Stage: Data Preprocessing
Repository: PacktPublishing/LLM-Engineers-Handbook
Implemented By: Implementation:PacktPublishing_LLM_Engineers_Handbook_CleaningDispatcher_Dispatch

Overview

Document Cleaning is a critical preprocessing step in NLP and LLM pipelines that transforms raw, noisy text into a clean, normalized form suitable for downstream processing. Raw documents crawled from the web often contain HTML artifacts, inconsistent whitespace, special characters, encoding issues, and other noise that can degrade the quality of embeddings and fine-tuning data.

Theory

Data Cleaning in NLP pipelines involves transforming raw, noisy text into a clean, normalized representation. The goal is to remove irrelevant artifacts while preserving the semantic content of the document. This step directly impacts the quality of all downstream outputs — embeddings generated from noisy text will encode noise, and fine-tuning datasets built from uncleaned documents will teach the model to reproduce that noise.

Dispatcher (Factory) Pattern

The Document Cleaning principle uses the Dispatcher pattern (a variant of the Factory pattern) to route documents of different categories to specialized cleaning handlers. This is necessary because different document types require different cleaning strategies:

  • Articles — May contain HTML tags, navigation elements, advertisement text, and formatting artifacts from web scraping
  • Social media posts — May contain hashtags, mentions, emoji sequences, URL shorteners, and platform-specific formatting
  • Code repositories — May contain build artifacts, binary file references, auto-generated comments, and non-code content

By dispatching to category-specific handlers, each handler can implement cleaning logic tailored to its document type without polluting a single monolithic cleaning function.
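The routing described above can be sketched as a registry that maps a document category to its handler class. This is a minimal illustration of the dispatcher pattern, not the handbook's exact API; the handler names, categories, and simplified cleaning rules are assumptions for demonstration.

```python
class ArticleCleaner:
    """Hypothetical handler for article-style documents."""

    def clean(self, text: str) -> str:
        # Real article cleaning would strip HTML and boilerplate;
        # here we only collapse whitespace for brevity.
        return " ".join(text.split())


class PostCleaner:
    """Hypothetical handler for social media posts."""

    def clean(self, text: str) -> str:
        # Drop hashtags and mentions (simplified post-specific rule).
        return " ".join(w for w in text.split() if not w.startswith(("#", "@")))


class CleaningDispatcher:
    # The factory dictionary: adding a new category means registering
    # a new handler here, which is the extensibility point noted above.
    _handlers = {"article": ArticleCleaner, "post": PostCleaner}

    @classmethod
    def dispatch(cls, category: str, text: str) -> str:
        handler_cls = cls._handlers.get(category)
        if handler_cls is None:
            raise ValueError(f"No cleaning handler registered for {category!r}")
        return handler_cls().clean(text)
```

Each handler stays small and focused on one document type, and the dispatcher itself contains no cleaning logic, only routing.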

Common Cleaning Operations

Typical operations in a document cleaning pipeline include:

  • HTML stripping — Removing HTML tags and entities while preserving text content
  • Whitespace normalization — Collapsing multiple spaces, tabs, and newlines into single spaces or standardized line breaks
  • Special character removal — Stripping non-printable characters, control characters, and encoding artifacts
  • Unicode normalization — Converting text to a consistent Unicode normalization form (e.g., NFC or NFKC)
  • Content extraction — Removing boilerplate content such as headers, footers, navigation menus, and advertisements
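The operations above can be composed into a single pipeline function using only the Python standard library. This is an illustrative sketch, not the handbook's implementation; the regex-based tag stripping in particular is a simplification (production code would typically use an HTML parser).

```python
import html
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Apply the cleaning operations in sequence (illustrative only)."""
    # HTML stripping: remove tags, keep the text between them.
    text = re.sub(r"<[^>]+>", " ", raw)
    # Decode HTML entities such as &amp; into their characters.
    text = html.unescape(text)
    # Unicode normalization: NFKC folds compatibility characters
    # (e.g. the ligature "ﬁ" becomes "fi").
    text = unicodedata.normalize("NFKC", text)
    # Special character removal: drop non-printable/control characters.
    text = "".join(ch for ch in text if ch.isprintable() or ch.isspace())
    # Whitespace normalization: collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()
```

Order matters: entities must be decoded after tags are stripped (so `&lt;` does not become a spurious tag), and whitespace is collapsed last so that gaps left by removed markup are merged away.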

Why Cleaning Matters

The quality of cleaned documents has a cascading effect on the entire feature engineering pipeline:

  1. Chunking quality — Noisy text produces poor chunk boundaries, leading to semantically incoherent segments
  2. Embedding quality — Embeddings encode noise alongside signal, reducing retrieval accuracy in RAG systems
  3. Fine-tuning quality — Training data with artifacts teaches the model to generate noisy outputs
  4. Token efficiency — Noise consumes tokens in the context window without contributing useful information

Design Considerations

  • Immutability — Cleaning produces new document objects (e.g., CleanedArticleDocument) rather than mutating the originals, preserving the raw data for debugging and reprocessing.
  • Type transformation — The cleaning step transforms NoSQLBaseDocument instances (raw) into VectorBaseDocument instances (cleaned), marking a transition from the document store domain to the vector store domain.
  • Extensibility — New document categories can be supported by adding a new cleaning handler and registering it in the dispatcher's factory dictionary.
  • Idempotency — Cleaning operations should be idempotent: applying them multiple times should produce the same result as applying them once.
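The idempotency property is directly testable: apply a cleaner twice and check that the second pass is a no-op. A sketch with a simple whitespace normalizer (an assumed example operation, not the handbook's code):

```python
import re


def normalize_whitespace(text: str) -> str:
    # Collapsing whitespace runs is naturally idempotent: once collapsed,
    # there are no runs left to collapse.
    return re.sub(r"\s+", " ", text).strip()


raw = "  Hello\t\tworld \n twice  "
once = normalize_whitespace(raw)
twice = normalize_whitespace(once)
assert once == twice  # cleaning an already-clean document changes nothing
```

Checks like this are cheap to add as unit tests for each handler and guard against operations that oscillate or accumulate changes across reprocessing runs.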

Usage

Use the Document Cleaning pattern when:

  • Raw crawled documents need normalization before chunking, embedding, or dataset generation
  • Different document types require different cleaning strategies
  • You need to preserve raw documents while creating cleaned derivatives
  • Building a preprocessing pipeline where data quality directly impacts model performance

Example

from llm_engineering.application.preprocessing.dispatchers import CleaningDispatcher
from llm_engineering.domain.documents import ArticleDocument

# Retrieve raw documents
raw_articles = ArticleDocument.bulk_find(author_id=author_uuid)

# Clean each document through the dispatcher
cleaned_documents = []
for article in raw_articles:
    cleaned = CleaningDispatcher.dispatch(article)
    cleaned_documents.append(cleaned)
