Implementation:NVIDIA NeMo Curator Download Utils
| Knowledge Sources | |
|---|---|
| Domains | Text Processing, Language Detection, Encoding, Utilities |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
The download utilities module provides helper functions for text processing during the download and extraction phase, including language detection via pycld2, HTML byte decoding, and control character removal.
Description
This module contains five utility functions that support the download extractors (especially Common Crawl) by handling two core concerns: detecting the language of text content and decoding raw HTML bytes from various character encodings into clean Python strings.
Language Detection Functions:
- remove_control_characters(text) - Removes Unicode control characters (category "C") from text. These are non-printable characters that can interfere with language detection.
- detect_language(text) - Low-level wrapper around pycld2.detect() that returns the full detection result tuple, including a reliability flag, a byte count, and up to three detected languages with confidence scores.
- lang_detect(text) - High-level language detection function. Calls detect_language and returns the most likely language name in uppercase (e.g., "ENGLISH"). If detection fails on the original text, it removes control characters and retries.
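The control-character fallback described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the module's actual code: strip_control_chars and detect_with_retry are hypothetical names, and the detect argument stands in for any detector that raises on input it cannot handle.

```python
import unicodedata
from typing import Callable


def strip_control_chars(text: str) -> str:
    """Drop every character whose Unicode general category starts with "C"."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("C")
    )


def detect_with_retry(text: str, detect: Callable[[str], str]) -> str:
    """Try the detector on the raw text, then once more on cleaned text.

    `detect` is a hypothetical stand-in for a language detector. Catching
    Exception keeps the sketch generic; real code would catch only the
    detector's own error type.
    """
    try:
        return detect(text)
    except Exception:
        return detect(strip_control_chars(text))
```

Category "C" includes newlines and tabs (both are Cc), so stripping it also flattens line structure. That is harmless when the cleaned text is used only for detection, but it matters if the cleaned text is kept.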
HTML Decoding Functions:
- decode_html(html_bytes) - Attempts to decode raw HTML bytes using UTF-8. If UTF-8 decoding fails, falls back to encoding detection.
- try_decode_with_detected_encoding(html_bytes) - Uses charset_normalizer to detect the encoding of the bytes, then attempts to decode with the detected encoding. Returns None if detection fails or returns UTF-8 (since UTF-8 already failed).
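The UTF-8-first, detect-second flow can be sketched as follows. This is an illustration under stated assumptions rather than the module's implementation: decode_with_fallback is a hypothetical name, and guess_encoding is an injected callable standing in for an encoding detector such as charset_normalizer, so the sketch runs without third-party packages.

```python
import codecs
from typing import Callable, Optional


def decode_with_fallback(
    html_bytes: bytes,
    guess_encoding: Callable[[bytes], Optional[str]],
) -> Optional[str]:
    """Decode as UTF-8, else with a guessed encoding, else return None."""
    try:
        return html_bytes.decode("utf-8")
    except UnicodeDecodeError:
        pass  # Not valid UTF-8; fall through to detection.

    encoding = guess_encoding(html_bytes)  # hypothetical detector hook
    if encoding is None:
        return None
    try:
        # Normalize aliases such as "utf_8" or "UTF8" to one codec name.
        codec_name = codecs.lookup(encoding).name
    except LookupError:
        return None  # Detector named an encoding Python does not know.
    if codec_name == "utf-8":
        return None  # UTF-8 already failed above; retrying is pointless.
    try:
        return html_bytes.decode(encoding)
    except UnicodeDecodeError:
        return None
```

Returning None instead of raising lets a caller skip an undecodable record and continue with the next one.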
Usage
Use these functions when processing raw web content that needs language detection or character encoding handling. They are primarily used by CommonCrawlHTMLExtractor but can be used independently for any HTML processing workflow.
Code Reference
Source Location
- Repository: NeMo-Curator
- File: nemo_curator/stages/text/download/utils.py
- Lines: 1-79
Signature
def remove_control_characters(text: str) -> str: ...
def detect_language(text: str) -> tuple[bool, int, list[tuple[str, str, float, int]]]: ...
def lang_detect(text: str) -> str: ...
def decode_html(html_bytes: bytes) -> str | None: ...
def try_decode_with_detected_encoding(html_bytes: bytes) -> str | None: ...
Import
from nemo_curator.stages.text.download.utils import (
    remove_control_characters,
    detect_language,
    lang_detect,
    decode_html,
    try_decode_with_detected_encoding,
)
I/O Contract
remove_control_characters
| Name | Type | Required | Description |
|---|---|---|---|
| text | str | Yes | Input text potentially containing Unicode control characters |
Returns: str - Text with all Unicode control characters (category "C") removed.
detect_language
| Name | Type | Required | Description |
|---|---|---|---|
| text | str | Yes | Text to detect language from |
Returns: tuple[bool, int, list[tuple[str, str, float, int]]] containing:
| Field | Type | Description |
|---|---|---|
| is_reliable | bool | True if the detection is high confidence |
| textBytesFound | int | The number of bytes of text found |
| details | list[tuple] | Up to three detected languages, each as (language_name, language_code, percent, score) |
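To make the tuple layout concrete, the sketch below unpacks a result of this shape and picks the top language. It makes several assumptions: top_language and the type aliases are hypothetical helpers, not part of the module, and the sample tuple is hand-written for illustration rather than captured from a real detection run.

```python
from typing import Optional, Sequence, Tuple

# One entry of `details`: (language_name, language_code, percent, score)
LanguageEntry = Tuple[str, str, float, float]
DetectionResult = Tuple[bool, int, Sequence[LanguageEntry]]


def top_language(
    result: DetectionResult, require_reliable: bool = True
) -> Optional[str]:
    """Return the first language name, or None if it should not be trusted."""
    is_reliable, _bytes_found, details = result
    if not details or (require_reliable and not is_reliable):
        return None
    return details[0][0]


# Hand-written sample in the documented shape (illustrative values only).
sample = (True, 58, [("ENGLISH", "en", 98, 1100), ("Unknown", "un", 0, 0)])
print(top_language(sample))
```

Checking is_reliable before trusting details[0] is a design choice of this sketch. Short or mixed-language inputs are the typical cases where a caller may prefer None over a low-confidence guess.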
lang_detect
| Name | Type | Required | Description |
|---|---|---|---|
| text | str | Yes | Text to detect language from |
Returns: str - The most likely language name in uppercase (e.g., "ENGLISH", "FRENCH").
decode_html
| Name | Type | Required | Description |
|---|---|---|---|
| html_bytes | bytes | Yes | Raw HTML content as bytes |
Returns: str | None - Decoded HTML string, or None if decoding fails with all attempted encodings.
try_decode_with_detected_encoding
| Name | Type | Required | Description |
|---|---|---|---|
| html_bytes | bytes | Yes | Raw HTML content as bytes that failed UTF-8 decoding |
Returns: str | None - HTML string decoded with the detected encoding, or None if detection fails or the detected encoding is UTF-8.
Usage Examples
Language Detection
from nemo_curator.stages.text.download.utils import lang_detect
language = lang_detect("This is an English text about natural language processing.")
print(language) # Output: "ENGLISH"
HTML Decoding
from nemo_curator.stages.text.download.utils import decode_html
# UTF-8 encoded bytes
html = decode_html(b"<html><body>Hello World</body></html>")
print(html) # Output: "<html><body>Hello World</body></html>"
# Non-UTF-8 bytes fall back to encoding detection; the result is None if no usable encoding is found
latin1_bytes = "Caf\xe9".encode("latin-1")
html = decode_html(latin1_bytes)
Full Detection Pipeline
from nemo_curator.stages.text.download.utils import detect_language
is_reliable, bytes_found, details = detect_language("Bonjour le monde")
print(f"Reliable: {is_reliable}")
print(f"Top language: {details[0][0]}")  # e.g. "FRENCH"; very short inputs may not be detected reliably
print(f"Percent of text: {details[0][2]}%")
Related Pages
- Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
- NVIDIA_NeMo_Curator_CommonCrawl_Extractor - Primary consumer of these utilities for language detection and HTML decoding