Implementation:NVIDIA NeMo Curator Download Utils

Knowledge Sources	NVIDIA NeMo Curator
Domains	Text Processing, Language Detection, Encoding, Utilities
Last Updated	2026-02-14 00:00 GMT

Overview

The download utilities module provides helper functions for text processing during the download and extraction phase, including language detection via pycld2, HTML byte decoding, and control character removal.

Description

This module contains five utility functions that support the download extractors (especially Common Crawl) by handling two core concerns: detecting the language of text content and decoding raw HTML bytes from various character encodings into clean Python strings.

Language Detection Functions:

remove_control_characters(text) - Removes characters in Unicode general category "C" (Other) from text. These are mostly non-printable characters, such as control and format characters, that can interfere with language detection.
detect_language(text) - Low-level wrapper around pycld2.detect() that returns the full detection result tuple including reliability flag, byte count, and up to three detected languages with confidence scores.
lang_detect(text) - High-level language detection function. Calls detect_language and returns the most likely language name in uppercase (e.g., "ENGLISH"). If detection fails on the original text, it falls back to removing control characters and retrying.
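
The stripping and retry behavior described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the module's source: strip_category_c and detect_with_retry are hypothetical names, and the detector argument stands in for a pycld2-backed function that is assumed to raise an exception on input it cannot process.

```python
import unicodedata
from typing import Callable


def strip_category_c(text: str) -> str:
    """Drop every character whose Unicode general category starts with "C"."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("C")
    )


def detect_with_retry(text: str, detector: Callable[[str], str]) -> str:
    """Run detector on the raw text; if it raises, retry on the cleaned text."""
    try:
        return detector(text)
    except Exception:
        # Assumption: detection failure surfaces as an exception.
        return detector(strip_category_c(text))
```

Note that category "C" includes the Cc subcategory, so a filter of this shape also removes tabs and newlines, not only exotic code points.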

HTML Decoding Functions:

decode_html(html_bytes) - Attempts to decode raw HTML bytes using UTF-8. If UTF-8 decoding fails, falls back to encoding detection.
try_decode_with_detected_encoding(html_bytes) - Uses charset_normalizer to detect the encoding of the bytes, then attempts to decode with the detected encoding. Returns None if detection fails or if the detected encoding is UTF-8 (since UTF-8 decoding has already failed by the time this function is called).
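
The two-step decoding flow described above can be sketched as follows. This is a hedged illustration rather than the module's implementation: decode_with_fallback is a hypothetical name, and detect_encoding is a caller-supplied stand-in for the charset_normalizer-based detection, so the sketch runs without third-party packages.

```python
from typing import Callable, Optional


def decode_with_fallback(
    html_bytes: bytes,
    detect_encoding: Callable[[bytes], Optional[str]],
) -> Optional[str]:
    """Try UTF-8 first, then a detected encoding; return None if both fail."""
    try:
        return html_bytes.decode("utf-8")
    except UnicodeDecodeError:
        pass

    encoding = detect_encoding(html_bytes)
    # UTF-8 has already failed, so a UTF-8 guess is treated like no guess.
    if encoding is None or encoding.lower().replace("_", "-") in ("utf-8", "utf8"):
        return None
    try:
        return html_bytes.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Covers both undecodable bytes and an unknown codec name.
        return None
```

Passing the detector in as a callable keeps the fallback logic testable in isolation; for example, decode_with_fallback(b"Caf\xe9", lambda b: "latin-1") yields the decoded string, while a detector that returns None or "utf-8" yields None.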

Usage

Use these functions when processing raw web content that needs language detection or character encoding handling. They are primarily used by CommonCrawlHTMLExtractor but can be used independently for any HTML processing workflow.

Code Reference

Source Location

Repository: NeMo-Curator
File: nemo_curator/stages/text/download/utils.py
Lines: 1-79

Signature

def remove_control_characters(text: str) -> str: ...

def detect_language(text: str) -> tuple[bool, int, list[tuple[str, str, float, int]]]: ...

def lang_detect(text: str) -> str: ...

def decode_html(html_bytes: bytes) -> str | None: ...

def try_decode_with_detected_encoding(html_bytes: bytes) -> str | None: ...

Import

from nemo_curator.stages.text.download.utils import (
    remove_control_characters,
    detect_language,
    lang_detect,
    decode_html,
    try_decode_with_detected_encoding,
)

I/O Contract

remove_control_characters

Name	Type	Required	Description
text	str	Yes	Input text potentially containing Unicode control characters

Returns: str - Text with all Unicode control characters (category "C") removed.

detect_language

Name	Type	Required	Description
text	str	Yes	Text to detect language from

Returns: tuple[bool, int, list[tuple[str, str, float, int]]] containing:

Field	Type	Description
is_reliable	bool	True if the detection is high confidence
textBytesFound	int	The number of bytes of text found
details	list[tuple]	Up to three detected languages, each as (language_name, language_code, percent, score)
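
To make the shape of this tuple concrete, the sketch below unpacks a result into named fields. The names LanguageGuess and top_guess are hypothetical helpers, and the sample tuple is hand-written to match the documented shape; it is not real detector output.

```python
from typing import NamedTuple


class LanguageGuess(NamedTuple):
    """One entry of the details list, in the documented field order."""
    name: str
    code: str
    percent: float
    score: int


def top_guess(result: tuple) -> tuple[bool, LanguageGuess]:
    """Return the reliability flag and the first (most likely) language entry."""
    is_reliable, _bytes_found, details = result
    return is_reliable, LanguageGuess(*details[0])


# Hand-written sample in the documented shape (illustrative values only).
sample = (True, 17, [("FRENCH", "fr", 95.0, 1024), ("Unknown", "un", 0.0, 0)])
reliable, best = top_guess(sample)
print(reliable, best.name, best.code)
```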

lang_detect

Name	Type	Required	Description
text	str	Yes	Text to detect language from

Returns: str - The most likely language name in uppercase (e.g., "ENGLISH", "FRENCH").

decode_html

Name	Type	Required	Description
html_bytes	bytes	Yes	Raw HTML content as bytes

Returns: str | None - Decoded HTML string, or None if decoding fails with all attempted encodings.

try_decode_with_detected_encoding

Name	Type	Required	Description
html_bytes	bytes	Yes	Raw HTML content as bytes that failed UTF-8 decoding

Returns: str | None - Decoded HTML string using the detected encoding, or None if detection fails or the detected encoding is UTF-8.

Usage Examples

Language Detection

from nemo_curator.stages.text.download.utils import lang_detect

language = lang_detect("This is an English text about natural language processing.")
print(language)  # Expected output: ENGLISH

HTML Decoding

from nemo_curator.stages.text.download.utils import decode_html

# UTF-8 encoded bytes
html = decode_html(b"<html><body>Hello World</body></html>")
print(html)  # Output: <html><body>Hello World</body></html>

# Non-UTF-8 bytes fall back to encoding detection; the result may be None
latin1_bytes = "Caf\xe9".encode("latin-1")
html = decode_html(latin1_bytes)
if html is None:
    print("Could not decode input")

Full Detection Pipeline

from nemo_curator.stages.text.download.utils import detect_language

is_reliable, bytes_found, details = detect_language("Bonjour le monde")
print(f"Reliable: {is_reliable}")
print(f"Top language: {details[0][0]}")  # e.g., FRENCH
print(f"Percent of text: {details[0][2]}%")  # third field is percent, not a confidence score

Related Pages

Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
NVIDIA_NeMo_Curator_CommonCrawl_Extractor - Primary consumer of these utilities for language detection and HTML decoding
