Implementation:NVIDIA NeMo Curator Download Utils

From Leeroopedia
Knowledge Sources
Domains: Text Processing, Language Detection, Encoding, Utilities
Last Updated: 2026-02-14 00:00 GMT

Overview

The download utilities module provides helper functions for text processing during the download and extraction phase, including language detection via pycld2, HTML byte decoding, and control character removal.

Description

This module contains five utility functions that support the download extractors (especially Common Crawl) by handling two core concerns: detecting the language of text content and decoding raw HTML bytes from various character encodings into clean Python strings.

Language Detection Functions:

  • remove_control_characters(text) - Removes every character in Unicode general category "C" (control, format, and other non-printable code points) from text. These characters can interfere with language detection.
  • detect_language(text) - Low-level wrapper around pycld2.detect() that returns the full detection result tuple including reliability flag, byte count, and up to three detected languages with confidence scores.
  • lang_detect(text) - High-level language detection function. Calls detect_language and returns the most likely language name in uppercase (e.g., "ENGLISH"). If detection fails on the original text, it falls back to removing control characters and retrying.
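The clean-then-retry behavior described above can be sketched with the standard library alone. This is a minimal illustration, not the module's source: strip_category_c, detect_with_retry, and picky_detector are hypothetical names, and the detector is passed in as a callable so the sketch runs without pycld2 installed.

```python
import unicodedata
from typing import Callable


def strip_category_c(text: str) -> str:
    """Drop every character whose Unicode general category starts with "C"."""
    return "".join(ch for ch in text if unicodedata.category(ch)[0] != "C")


def detect_with_retry(text: str, detect: Callable[[str], str]) -> str:
    """Try detection on the raw text; on failure, clean the text and retry once."""
    try:
        name = detect(text)
    except Exception:
        # Assumption: the detector raises on input it cannot handle.
        name = detect(strip_category_c(text))
    return name.upper()


def picky_detector(text: str) -> str:
    """Hypothetical stand-in detector that rejects input containing NUL."""
    if "\x00" in text:
        raise ValueError("input contains a control character")
    return "English"


print(strip_category_c("a\x00b\u200bc"))  # abc
print(detect_with_retry("Hello\x00 world", picky_detector))  # ENGLISH
```

Category "C" covers tab and newline as well as NUL and zero-width characters, so text cleaned this way loses its line structure. That is acceptable for language detection but worth remembering if the cleaned string is reused elsewhere.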

HTML Decoding Functions:

  • decode_html(html_bytes) - Attempts to decode raw HTML bytes using UTF-8. If UTF-8 decoding fails, falls back to encoding detection.
  • try_decode_with_detected_encoding(html_bytes) - Uses charset_normalizer to detect the encoding of the bytes, then attempts to decode with the detected encoding. Returns None if no encoding is detected or if the detected encoding is UTF-8 (since UTF-8 decoding has already failed).
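The two-step decoding strategy above can be sketched as follows. This is an illustration under stated assumptions rather than the module's code: decode_with_fallback is a hypothetical name, and the encoding detector is injected as a callable so the example does not depend on charset_normalizer being installed.

```python
import codecs
from typing import Callable, Optional


def decode_with_fallback(
    html_bytes: bytes,
    detect_encoding: Callable[[bytes], Optional[str]],
) -> Optional[str]:
    """Decode as UTF-8 first; otherwise try whatever encoding the detector names."""
    try:
        return html_bytes.decode("utf-8")
    except UnicodeDecodeError:
        pass

    encoding = detect_encoding(html_bytes)
    if not encoding:
        return None  # the detector could not name an encoding
    try:
        codec_name = codecs.lookup(encoding).name
    except LookupError:
        return None  # the detector named a codec Python does not know
    if codec_name == "utf-8":
        return None  # UTF-8 already failed, so retrying it is pointless
    try:
        return html_bytes.decode(encoding)
    except UnicodeDecodeError:
        return None


raw = b"Caf\xe9"  # "Café" encoded as Latin-1, which is not valid UTF-8
print(decode_with_fallback(raw, lambda data: "latin-1"))  # Café
print(decode_with_fallback(raw, lambda data: "utf-8"))    # None
```

With charset_normalizer installed, a detector such as `lambda data: charset_normalizer.detect(data)["encoding"]` could be passed in. Whether the module uses that exact call is an assumption here.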

Usage

Use these functions when processing raw web content that needs language detection or character encoding handling. They are primarily used by CommonCrawlHTMLExtractor but can be used independently for any HTML processing workflow.
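As a rough sketch of such an independent workflow, the snippet below filters raw pages by language. The function filter_by_language and both stand-ins (strict_utf8, toy_detect) are hypothetical and exist only to keep the example runnable; in practice decode_html and lang_detect from this module would be passed in their place.

```python
from typing import Callable, Iterable, Iterator, Optional


def filter_by_language(
    pages: Iterable[bytes],
    decode: Callable[[bytes], Optional[str]],
    detect: Callable[[str], str],
    wanted: str = "ENGLISH",
) -> Iterator[str]:
    """Yield decoded pages whose detected language matches `wanted`."""
    for raw in pages:
        html = decode(raw)
        if html is None:
            continue  # undecodable page: skip it instead of failing the batch
        if detect(html) == wanted:
            yield html


def strict_utf8(raw: bytes) -> Optional[str]:
    """Stand-in decoder: UTF-8 only, None on failure."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


def toy_detect(text: str) -> str:
    """Stand-in detector: a keyword check, not real language detection."""
    return "FRENCH" if "bonjour" in text.lower() else "ENGLISH"


pages = [b"Hello world", b"Caf\xe9", b"Bonjour"]
print(list(filter_by_language(pages, strict_utf8, toy_detect)))  # ['Hello world']
```

Passing the decoder and detector as arguments keeps the filtering logic testable without pycld2 or charset_normalizer.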

Code Reference

Source Location

  • Repository: NeMo-Curator
  • File: nemo_curator/stages/text/download/utils.py
  • Lines: 1-79

Signature

def remove_control_characters(text: str) -> str: ...

def detect_language(text: str) -> tuple[bool, int, list[tuple[str, str, float, int]]]: ...

def lang_detect(text: str) -> str: ...

def decode_html(html_bytes: bytes) -> str | None: ...

def try_decode_with_detected_encoding(html_bytes: bytes) -> str | None: ...

Import

from nemo_curator.stages.text.download.utils import (
    remove_control_characters,
    detect_language,
    lang_detect,
    decode_html,
    try_decode_with_detected_encoding,
)

I/O Contract

remove_control_characters

Name | Type | Required | Description
text | str | Yes | Input text potentially containing Unicode control characters

Returns: str - Text with every character in Unicode general category "C" removed. This category includes tab and newline, so line breaks are stripped as well.

detect_language

Name | Type | Required | Description
text | str | Yes | Text to detect language from

Returns: tuple[bool, int, list[tuple[str, str, float, int]]] containing:

Field | Type | Description
is_reliable | bool | True if pycld2 considers the detection reliable
textBytesFound | int | The number of bytes of text found in the input
details | list[tuple] | Up to three detected languages, each as (language_name, language_code, percent, score)
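To make the shape of this return value concrete, the sketch below unpacks a hand-written sample tuple. The numbers are invented for illustration and are not real pycld2 output; LanguageGuess and top_guess are hypothetical helpers, not part of the module.

```python
from typing import NamedTuple, Optional


class LanguageGuess(NamedTuple):
    """One entry of the `details` list, with the field order documented above."""
    name: str
    code: str
    percent: float
    score: int


def top_guess(result: tuple) -> Optional[LanguageGuess]:
    """Return the first detected language, or None if the result is unreliable."""
    is_reliable, _bytes_found, details = result
    if not is_reliable or not details:
        return None
    return LanguageGuess(*details[0])


# Hand-written sample in the documented shape (values are made up).
sample = (
    True,
    17,
    [("FRENCH", "fr", 94, 1024), ("Unknown", "un", 0, 0), ("Unknown", "un", 0, 0)],
)

guess = top_guess(sample)
print(guess.name, guess.code)  # FRENCH fr
```

Checking is_reliable before reading details[0] is a design choice of this sketch; lang_detect, as described on this page, returns the top language name without exposing the reliability flag.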

lang_detect

Name | Type | Required | Description
text | str | Yes | Text to detect language from

Returns: str - The most likely language name in uppercase (e.g., "ENGLISH", "FRENCH").

decode_html

Name | Type | Required | Description
html_bytes | bytes | Yes | Raw HTML content as bytes

Returns: str | None - Decoded HTML string, or None if decoding fails with all attempted encodings.

try_decode_with_detected_encoding

Name | Type | Required | Description
html_bytes | bytes | Yes | Raw HTML content as bytes that failed UTF-8 decoding

Returns: str | None - HTML string decoded with the detected encoding, or None if no encoding is detected or the detected encoding is UTF-8.

Usage Examples

Language Detection

from nemo_curator.stages.text.download.utils import lang_detect

language = lang_detect("This is an English text about natural language processing.")
print(language)  # Output: ENGLISH

HTML Decoding

from nemo_curator.stages.text.download.utils import decode_html

# UTF-8 encoded bytes
html = decode_html(b"<html><body>Hello World</body></html>")
print(html)  # Output: <html><body>Hello World</body></html>

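# Non-UTF-8 bytes fall back to encoding detection. Detection on very short
# inputs is unreliable, so check for None before using the result.
latin1_bytes = "Caf\xe9".encode("latin-1")
html = decode_html(latin1_bytes)
if html is None:
    print("Could not decode input")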

Full Detection Result

from nemo_curator.stages.text.download.utils import detect_language

is_reliable, bytes_found, details = detect_language("Bonjour le monde")
print(f"Reliable: {is_reliable}")
print(f"Top language: {details[0][0]}")  # e.g. FRENCH; short inputs may not be detected reliably
print(f"Confidence: {details[0][2]}%")
