Implementation:NVIDIA NeMo Curator Code Filters

Knowledge Sources	NVIDIA NeMo Curator
Domains	Code Quality, Data Curation, Filtering
Last Updated	2026-02-14 00:00 GMT

Overview

Provides eight code-specific document filters for quality assessment of source code files, implementing heuristics from the StarCoder and C4 research papers.

Description

This module defines eight DocumentFilter subclasses, each targeting a specific aspect of source code quality:

PythonCommentToCodeFilter - Computes the comment-to-code ratio for Python files using AST-based extraction of docstrings and comments. Keeps documents where the ratio falls within configurable min/max bounds (default: 0.01 to 0.85).

GeneralCommentToCodeFilter - Computes the comment-to-code ratio for non-Python languages using the comment_parser library with MIME-type-based language detection. It applies the same min/max thresholds as the Python variant, but comment delimiters (//, /* */) do not count toward comment length.

NumberOfLinesOfCodeFilter - Filters files based on line count, keeping files within configurable min/max bounds (default: 10 to 20000 lines).

TokenizerFertilityFilter - Checks the character-to-token ratio using a SentencePiece tokenizer. Files with a ratio below the threshold (default: 2.5) are filtered out, since a low ratio may indicate auto-generated or obfuscated code.

XMLHeaderFilter - Detects files with incorrect file extensions that are actually XML files, by checking for <?xml version= in the first 100 characters. Based on the StarCoder methodology.

AlphaFilter - Filters files with too few alphabetic characters (default threshold: 25%), catching files that contain large embedded tensors or tables stored as raw text. Also from the StarCoder paper.

HTMLBoilerplateFilter - Uses BeautifulSoup to detect HTML files that are largely boilerplate by computing the ratio of visible text (after removing script/style elements) to total source length.

PerExtensionFilter - Applies language- and file-extension-specific filtering thresholds loaded from a CSV configuration file. Checks line length statistics, alphanumeric character ratios, and alphabetic character fractions, with parameters varying per language/extension combination.
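
Three of the simpler heuristics above can be sketched in plain Python. The functions below are illustrative stand-ins written for this page, not the NeMo Curator implementation; their names and the exact way lines and characters are counted are assumptions.

```python
def alpha_ratio(source: str) -> float:
    """Fraction of characters that are alphabetic (0.0 for empty input)."""
    if not source:
        return 0.0
    return sum(ch.isalpha() for ch in source) / len(source)


def has_xml_header(source: str, prefix_length: int = 100) -> bool:
    """True if an XML declaration appears within the first prefix_length characters."""
    return "<?xml version=" in source[:prefix_length]


def line_count(source: str) -> int:
    """Number of lines in the source text."""
    return len(source.splitlines())


# A file that is mostly digits, similar to a tensor dumped as raw text.
tensor_dump = "0.12 0.98 0.33 0.71\n" * 50
python_snippet = "def add(a, b):\n    return a + b\n"

print(alpha_ratio(tensor_dump) < 0.25)      # would fail a 25% alphabetic threshold
print(alpha_ratio(python_snippet) >= 0.25)  # would pass it
print(has_xml_header('<?xml version="1.0"?>\n<root/>'))
print(line_count(python_snippet))
```

Each function returns a raw metric; the keep/discard decision is a separate comparison against a threshold, which matches the score-then-decide split used by the filters.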

Usage

Use these filters when curating source code datasets. They are typically composed into a filtering pipeline through the ScoreFilter stage, which applies DocumentFilter instances to document batches. Each filter follows the DocumentFilter protocol: call score_document to compute a metric, then pass that score to keep_document to make a keep/discard decision.
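
The two-step protocol can be illustrated without installing NeMo Curator. The class below is a hypothetical stand-in that mirrors only the score_document / keep_document method pair described above; it does not subclass the real DocumentFilter, and the helper apply_filters is an assumption made for this sketch rather than part of the library.

```python
from typing import List, Protocol


class ScoringFilter(Protocol):
    """Hypothetical structural type: anything with the two protocol methods."""

    def score_document(self, source: str) -> float: ...
    def keep_document(self, score: float) -> bool: ...


class LineCountSketch:
    """Stand-in filter that keeps files whose line count lies within bounds."""

    def __init__(self, min_lines: int = 10, max_lines: int = 20000) -> None:
        self._min_lines = min_lines
        self._max_lines = max_lines

    def score_document(self, source: str) -> int:
        # Step 1: compute the metric only; no decision is made here.
        return len(source.splitlines())

    def keep_document(self, score: int) -> bool:
        # Step 2: turn the metric into a keep/discard decision.
        return self._min_lines <= score <= self._max_lines


def apply_filters(source: str, filters: List[ScoringFilter]) -> bool:
    """Keep a document only if every filter accepts its own score."""
    return all(f.keep_document(f.score_document(source)) for f in filters)


short_file = "x = 1\n"
long_enough = "\n".join(f"value_{i} = {i}" for i in range(12))

print(apply_filters(short_file, [LineCountSketch()]))
print(apply_filters(long_enough, [LineCountSketch()]))
```

Separating the score from the decision lets a pipeline store the score as a column and revisit thresholds later without recomputing the metric.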

Code Reference

Source Location

Repository: NeMo-Curator
File: nemo_curator/stages/text/filters/code.py
Lines: 1-298

Signature

class PythonCommentToCodeFilter(DocumentFilter):
    def __init__(self, min_comment_to_code_ratio: float = 0.01, max_comment_to_code_ratio: float = 0.85): ...
    def score_document(self, source: str) -> float: ...
    def keep_document(self, score: float) -> bool: ...

class GeneralCommentToCodeFilter(DocumentFilter):
    def __init__(self, language: str, min_comment_to_code_ratio: float = 0.01, max_comment_to_code_ratio: float = 0.85): ...
    def score_document(self, source: str) -> float: ...
    def keep_document(self, score: float) -> bool: ...

class NumberOfLinesOfCodeFilter(DocumentFilter):
    def __init__(self, min_lines: int = 10, max_lines: int = 20000): ...
    def score_document(self, source: str) -> int: ...
    def keep_document(self, score: int) -> bool: ...

class TokenizerFertilityFilter(DocumentFilter):
    def __init__(self, path_to_tokenizer: str | None = None, min_char_to_token_ratio: float = 2.5): ...
    def score_document(self, source: str) -> float: ...
    def keep_document(self, score: float) -> bool: ...

class XMLHeaderFilter(DocumentFilter):
    def __init__(self, char_prefix_search_length: int = 100): ...
    def score_document(self, source: str) -> float: ...
    def keep_document(self, score: float) -> bool: ...

class AlphaFilter(DocumentFilter):
    def __init__(self, min_alpha_ratio: float = 0.25): ...
    def score_document(self, source: str) -> float: ...
    def keep_document(self, score: float) -> bool: ...

class HTMLBoilerplateFilter(DocumentFilter):
    def __init__(self, min_lang_content_ratio: float = 0.2, min_lang_content_num_chars: int = 100): ...
    def score_document(self, source: str) -> float | None: ...
    def keep_document(self, score: float) -> bool: ...

class PerExtensionFilter(DocumentFilter):
    def __init__(self, lang: str, extension: str, metadata_file: str = "code_meta.csv"): ...
    def score_document(self, source: str) -> float: ...
    def keep_document(self, score: float | None) -> bool: ...

Import

from nemo_curator.stages.text.filters.code import (
    PythonCommentToCodeFilter,
    GeneralCommentToCodeFilter,
    NumberOfLinesOfCodeFilter,
    TokenizerFertilityFilter,
    XMLHeaderFilter,
    AlphaFilter,
    HTMLBoilerplateFilter,
    PerExtensionFilter,
)

I/O Contract

Inputs (All Filters)

Filter	Parameter	Type	Required	Description
PythonCommentToCodeFilter	min_comment_to_code_ratio	float	No	Minimum comment-to-code ratio (default: 0.01)
PythonCommentToCodeFilter	max_comment_to_code_ratio	float	No	Maximum comment-to-code ratio (default: 0.85)
GeneralCommentToCodeFilter	language	str	Yes	MIME type string for the programming language
GeneralCommentToCodeFilter	min_comment_to_code_ratio	float	No	Minimum comment-to-code ratio (default: 0.01)
GeneralCommentToCodeFilter	max_comment_to_code_ratio	float	No	Maximum comment-to-code ratio (default: 0.85)
NumberOfLinesOfCodeFilter	min_lines	int	No	Minimum line count (default: 10)
NumberOfLinesOfCodeFilter	max_lines	int	No	Maximum line count (default: 20000)
TokenizerFertilityFilter	path_to_tokenizer	str | None	Yes	Path to a SentencePiece tokenizer model (the signature defaults to None, but a path must be supplied)
TokenizerFertilityFilter	min_char_to_token_ratio	float	No	Minimum character-to-token ratio (default: 2.5)
XMLHeaderFilter	char_prefix_search_length	int	No	Number of characters to inspect at start of file (default: 100)
AlphaFilter	min_alpha_ratio	float	No	Minimum ratio of alphabetic characters (default: 0.25)
HTMLBoilerplateFilter	min_lang_content_ratio	float	No	Minimum visible text ratio (default: 0.2)
HTMLBoilerplateFilter	min_lang_content_num_chars	int	No	Minimum visible text character count (default: 100)
PerExtensionFilter	lang	str	Yes	Programming language name
PerExtensionFilter	extension	str	Yes	File extension to filter
PerExtensionFilter	metadata_file	str	No	Path to CSV with filter thresholds (default: "code_meta.csv")

Outputs (All Filters)

Method	Return Type	Description
score_document	float, int, or None	Computed quality metric for the document; None appears only where the signature allows it (HTMLBoilerplateFilter)
keep_document	bool	True if the document passes the filter, False otherwise

Usage Examples

Python Comment Ratio Filter

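from nemo_curator.stages.text.filters.code import PythonCommentToCodeFilter

# Any Python source string works; a tiny sample is used here for illustration.
source_code = '"""Add two numbers."""\ndef add(a, b):\n    return a + b  # sum\n'

# Named comment_filter to avoid shadowing the built-in filter().
comment_filter = PythonCommentToCodeFilter(
    min_comment_to_code_ratio=0.01,
    max_comment_to_code_ratio=0.85,
)

score = comment_filter.score_document(source_code)  # comment-to-code ratio
keep = comment_filter.keep_document(score)  # keep/discard decision for that ratio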

Line Count Filter

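from nemo_curator.stages.text.filters.code import NumberOfLinesOfCodeFilter

# Sample input: 50 short assignment lines.
source_code = "\n".join(f"x{i} = {i}" for i in range(50))

line_filter = NumberOfLinesOfCodeFilter(min_lines=10, max_lines=20000)
score = line_filter.score_document(source_code)  # line count
keep = line_filter.keep_document(score)  # True if 10 <= lines <= 20000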

XML Header Detection

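from nemo_curator.stages.text.filters.code import XMLHeaderFilter

# Sample input: an XML document that might be stored under a non-XML extension.
source_code = '<?xml version="1.0"?>\n<root/>\n'

xml_filter = XMLHeaderFilter(char_prefix_search_length=100)
score = xml_filter.score_document(source_code)
keep = xml_filter.keep_document(score)  # False if <?xml version= appears in the first 100 characters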

Per-Extension Filtering

from nemo_curator.stages.text.filters.code import PerExtensionFilter

# source_code is the file contents as a str; code_meta.csv must exist at the given path.
ext_filter = PerExtensionFilter(
    lang="Python",
    extension=".py",
    metadata_file="code_meta.csv",
)
score = ext_filter.score_document(source_code)
keep = ext_filter.keep_document(score)

Filter Details

Comment-to-Code Ratio Filters

Both PythonCommentToCodeFilter and GeneralCommentToCodeFilter compute the ratio of comment text to total source length. The Python variant uses get_comments_and_docstring (AST-based), while the general variant uses the comment_parser library with MIME types. Both return 0 when no comments are found. GeneralCommentToCodeFilter returns 9999 on tokenization errors to signal an anomaly; this sentinel lies far above the default maximum of 0.85.
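
As a rough illustration of the ratio itself, the sketch below uses the standard-library tokenize module to collect # comments only. This is a swapped-in technique: it ignores docstrings and is neither the get_comments_and_docstring helper nor comment_parser. The function names, the stripping of the leading #, and the inclusive bounds check are all assumptions made for this example.

```python
import io
import tokenize


def comment_to_code_ratio(source: str) -> float:
    """Length of all '#' comment text divided by total source length."""
    if not source:
        return 0.0
    comment_chars = 0
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            # Drop the leading '#' so the delimiter does not count toward the ratio.
            comment_chars += len(tok.string.lstrip("#").strip())
    return comment_chars / len(source)


def within_bounds(ratio: float, low: float = 0.01, high: float = 0.85) -> bool:
    """Keep decision using the default bounds quoted in the Description."""
    return low <= ratio <= high


documented = "# add two numbers\ndef add(a, b):\n    return a + b\n"
undocumented = "def add(a, b):\n    return a + b\n"

print(within_bounds(comment_to_code_ratio(documented)))
print(within_bounds(comment_to_code_ratio(undocumented)))
print(within_bounds(9999))  # a sentinel of 9999 falls outside these bounds
```

Under these assumptions a file with no comments scores 0 and falls below the default minimum, while an error sentinel of 9999 falls above the default maximum, so both are discarded.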

PerExtensionFilter CSV Format

The PerExtensionFilter loads filter parameters from a CSV file with columns:

language - Programming language name
extension - File extension
Include - "1" to include, "0" to exclude
Long_line_threshold - Maximum line length (optional)
Alphanum_threshold - Minimum alphanumeric character fraction (optional)
Alpha filter - Minimum alphabetic character fraction (optional)
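
A minimal sketch of reading such a file with the standard-library csv module follows. The sample rows and threshold values are invented for illustration, the function name load_thresholds is hypothetical, and the real loader in PerExtensionFilter may parse or validate the columns differently.

```python
import csv
import io
from typing import Optional

# Invented sample data using the column names listed above.
SAMPLE_CSV = """language,extension,Include,Long_line_threshold,Alphanum_threshold,Alpha filter
Python,.py,1,1000,0.25,
JSON,.json,0,,,
"""


def _optional_float(cell: str) -> Optional[float]:
    """Treat an empty cell as 'this check is not configured'."""
    return float(cell) if cell else None


def load_thresholds(csv_text: str, lang: str, extension: str) -> Optional[dict]:
    """Return the parsed thresholds for one language/extension pair, or None."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["language"] == lang and row["extension"] == extension:
            return {
                "include": row["Include"] == "1",
                "long_line_threshold": _optional_float(row["Long_line_threshold"]),
                "alphanum_threshold": _optional_float(row["Alphanum_threshold"]),
                "alpha_filter": _optional_float(row["Alpha filter"]),
            }
    return None  # no row matches this language/extension pair


print(load_thresholds(SAMPLE_CSV, "Python", ".py"))
print(load_thresholds(SAMPLE_CSV, "JSON", ".json"))
print(load_thresholds(SAMPLE_CSV, "Rust", ".rs"))
```

Note that the last column name contains a space ("Alpha filter"), so it has to be looked up by its exact header string.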

Related Pages

NVIDIA_NeMo_Curator_DocumentFilter - Abstract base class that all these filters implement
NVIDIA_NeMo_Curator_ScoreFilter - Processing stage that applies DocumentFilter instances to document batches
Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
