Implementation:NVIDIA NeMo Curator Code Filters

From Leeroopedia
Knowledge Sources
Domains: Code Quality, Data Curation, Filtering
Last Updated: 2026-02-14 00:00 GMT

Overview

Provides eight code-specific document filters for quality assessment of source code files, implementing heuristics from the StarCoder and C4 research papers.

Description

This module defines eight DocumentFilter subclasses, each targeting a specific aspect of source code quality:

  • PythonCommentToCodeFilter - Computes the comment-to-code ratio for Python files using AST-based extraction of docstrings and comments. Keeps documents where the ratio falls within configurable min/max bounds (default: 0.01 to 0.85).
  • GeneralCommentToCodeFilter - Computes comment-to-code ratio for non-Python languages using the comment_parser library with MIME-type-based language detection. Same threshold approach as the Python variant but does not count comment delimiters (//, /* */) toward comment length.
  • NumberOfLinesOfCodeFilter - Filters files based on line count, keeping files within configurable min/max bounds (default: 10 to 20000 lines).
  • TokenizerFertilityFilter - Checks the character-to-token ratio using a SentencePiece tokenizer. Files with a ratio below the threshold (default: 2.5) are filtered out, as they may indicate auto-generated or obfuscated code.
  • XMLHeaderFilter - Detects files with incorrect file extensions that are actually XML files, by checking for <?xml version= in the first 100 characters. Based on the StarCoder methodology.
  • AlphaFilter - Filters files with too few alphabetic characters (default threshold: 25%), catching files that contain large embedded tensors or tables stored as raw text. Also from the StarCoder paper.
  • HTMLBoilerplateFilter - Uses BeautifulSoup to detect HTML files that are largely boilerplate by computing the ratio of visible text (after removing script/style elements) to total source length.
  • PerExtensionFilter - Applies language- and file-extension-specific filtering thresholds loaded from a CSV configuration file. Checks line length statistics, alphanumeric character ratios, and alphabetic character fractions, with parameters varying per language/extension combination.
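
Three of the simpler heuristics above can be sketched in plain Python. This is an illustration only, not the module's code: the function names are hypothetical, and `tokenize` is a stand-in callable where the real filter uses a SentencePiece tokenizer.

```python
from typing import Callable


def alpha_ratio(source: str) -> float:
    """Fraction of characters that are alphabetic (0.0 for empty input)."""
    if not source:
        return 0.0
    return sum(ch.isalpha() for ch in source) / len(source)


def has_xml_header(source: str, prefix_len: int = 100) -> bool:
    """True if an XML declaration appears within the first prefix_len characters."""
    return "<?xml version=" in source[:prefix_len]


def char_to_token_ratio(source: str, tokenize: Callable[[str], list]) -> float:
    """Characters per token; `tokenize` stands in for a SentencePiece model."""
    tokens = tokenize(source)
    return len(source) / len(tokens) if tokens else 0.0


# A file that is mostly raw numbers has almost no alphabetic characters.
tensor_dump = "0.12 0.98 0.33 0.47 " * 50
print(alpha_ratio(tensor_dump) < 0.25)
print(has_xml_header('<?xml version="1.0"?>\n<root/>'))
# Whitespace splitting is only a placeholder for a real subword tokenizer.
print(char_to_token_ratio("def add(a, b): return a + b", str.split))
```

A low character-to-token ratio means the tokenizer needs many tokens to cover little text, which is the signal the fertility check relies on.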

Usage

Use these filters when curating source code datasets. They are typically composed into a filtering pipeline alongside the ScoreFilter stage. Each filter follows the DocumentFilter protocol: call score_document to compute a metric, then keep_document to make a keep/discard decision.
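
The score-then-decide pattern can be illustrated without the library. The sketch below assumes a hypothetical stand-in class, `LineCountSketch`, and a hypothetical helper, `passes_all`; neither is part of NeMo Curator, and the line-counting rule (newlines plus one) is an assumption.

```python
from typing import Protocol


class SupportsFilter(Protocol):
    """Hypothetical structural type mirroring the two-method pattern."""

    def score_document(self, source: str) -> float: ...
    def keep_document(self, score: float) -> bool: ...


class LineCountSketch:
    """Stand-in for a line-count filter; not the library class."""

    def __init__(self, min_lines: int = 10, max_lines: int = 20000) -> None:
        self.min_lines = min_lines
        self.max_lines = max_lines

    def score_document(self, source: str) -> int:
        # Assumed counting rule: number of newline characters plus one.
        return source.count("\n") + 1

    def keep_document(self, score: int) -> bool:
        return self.min_lines <= score <= self.max_lines


def passes_all(source: str, filters: list) -> bool:
    """Score each document, then decide; stops at the first rejection."""
    return all(f.keep_document(f.score_document(source)) for f in filters)


short_file = "x = 1\n" * 3
long_file = "\n".join(f"x{i} = {i}" for i in range(50))
print(passes_all(short_file, [LineCountSketch()]))
print(passes_all(long_file, [LineCountSketch()]))
```

Keeping scoring separate from the decision lets a pipeline store the raw score alongside each document and revisit thresholds later without recomputing the metric.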

Code Reference

Source Location

  • Repository: NeMo-Curator
  • File: nemo_curator/stages/text/filters/code.py
  • Lines: 1-298

Signature

class PythonCommentToCodeFilter(DocumentFilter):
    def __init__(self, min_comment_to_code_ratio: float = 0.01, max_comment_to_code_ratio: float = 0.85): ...
    def score_document(self, source: str) -> float: ...
    def keep_document(self, score: float) -> bool: ...

class GeneralCommentToCodeFilter(DocumentFilter):
    def __init__(self, language: str, min_comment_to_code_ratio: float = 0.01, max_comment_to_code_ratio: float = 0.85): ...
    def score_document(self, source: str) -> float: ...
    def keep_document(self, score: float) -> bool: ...

class NumberOfLinesOfCodeFilter(DocumentFilter):
    def __init__(self, min_lines: int = 10, max_lines: int = 20000): ...
    def score_document(self, source: str) -> int: ...
    def keep_document(self, score: int) -> bool: ...

class TokenizerFertilityFilter(DocumentFilter):
    def __init__(self, path_to_tokenizer: str | None = None, min_char_to_token_ratio: float = 2.5): ...
    def score_document(self, source: str) -> float: ...
    def keep_document(self, score: float) -> bool: ...

class XMLHeaderFilter(DocumentFilter):
    def __init__(self, char_prefix_search_length: int = 100): ...
    def score_document(self, source: str) -> float: ...
    def keep_document(self, score: float) -> bool: ...

class AlphaFilter(DocumentFilter):
    def __init__(self, min_alpha_ratio: float = 0.25): ...
    def score_document(self, source: str) -> float: ...
    def keep_document(self, score: float) -> bool: ...

class HTMLBoilerplateFilter(DocumentFilter):
    def __init__(self, min_lang_content_ratio: float = 0.2, min_lang_content_num_chars: int = 100): ...
    def score_document(self, source: str) -> float | None: ...
    def keep_document(self, score: float) -> bool: ...

class PerExtensionFilter(DocumentFilter):
    def __init__(self, lang: str, extension: str, metadata_file: str = "code_meta.csv"): ...
    def score_document(self, source: str) -> float: ...
    def keep_document(self, score: float | None) -> bool: ...

Import

from nemo_curator.stages.text.filters.code import (
    PythonCommentToCodeFilter,
    GeneralCommentToCodeFilter,
    NumberOfLinesOfCodeFilter,
    TokenizerFertilityFilter,
    XMLHeaderFilter,
    AlphaFilter,
    HTMLBoilerplateFilter,
    PerExtensionFilter,
)

I/O Contract

Inputs (All Filters)

Filter | Parameter | Type | Required | Description
PythonCommentToCodeFilter | min_comment_to_code_ratio | float | No | Minimum comment-to-code ratio (default: 0.01)
PythonCommentToCodeFilter | max_comment_to_code_ratio | float | No | Maximum comment-to-code ratio (default: 0.85)
GeneralCommentToCodeFilter | language | str | Yes | MIME type string for the programming language
GeneralCommentToCodeFilter | min_comment_to_code_ratio | float | No | Minimum comment-to-code ratio (default: 0.01)
GeneralCommentToCodeFilter | max_comment_to_code_ratio | float | No | Maximum comment-to-code ratio (default: 0.85)
NumberOfLinesOfCodeFilter | min_lines | int | No | Minimum line count (default: 10)
NumberOfLinesOfCodeFilter | max_lines | int | No | Maximum line count (default: 20000)
TokenizerFertilityFilter | path_to_tokenizer | str | Yes | Path to a SentencePiece tokenizer model (the signature default is None, so pass a path explicitly)
TokenizerFertilityFilter | min_char_to_token_ratio | float | No | Minimum character-to-token ratio (default: 2.5)
XMLHeaderFilter | char_prefix_search_length | int | No | Number of characters to inspect at start of file (default: 100)
AlphaFilter | min_alpha_ratio | float | No | Minimum ratio of alphabetic characters (default: 0.25)
HTMLBoilerplateFilter | min_lang_content_ratio | float | No | Minimum visible text ratio (default: 0.2)
HTMLBoilerplateFilter | min_lang_content_num_chars | int | No | Minimum visible text character count (default: 100)
PerExtensionFilter | lang | str | Yes | Programming language name
PerExtensionFilter | extension | str | Yes | File extension to filter
PerExtensionFilter | metadata_file | str | No | Path to CSV with filter thresholds (default: "code_meta.csv")

Outputs (All Filters)

Method | Return Type | Description
score_document | float, int, or None | Computed quality metric for the document (per the signatures above, HTMLBoilerplateFilter may return None)
keep_document | bool | True if the document passes the filter, False otherwise

Usage Examples

Python Comment Ratio Filter

from nemo_curator.stages.text.filters.code import PythonCommentToCodeFilter

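# Placeholder input; in practice this is the text of one Python file.
source_code = 'def add(a, b):\n    """Return a + b."""\n    return a + b  # sum\n'

# Named comment_filter to avoid shadowing Python's built-in filter().
comment_filter = PythonCommentToCodeFilter(
    min_comment_to_code_ratio=0.01,
    max_comment_to_code_ratio=0.85,
)

score = comment_filter.score_document(source_code)  # comment-to-code ratio
keep = comment_filter.keep_document(score)  # True if the ratio is within bounds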

Line Count Filter

from nemo_curator.stages.text.filters.code import NumberOfLinesOfCodeFilter

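# Named line_filter to avoid shadowing Python's built-in filter().
line_filter = NumberOfLinesOfCodeFilter(min_lines=10, max_lines=20000)
source_code = "\n".join(f"x{i} = {i}" for i in range(50))  # placeholder input
score = line_filter.score_document(source_code)
keep = line_filter.keep_document(score)  # True if 10 <= lines <= 20000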

XML Header Detection

from nemo_curator.stages.text.filters.code import XMLHeaderFilter

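# Named xml_filter to avoid shadowing Python's built-in filter().
xml_filter = XMLHeaderFilter(char_prefix_search_length=100)
source_code = '<?xml version="1.0"?>\n<root/>'  # placeholder input
score = xml_filter.score_document(source_code)
keep = xml_filter.keep_document(score)  # False if <?xml version= appears in the first 100 characters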

Per-Extension Filtering

from nemo_curator.stages.text.filters.code import PerExtensionFilter

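# code_meta.csv is expected to hold the thresholds for this language/extension pair.
ext_filter = PerExtensionFilter(
    lang="Python",
    extension=".py",
    metadata_file="code_meta.csv",
)
source_code = "print('hello')\n"  # placeholder input
score = ext_filter.score_document(source_code)
keep = ext_filter.keep_document(score)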

Filter Details

Comment-to-Code Ratio Filters

Both PythonCommentToCodeFilter and GeneralCommentToCodeFilter compute the ratio of comment text to total source length. The Python variant uses get_comments_and_docstring (AST-based), while the general variant uses the comment_parser library with MIME types. Both return 0 when no comments are found, which falls below the default minimum of 0.01. The GeneralCommentToCodeFilter returns 9999 on tokenization errors to signal an anomaly; because that value is far above any sensible maximum ratio, such files fail the upper bound.
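
A rough approximation of the Python ratio can be built from the standard library alone. The sketch below is not get_comments_and_docstring: it is a stand-in that collects `#` comments with `tokenize` and docstrings with `ast`, and the function name `comment_to_code_ratio` is hypothetical. The library's exact counting rules may differ.

```python
import ast
import io
import tokenize


def comment_to_code_ratio(source: str) -> float:
    """Characters in comments and docstrings divided by total source length."""
    if not source:
        return 0.0
    comment_chars = 0
    # '#' comments via the tokenizer (the leading '#' is part of tok.string).
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            comment_chars += len(tok.string)
    # Docstrings of the module, classes, and functions via the AST.
    doc_owners = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, doc_owners):
            doc = ast.get_docstring(node, clean=False)
            if doc:
                comment_chars += len(doc)
    return comment_chars / len(source)


sample = 'def add(a, b):\n    """Return a + b."""\n    return a + b  # sum\n'
ratio = comment_to_code_ratio(sample)
print(0.01 <= ratio <= 0.85)
```

This sketch raises on source that does not parse; a production filter would need to decide how to score such files.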

PerExtensionFilter CSV Format

The PerExtensionFilter loads filter parameters from a CSV file with columns:

  • language - Programming language name
  • extension - File extension
  • Include - "1" to include, "0" to exclude
  • Long_line_threshold - Maximum line length (optional)
  • Alphanum_threshold - Minimum alphanumeric character fraction (optional)
  • Alpha filter - Minimum alphabetic character fraction (optional)
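
To make the column layout concrete, the following sketch reads a file with these headers using the standard csv module. The rows and threshold values are made up for illustration, the extension format in the file is an assumption, and `load_thresholds` is a hypothetical helper rather than the filter's own loading code.

```python
import csv
import io
from typing import Optional

# Made-up rows for illustration; real thresholds come from the metadata file.
SAMPLE_CSV = """language,extension,Include,Long_line_threshold,Alphanum_threshold,Alpha filter
Python,.py,1,1000,0.25,0.25
JSON,.json,1,,0.5,
Batchfile,.bat,0,,,
"""

OPTIONAL_COLUMNS = ("Long_line_threshold", "Alphanum_threshold", "Alpha filter")


def load_thresholds(csv_text: str, lang: str, extension: str) -> Optional[dict]:
    """Return the thresholds set for a language/extension pair.

    Returns None when the pair is missing or marked as excluded.
    """
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["language"] == lang and row["extension"] == extension:
            if row["Include"] != "1":
                return None
            # Empty cells mean the optional threshold is not applied.
            return {k: float(row[k]) for k in OPTIONAL_COLUMNS if row[k]}
    return None


print(load_thresholds(SAMPLE_CSV, "Python", ".py"))
print(load_thresholds(SAMPLE_CSV, "Batchfile", ".bat"))
```

Treating empty cells as "no threshold" keeps the optional columns optional: a language can be checked on alphanumeric ratio alone without supplying a line-length limit.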
