Implementation:NVIDIA NeMo Curator Code Filters
| Knowledge Sources | |
|---|---|
| Domains | Code Quality, Data Curation, Filtering |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Provides eight code-specific document filters for quality assessment of source code files, implementing heuristics from the StarCoder and C4 research papers.
Description
This module defines eight DocumentFilter subclasses, each targeting a specific aspect of source code quality:
- PythonCommentToCodeFilter - Computes the comment-to-code ratio for Python files using AST-based extraction of docstrings and comments. Keeps documents where the ratio falls within configurable min/max bounds (default: 0.01 to 0.85).
- GeneralCommentToCodeFilter - Computes the comment-to-code ratio for non-Python languages using the comment_parser library with MIME-type-based language detection. Uses the same threshold approach as the Python variant but does not count comment delimiters (//, /* */) toward comment length.
- NumberOfLinesOfCodeFilter - Filters files based on line count, keeping files within configurable min/max bounds (default: 10 to 20000 lines).
- TokenizerFertilityFilter - Checks the character-to-token ratio using a SentencePiece tokenizer. Files with a ratio below the threshold (default: 2.5) are filtered out, as they may indicate auto-generated or obfuscated code.
- XMLHeaderFilter - Detects files that are actually XML despite having a different file extension, by checking for <?xml version= in the first 100 characters. Based on the StarCoder methodology.
- AlphaFilter - Filters files with too few alphabetic characters (default threshold: 25%), catching files that contain large embedded tensors or tables stored as raw text. Also from the StarCoder paper.
- HTMLBoilerplateFilter - Uses BeautifulSoup to detect HTML files that are largely boilerplate by computing the ratio of visible text (after removing script/style elements) to total source length.
- PerExtensionFilter - Applies language- and file-extension-specific filtering thresholds loaded from a CSV configuration file. Checks line length statistics, alphanumeric character ratios, and alphabetic character fractions, with parameters varying per language/extension combination.
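Several of the heuristics listed above reduce to a few lines of string arithmetic. The following is a minimal standalone sketch of four of them (line count, XML header check, alphabetic ratio, character-to-token ratio); it is not the code from nemo_curator. The function names are hypothetical, the empty-input handling is an assumption, and the whitespace tokenizer stands in for a SentencePiece model.

```python
from typing import Callable, Sequence


def count_lines(source: str) -> int:
    """Number of lines, counted as newline characters plus one."""
    return source.count("\n") + 1


def has_xml_header(source: str, prefix_length: int = 100) -> bool:
    """True when an XML declaration appears within the first prefix_length characters."""
    return "<?xml version=" in source[:prefix_length]


def alpha_ratio(source: str) -> float:
    """Fraction of characters that are alphabetic (0.0 for empty input, an assumption)."""
    if not source:
        return 0.0
    return sum(ch.isalpha() for ch in source) / len(source)


def char_to_token_ratio(source: str, tokenize: Callable[[str], Sequence[str]]) -> float:
    """Characters per token for any tokenizer callable (0.0 if no tokens are produced)."""
    tokens = tokenize(source)
    if not tokens:
        return 0.0
    return len(source) / len(tokens)


sample = "import os\n\nprint(os.getcwd())\n"
tensor_dump = "0.1 0.2 0.3 0.4\n" * 50  # raw numbers stored as text

print(count_lines(sample))
print(has_xml_header('<?xml version="1.0"?>\n<root/>'))
print(alpha_ratio(tensor_dump) < 0.25)  # below the documented AlphaFilter default
print(char_to_token_ratio(sample, str.split))  # whitespace split as a toy tokenizer
```

Passing the tokenizer as a callable keeps the sketch free of external dependencies; with a real SentencePiece model the same ratio would be computed from the model's encoded pieces.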
Usage
Use these filters when curating source code datasets. They are typically composed into a filtering pipeline alongside the ScoreFilter stage. Each filter follows the DocumentFilter protocol: call score_document to compute a metric, then keep_document to make a keep/discard decision.
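The two-step contract described above can be illustrated without the library installed. In this sketch, ToyAlphaFilter and apply_filters are hypothetical stand-ins written for illustration; they are not the library's AlphaFilter or ScoreFilter, and only the method names score_document and keep_document are taken from this page.

```python
from typing import List, Protocol


class SupportsScoreAndKeep(Protocol):
    """Hypothetical structural type for the score-then-keep contract."""

    def score_document(self, source: str) -> float: ...

    def keep_document(self, score: float) -> bool: ...


class ToyAlphaFilter:
    """Stand-in filter: keeps documents with enough alphabetic characters."""

    def __init__(self, min_alpha_ratio: float = 0.25) -> None:
        self.min_alpha_ratio = min_alpha_ratio

    def score_document(self, source: str) -> float:
        if not source:
            return 0.0
        return sum(ch.isalpha() for ch in source) / len(source)

    def keep_document(self, score: float) -> bool:
        return score >= self.min_alpha_ratio


def apply_filters(documents: List[str], filters: List[SupportsScoreAndKeep]) -> List[str]:
    """Keep a document only if every filter accepts its own score."""
    kept = []
    for doc in documents:
        if all(f.keep_document(f.score_document(doc)) for f in filters):
            kept.append(doc)
    return kept


docs = [
    "def add(a, b):\n    return a + b\n",  # ordinary code
    "0 1 2 3 4 5 6 7 8 9\n" * 10,          # digits only
]
kept = apply_filters(docs, [ToyAlphaFilter()])
print(len(kept))
```

Separating scoring from the keep decision means the score can be stored alongside the document and the threshold revisited later without recomputing the metric.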
Code Reference
Source Location
- Repository: NeMo-Curator
- File: nemo_curator/stages/text/filters/code.py
- Lines: 1-298
Signature
class PythonCommentToCodeFilter(DocumentFilter):
    def __init__(self, min_comment_to_code_ratio: float = 0.01, max_comment_to_code_ratio: float = 0.85): ...
    def score_document(self, source: str) -> float: ...
    def keep_document(self, score: float) -> bool: ...

class GeneralCommentToCodeFilter(DocumentFilter):
    def __init__(self, language: str, min_comment_to_code_ratio: float = 0.01, max_comment_to_code_ratio: float = 0.85): ...
    def score_document(self, source: str) -> float: ...
    def keep_document(self, score: float) -> bool: ...

class NumberOfLinesOfCodeFilter(DocumentFilter):
    def __init__(self, min_lines: int = 10, max_lines: int = 20000): ...
    def score_document(self, source: str) -> int: ...
    def keep_document(self, score: int) -> bool: ...

class TokenizerFertilityFilter(DocumentFilter):
    def __init__(self, path_to_tokenizer: str | None = None, min_char_to_token_ratio: float = 2.5): ...
    def score_document(self, source: str) -> float: ...
    def keep_document(self, score: float) -> bool: ...

class XMLHeaderFilter(DocumentFilter):
    def __init__(self, char_prefix_search_length: int = 100): ...
    def score_document(self, source: str) -> float: ...
    def keep_document(self, score: float) -> bool: ...

class AlphaFilter(DocumentFilter):
    def __init__(self, min_alpha_ratio: float = 0.25): ...
    def score_document(self, source: str) -> float: ...
    def keep_document(self, score: float) -> bool: ...

class HTMLBoilerplateFilter(DocumentFilter):
    def __init__(self, min_lang_content_ratio: float = 0.2, min_lang_content_num_chars: int = 100): ...
    def score_document(self, source: str) -> float | None: ...
    def keep_document(self, score: float) -> bool: ...

class PerExtensionFilter(DocumentFilter):
    def __init__(self, lang: str, extension: str, metadata_file: str = "code_meta.csv"): ...
    def score_document(self, source: str) -> float: ...
    def keep_document(self, score: float | None) -> bool: ...
Import
from nemo_curator.stages.text.filters.code import (
    PythonCommentToCodeFilter,
    GeneralCommentToCodeFilter,
    NumberOfLinesOfCodeFilter,
    TokenizerFertilityFilter,
    XMLHeaderFilter,
    AlphaFilter,
    HTMLBoilerplateFilter,
    PerExtensionFilter,
)
I/O Contract
Inputs (All Filters)
| Filter | Parameter | Type | Required | Description |
|---|---|---|---|---|
| PythonCommentToCodeFilter | min_comment_to_code_ratio | float | No | Minimum comment-to-code ratio (default: 0.01) |
| PythonCommentToCodeFilter | max_comment_to_code_ratio | float | No | Maximum comment-to-code ratio (default: 0.85) |
| GeneralCommentToCodeFilter | language | str | Yes | MIME type string for the programming language |
| GeneralCommentToCodeFilter | min_comment_to_code_ratio | float | No | Minimum comment-to-code ratio (default: 0.01) |
| GeneralCommentToCodeFilter | max_comment_to_code_ratio | float | No | Maximum comment-to-code ratio (default: 0.85) |
| NumberOfLinesOfCodeFilter | min_lines | int | No | Minimum line count (default: 10) |
| NumberOfLinesOfCodeFilter | max_lines | int | No | Maximum line count (default: 20000) |
| TokenizerFertilityFilter | path_to_tokenizer | str | Yes | Path to a SentencePiece tokenizer model (the signature defaults to None, so pass a path explicitly) |
| TokenizerFertilityFilter | min_char_to_token_ratio | float | No | Minimum character-to-token ratio (default: 2.5) |
| XMLHeaderFilter | char_prefix_search_length | int | No | Number of characters to inspect at start of file (default: 100) |
| AlphaFilter | min_alpha_ratio | float | No | Minimum ratio of alphabetic characters (default: 0.25) |
| HTMLBoilerplateFilter | min_lang_content_ratio | float | No | Minimum visible text ratio (default: 0.2) |
| HTMLBoilerplateFilter | min_lang_content_num_chars | int | No | Minimum visible text character count (default: 100) |
| PerExtensionFilter | lang | str | Yes | Programming language name |
| PerExtensionFilter | extension | str | Yes | File extension to filter |
| PerExtensionFilter | metadata_file | str | No | Path to CSV with filter thresholds (default: "code_meta.csv") |
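The two HTMLBoilerplateFilter parameters in the table above describe a visible-text ratio and a minimum visible-text length. The sketch below shows one way such a check could work, using the standard library's html.parser in place of BeautifulSoup. The names visible_text and keep_html are hypothetical, and the way the two thresholds are combined here (both must hold) is an assumption, not something this page confirms.

```python
from html.parser import HTMLParser
from typing import List


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def visible_text(source: str) -> str:
    """Text a reader would see, with script and style contents removed."""
    parser = VisibleTextExtractor()
    parser.feed(source)
    parser.close()
    return "".join(parser.chunks)


def keep_html(source: str, min_ratio: float = 0.2, min_chars: int = 100) -> bool:
    """Assumed rule: enough visible text in absolute terms and relative to the source."""
    if not source:
        return False
    text = visible_text(source)
    return len(text) >= min_chars and len(text) / len(source) >= min_ratio


boilerplate = (
    "<html><head><style>p{}</style><script>var x=1;</script></head>"
    "<body><p>Hello</p></body></html>"
)
article = "<p>" + "word " * 40 + "</p>"
print(keep_html(boilerplate), keep_html(article))
```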
Outputs (All Filters)
| Method | Return Type | Description |
|---|---|---|
| score_document | float, int, or None | Computed quality metric for the document (int for NumberOfLinesOfCodeFilter; HTMLBoilerplateFilter is annotated as float or None) |
| keep_document | bool | True if the document passes the filter, False otherwise |
Usage Examples
Python Comment Ratio Filter
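from nemo_curator.stages.text.filters.code import PythonCommentToCodeFilter

# source_code is a str holding the contents of one Python file.
# The variable is not named "filter" so the Python builtin is not shadowed.
comment_filter = PythonCommentToCodeFilter(
    min_comment_to_code_ratio=0.01,
    max_comment_to_code_ratio=0.85,
)
score = comment_filter.score_document(source_code)
keep = comment_filter.keep_document(score)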
Line Count Filter
from nemo_curator.stages.text.filters.code import NumberOfLinesOfCodeFilter

# source_code is a str holding the contents of one source file.
loc_filter = NumberOfLinesOfCodeFilter(min_lines=10, max_lines=20000)
score = loc_filter.score_document(source_code)
keep = loc_filter.keep_document(score)  # True if the line count is within the 10 to 20000 bounds
XML Header Detection
from nemo_curator.stages.text.filters.code import XMLHeaderFilter

# source_code is a str holding the contents of one source file.
xml_filter = XMLHeaderFilter(char_prefix_search_length=100)
score = xml_filter.score_document(source_code)
keep = xml_filter.keep_document(score)  # False if <?xml version= appears in the first 100 characters
Per-Extension Filtering
from nemo_curator.stages.text.filters.code import PerExtensionFilter

# source_code is a str holding the contents of one source file.
ext_filter = PerExtensionFilter(
    lang="Python",
    extension=".py",
    metadata_file="code_meta.csv",
)
score = ext_filter.score_document(source_code)
keep = ext_filter.keep_document(score)
Filter Details
Comment-to-Code Ratio Filters
Both PythonCommentToCodeFilter and GeneralCommentToCodeFilter compute the ratio of comment text to total source length. The Python variant uses get_comments_and_docstring (AST-based), while the general variant uses the comment_parser library with MIME types. Both return 0 when no comments are found. GeneralCommentToCodeFilter returns the sentinel value 9999 on tokenization errors to signal an anomaly; this value lies far above the default maximum ratio of 0.85.
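To make the ratio concrete, here is a minimal standard-library sketch of a comment-to-code ratio for Python source. It is a stand-in for illustration, not the library's get_comments_and_docstring helper: the function names are hypothetical, comment length here includes the leading # character, the inclusive bounds check is an assumption, and syntax errors are left to propagate rather than mapped to a sentinel.

```python
import ast
import io
import tokenize


def comment_and_docstring_chars(source: str) -> int:
    """Total characters in # comments plus module, class, and function docstrings."""
    total = 0
    # Comments come from the tokenizer; the token text includes the leading '#'.
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            total += len(tok.string)
    # Docstrings come from the AST.
    documented = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, documented):
            doc = ast.get_docstring(node, clean=False)
            if doc:
                total += len(doc)
    return total


def comment_to_code_ratio(source: str) -> float:
    """Comment and docstring characters divided by total source length."""
    if not source:
        return 0.0
    return comment_and_docstring_chars(source) / len(source)


def within_bounds(ratio: float, min_ratio: float = 0.01, max_ratio: float = 0.85) -> bool:
    """Assumed inclusive check against the documented default bounds."""
    return min_ratio <= ratio <= max_ratio


example = '"""Doc."""\n# hi\nx = 1\n'
ratio = comment_to_code_ratio(example)
print(round(ratio, 3), within_bounds(ratio))
```

A file with no comments scores 0 and falls below the lower bound, while a file that is almost entirely comments exceeds the upper bound; both ends are discarded under this rule.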
PerExtensionFilter CSV Format
The PerExtensionFilter loads filter parameters from a CSV file with columns:
- language - Programming language name
- extension - File extension
- Include - "1" to include, "0" to exclude
- Long_line_threshold - Maximum line length (optional)
- Alphanum_threshold - Minimum alphanumeric character fraction (optional)
- Alpha filter - Minimum alphabetic character fraction (optional)
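A file with these columns can be read with the standard csv module. The sketch below parses an in-memory sample into a lookup keyed by (language, extension). The sample rows and their values are made up for illustration, load_thresholds is a hypothetical helper, and treating an empty cell as "no threshold" is an assumption about how the optional columns are encoded.

```python
import csv
import io
from typing import Dict, Optional, Tuple

# Hypothetical sample data; real threshold values are not given on this page.
SAMPLE_CSV = """language,extension,Include,Long_line_threshold,Alphanum_threshold,Alpha filter
Python,.py,1,1000,0.25,
JSON,.json,1,,,0.5
Binary,.bin,0,,,
"""


def _optional_float(value: Optional[str]) -> Optional[float]:
    """Parse a cell as float, treating empty or missing cells as None."""
    value = (value or "").strip()
    return float(value) if value else None


def load_thresholds(csv_text: str) -> Dict[Tuple[str, str], dict]:
    """Map (language, extension) to its include flag and optional thresholds."""
    table = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        table[(row["language"], row["extension"])] = {
            "include": row["Include"] == "1",
            "long_line_threshold": _optional_float(row["Long_line_threshold"]),
            "alphanum_threshold": _optional_float(row["Alphanum_threshold"]),
            "alpha_threshold": _optional_float(row["Alpha filter"]),
        }
    return table


thresholds = load_thresholds(SAMPLE_CSV)
print(thresholds[("Python", ".py")])
```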
Related Pages
- NVIDIA_NeMo_Curator_DocumentFilter - Abstract base class that all these filters implement
- NVIDIA_NeMo_Curator_ScoreFilter - Processing stage that applies DocumentFilter instances to document batches
- Environment:NVIDIA_NeMo_Curator_Python_Linux_Base