Implementation:Datajuicer Data juicer RemoveCommentsMapper
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Mapping |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for removing comments from LaTeX documents.
Description
RemoveCommentsMapper is a mapper operator that removes inline and multiline comments from text samples. It currently supports only the 'tex' document format. It applies two regex patterns: the inline pattern removes the text that follows an unescaped % character within a line, and the multiline pattern removes entire lines that begin with %. Each pass can be enabled or disabled independently through the inline and multiline boolean parameters. The operator runs in batched mode, so each call processes a batch of samples rather than a single sample.
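The two passes described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the operator's actual source: the function name `strip_tex_comments` and both regex patterns are hypothetical stand-ins chosen to match the description, and the real patterns in Data-Juicer may differ in detail (for example in how they treat the character before the %).

```python
import re

# Hypothetical patterns written for this sketch; not copied from Data-Juicer.
# Inline: a '%' that is neither escaped (preceded by a backslash) nor at the
# start of a line, plus everything up to the end of that line.
INLINE_COMMENT = re.compile(r"(?<=[^\\\n])%.*$", flags=re.MULTILINE)
# Multiline: a whole line that begins with '%', including its newline.
FULL_LINE_COMMENT = re.compile(r"^%.*\n?", flags=re.MULTILINE)


def strip_tex_comments(text: str, inline: bool = True, multiline: bool = True) -> str:
    """Remove LaTeX comments; each pass can be switched off independently."""
    if inline:
        text = INLINE_COMMENT.sub("", text)
    if multiline:
        text = FULL_LINE_COMMENT.sub("", text)
    return text


sample = (
    "% header comment\n"
    "\\section{Intro} % trailing note\n"
    "50\\% of cases\n"
    "% another\n"
    "End"
)
cleaned = strip_tex_comments(sample)
```

In this sketch the escaped `\%` on the third line is left alone, the trailing note is cut from the second line (the space before it remains), and the two lines that start with % are dropped entirely. Edge cases such as a % preceded by a double backslash are not handled.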
Usage
Use when cleaning LaTeX source files where comments contain non-content text that would degrade training data quality.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/mapper/remove_comments_mapper.py
Signature
@OPERATORS.register_module("remove_comments_mapper")
class RemoveCommentsMapper(Mapper):
    def __init__(
        self, doc_type: Union[str, List[str]] = "tex", inline: bool = True, multiline: bool = True, *args, **kwargs
    ):
Import
from data_juicer.ops.mapper.remove_comments_mapper import RemoveCommentsMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| doc_type | Union[str, List[str]] | No | Type of document to remove comments from; only "tex" is currently supported (default: "tex") |
| inline | bool | No | Whether to remove inline comments (default: True) |
| multiline | bool | No | Whether to remove multiline comments (default: True) |
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Transformed samples with comments removed |
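Because the operator runs in batched mode, the samples dict is handled as a batch. The sketch below illustrates one plausible shape of that contract, assuming a column-oriented layout in which each key maps to a list with one entry per sample and the text lives under a "text" key. The helper names `map_text_column` and `drop_comment_lines` are hypothetical and are not part of Data-Juicer.

```python
import re
from typing import Callable, Dict, List


def map_text_column(
    samples: Dict[str, List],
    transform: Callable[[str], str],
    text_key: str = "text",  # assumed key name for this sketch
) -> Dict[str, List]:
    """Apply `transform` to every entry of the text column, in place."""
    for idx, text in enumerate(samples[text_key]):
        samples[text_key][idx] = transform(text)
    return samples


def drop_comment_lines(text: str) -> str:
    """Stand-in transform: remove whole lines that begin with '%'."""
    return re.sub(r"^%.*\n?", "", text, flags=re.MULTILINE)


batch = {"text": ["% note\nA", "B"], "source": ["a.tex", "b.tex"]}
out = map_text_column(batch, drop_comment_lines)
```

Only the text column changes; other columns, such as the illustrative "source" field here, pass through untouched, and the same dict is returned.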
Usage Examples
process:
  - remove_comments_mapper:
      doc_type: 'tex'
      inline: true
      multiline: true