
Implementation:Datajuicer Data juicer RemoveCommentsMapper

Knowledge Sources
Domains: Data_Processing, Mapping
Last Updated: 2026-02-14 16:00 GMT

Overview

A concrete tool, provided by Data-Juicer, for removing comments from LaTeX documents.

Description

RemoveCommentsMapper is a mapper operator that removes inline and multiline comments from text samples. It currently supports only the 'tex' document format. It uses two regex patterns: the inline pattern removes everything from an unescaped % character to the end of the line, and the multiline pattern removes entire lines that begin with %. The boolean parameters inline and multiline enable or disable each kind of removal independently. The operator runs in batched mode.
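The two patterns described above can be sketched with Python's standard `re` module. This is a minimal illustration, not the operator's source: the names `INLINE_COMMENT` and `FULL_LINE_COMMENT` and both regexes are assumptions chosen to match the description, and the patterns Data-Juicer actually uses may differ in detail.

```python
import re

# Assumed patterns for illustration; the operator's real regexes may differ.
# Inline: a '%' that is not preceded by a backslash, plus the rest of the line.
INLINE_COMMENT = re.compile(r"(?<!\\)%.*$", re.MULTILINE)
# Whole-line: a line that starts with '%', including its trailing newline.
FULL_LINE_COMMENT = re.compile(r"^%.*\n?", re.MULTILINE)

tex = "% header comment\nRate is 5\\% per year % inline note\n\\section{Intro}\n"

# Drop whole comment lines first so the inline pattern does not turn them
# into empty lines.
without_full_lines = FULL_LINE_COMMENT.sub("", tex)

# Then strip trailing comments; the escaped '\%' is left alone.
without_comments = INLINE_COMMENT.sub("", without_full_lines)

print(without_comments)
```

In this sketch the escaped percent sign in `5\%` survives because the lookbehind rejects a `%` that directly follows a backslash. The space that preceded the inline comment is kept, since only the text from `%` onward is removed.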

Usage

Use when cleaning LaTeX source files where comments contain non-content text that would degrade training data quality.

Code Reference

Source Location

Signature

@OPERATORS.register_module("remove_comments_mapper")
class RemoveCommentsMapper(Mapper):
    def __init__(
        self, doc_type: Union[str, List[str]] = "tex", inline: bool = True, multiline: bool = True, *args, **kwargs
    ):

Import

from data_juicer.ops.mapper.remove_comments_mapper import RemoveCommentsMapper

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| doc_type | Union[str, List[str]] | No | Type of document to remove comments from (default: "tex") |
| inline | bool | No | Whether to remove inline comments (default: True) |
| multiline | bool | No | Whether to remove multiline comments (default: True) |

Outputs

| Name | Type | Description |
|---|---|---|
| samples | Dict | Transformed samples with comments removed |
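The contract above can be illustrated with a small stand-in function, assuming a batch is a dict that maps field names to lists of values and that the text lives under a `"text"` key. The function `remove_comments_batched`, the `text_key` parameter, and the regexes are hypothetical and written only for this sketch; they are not Data-Juicer's API or implementation.

```python
import re
from typing import Any, Dict, List


def remove_comments_batched(
    samples: Dict[str, List[Any]],
    text_key: str = "text",  # assumed field name for the text column
    inline: bool = True,
    multiline: bool = True,
) -> Dict[str, List[Any]]:
    """Hypothetical stand-in showing the batched dict-in, dict-out shape."""
    cleaned = []
    for text in samples[text_key]:
        if multiline:
            # Remove whole lines that begin with '%'.
            text = re.sub(r"^%.*\n?", "", text, flags=re.MULTILINE)
        if inline:
            # Remove from an unescaped '%' to the end of the line.
            text = re.sub(r"(?<!\\)%.*$", "", text, flags=re.MULTILINE)
        cleaned.append(text)
    # Return a new dict; all other fields pass through unchanged.
    return {**samples, text_key: cleaned}


batch = {"text": ["a % note\nb\n", "% only a comment\nc\n"], "id": [1, 2]}
out = remove_comments_batched(batch)
inline_only = remove_comments_batched(batch, multiline=False)
```

Setting `multiline=False` in this sketch leaves an empty line where the comment line used to be, because the inline pattern strips the text but not the newline.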

Usage Examples

process:
  - remove_comments_mapper:
      doc_type: 'tex'
      inline: true
      multiline: true

Related Pages
