Implementation:Datajuicer Data juicer CleanCopyrightMapper

From Leeroopedia
Knowledge Sources
Domains Data_Processing, Mapping
Last Updated 2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for removing copyright comment headers from the beginning of text samples.

Description

CleanCopyrightMapper is a mapper operator, extending the Mapper base class, that removes copyright comments from the start of text samples, as commonly found in source code files. It works in two phases. First, it searches for a multi-line C-style comment (/* ... */) containing the word "copyright" and strips it. If no such block is found, it greedily removes leading lines that are empty or that start with a comment marker (//, #, --), since these are typically copyright headers in code files. The operator runs in batched mode for efficiency and was originally adapted from RedPajama-Data.
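
The two-phase approach described above can be sketched in plain Python. This is a minimal illustration, not the operator's actual source: the function name strip_copyright_header, the regular expressions, and the case-insensitive match on "copyright" are assumptions. The sketch also assumes that the line-based fallback runs only when the text contains no block comment at all; how the real operator treats a block comment that lacks the word "copyright" is not stated here.

```python
import re

# Assumed patterns; the real operator may use different expressions.
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)    # first /* ... */ block
COPYRIGHT = re.compile(r"copyright", re.IGNORECASE)    # assumed case-insensitive
LINE_MARKERS = ("//", "#", "--")                       # markers named in the description


def strip_copyright_header(text: str) -> str:
    """Hypothetical stand-in for the per-sample cleaning step."""
    # Phase 1: look for a C-style block comment and strip it if it
    # mentions "copyright".
    match = BLOCK_COMMENT.search(text)
    if match:
        if COPYRIGHT.search(match.group(0)):
            return text[:match.start()] + text[match.end():]
        # Assumption: a block comment without "copyright" leaves the text as-is.
        return text

    # Phase 2: greedily skip leading lines that are empty or start with
    # a comment marker.
    lines = text.split("\n")
    skip = 0
    for line in lines:
        if line.startswith(LINE_MARKERS) or not line:
            skip += 1
        else:
            break
    return "\n".join(lines[skip:])


block_sample = "/* Copyright 2020 ACME */\nint main() {}"
line_sample = "// Copyright 2020\n# license\n\nprint('hi')"
plain_sample = "x = 1\n# trailing comment"

block_result = strip_copyright_header(block_sample)
line_result = strip_copyright_header(line_sample)
plain_result = strip_copyright_header(plain_sample)
```

Note that the second phase is greedy: it removes every leading comment line, whether or not the line mentions copyright, so a leading shebang or docstring-style comment would also be dropped under this sketch.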

Usage

Import this operator when you need to remove boilerplate copyright notices from code-based datasets.

Code Reference

Source Location

Signature

@OPERATORS.register_module("clean_copyright_mapper")
class CleanCopyrightMapper(Mapper):
    def __init__(self, *args, **kwargs):

Import

from data_juicer.ops.mapper.clean_copyright_mapper import CleanCopyrightMapper

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| (no custom parameters) | -- | -- | Uses only base Mapper parameters (args, kwargs) |

Outputs

| Name | Type | Description |
|------|------|-------------|
| samples | Dict | Transformed samples with copyright headers removed from text |
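
To make the batched contract concrete, the following sketch shows one way a dictionary of samples might be transformed column-wise. It is an assumption-laden illustration rather than Data-Juicer's code: the helper apply_batched, the "text" key, and the stand-in cleaner drop_leading_hash_lines are all hypothetical names introduced here.

```python
from typing import Callable, Dict, List


def apply_batched(
    samples: Dict[str, List],
    clean: Callable[[str], str],
    text_key: str = "text",  # assumed name of the text column
) -> Dict[str, List]:
    """Apply a per-sample cleaner to every entry of the text column."""
    out = dict(samples)  # shallow copy so the input batch is not mutated
    out[text_key] = [clean(text) for text in samples[text_key]]
    return out


def drop_leading_hash_lines(text: str) -> str:
    """Simplified stand-in cleaner: remove leading lines that start with '#'."""
    lines = text.split("\n")
    while lines and lines[0].startswith("#"):
        lines.pop(0)
    return "\n".join(lines)


batch = {
    "text": ["# Copyright 2024\nx = 1", "y = 2"],
    "meta": [{"id": 0}, {"id": 1}],
}
cleaned = apply_batched(batch, drop_leading_hash_lines)
```

In this sketch only the text column changes; other columns such as meta pass through untouched, which matches the output description of a transformed samples dictionary.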

Usage Examples

YAML Configuration

process:
  - clean_copyright_mapper:
