Implementation:Datajuicer Data juicer CleanCopyrightMapper
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Mapping |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for removing copyright comment headers from the beginning of text samples.
Description
CleanCopyrightMapper is a mapper operator that removes copyright comments from the start of text samples, as commonly found in source code files. It works in two phases. First, it searches for a multi-line C-style comment (/* ... */) containing the word "copyright" and strips it. If no such block is found, it greedily removes leading lines that are empty or start with a comment marker (//, #, --), since a comment block at the top of a code file is usually a copyright header. The operator runs in batched mode for efficiency and extends the Mapper base class. The logic was originally adapted from RedPajama-Data.
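The two-phase approach described above can be sketched in plain Python as follows. This is a minimal illustration built only from the description, not the operator's actual source: the function name clean_copyright, both regular expressions, and the case-insensitive match on "copyright" are assumptions, and the sketch reads "no such block" as "no block comment at all".

```python
import re

# Hypothetical patterns for this sketch; the operator's own regexes may differ.
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
COPYRIGHT = re.compile(r"copyright", re.IGNORECASE)  # case-insensitivity is assumed
LINE_COMMENT_PREFIXES = ("//", "#", "--")


def clean_copyright(text: str) -> str:
    """Sketch of the two-phase copyright cleanup (illustrative only)."""
    # Phase 1: look for the first /* ... */ block comment.
    match = BLOCK_COMMENT.search(text)
    if match:
        # Strip the block only if it mentions "copyright".
        if COPYRIGHT.search(match.group(0)):
            text = text[: match.start()] + text[match.end():]
        return text

    # Phase 2: no block comment was found, so greedily skip leading lines
    # that are empty or start with a comment marker.
    lines = text.split("\n")
    skip = 0
    for line in lines:
        if not line or line.startswith(LINE_COMMENT_PREFIXES):
            skip += 1
        else:
            break
    return "\n".join(lines[skip:])
```

Note that in this sketch the newline following a stripped block comment is left in place, and a file that opens with ordinary (non-copyright) line comments loses them too, which is the cost of the greedy second phase.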
Usage
Import this operator when you need to remove boilerplate copyright notices from source-code datasets.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/mapper/clean_copyright_mapper.py
Signature
@OPERATORS.register_module("clean_copyright_mapper")
class CleanCopyrightMapper(Mapper):
    def __init__(self, *args, **kwargs):
Import
from data_juicer.ops.mapper.clean_copyright_mapper import CleanCopyrightMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| (no custom parameters) | -- | -- | Uses only base Mapper parameters (args, kwargs) |
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Transformed samples with copyright headers removed from text |
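To make the batched contract concrete, the sketch below shows a dict of columns going in and the same dict coming out with its text column transformed. It is a standalone illustration rather than Data-Juicer code: the helper apply_to_text_column, the stand-in cleaner drop_hash_header, and the "text" field name are all assumptions made for the example.

```python
from typing import Callable, Dict, List


def apply_to_text_column(
    samples: Dict[str, List[str]],
    clean: Callable[[str], str],
    text_key: str = "text",  # assumed field name for this example
) -> Dict[str, List[str]]:
    """Apply a per-sample cleaner to every entry of the text column."""
    samples[text_key] = [clean(text) for text in samples[text_key]]
    return samples


def drop_hash_header(text: str) -> str:
    """Stand-in cleaner for the illustration: drops leading '#' lines."""
    lines = text.split("\n")
    while lines and lines[0].startswith("#"):
        lines.pop(0)
    return "\n".join(lines)


# A batch is a dict of equally long columns; only the text column changes.
batch = {"text": ["# Copyright 2024\nx = 1", "y = 2"], "meta": ["a", "b"]}
result = apply_to_text_column(batch, drop_hash_header)
```

Other columns (here the hypothetical "meta") pass through untouched, which matches the table's description of the output as the input samples with only the text modified.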
Usage Examples
YAML Configuration
process:
  - clean_copyright_mapper: