Implementation:Datajuicer Data juicer CleanLinksMapper
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Mapping |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for removing or replacing URLs and web links in text samples.
Description
CleanLinksMapper is a mapper operator that removes or replaces URLs and web links (HTTP, HTTPS, FTP, etc.) in text samples using regular-expression matching. For each text in a batch, it applies a regex pattern that matches protocol-prefixed links and www domains, and substitutes every match with a configurable replacement string (empty by default, which removes the link). The default pattern, adapted from the CleanText library, also handles complex URL structures such as parentheses and query strings. The operator extends the Mapper base class and runs in batched mode for efficiency.
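The substitution step described above can be sketched with Python's standard `re` module. This is a minimal illustration, not the operator's code: `SIMPLE_URL_PATTERN` and `clean_links` are hypothetical names, and the pattern is a deliberately simplified stand-in for the more comprehensive default shipped with Data-Juicer.

```python
import re

# Simplified, illustrative pattern (an assumption, NOT Data-Juicer's default):
# matches http(s)/ftp links and bare www domains up to the next whitespace
# or angle bracket. Trailing punctuation glued to a link is consumed as well.
SIMPLE_URL_PATTERN = r"(?i)\b(?:(?:https?|ftp)://|www\.)[^\s<>]+"


def clean_links(text: str, pattern: str = SIMPLE_URL_PATTERN, repl: str = "") -> str:
    """Replace every match of `pattern` in `text` with `repl`."""
    return re.sub(pattern, repl, text)


source = "Docs: https://example.com/a?b=1 and www.example.org"

# Empty replacement (the documented default) removes the links entirely.
removed = clean_links(source)

# A placeholder keeps a marker where each link used to be.
replaced = clean_links(source, repl="<URL>")
print(replaced)
```

Removing a link with an empty replacement leaves the surrounding whitespace in place, so a placeholder such as `<URL>` may be preferable when downstream steps are sensitive to spacing.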
Usage
Import this operator when you need to strip web links from training data, since such links typically do not contribute meaningful semantic content.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/mapper/clean_links_mapper.py
Signature
@OPERATORS.register_module("clean_links_mapper")
class CleanLinksMapper(Mapper):
    def __init__(self,
                 pattern: Optional[str] = None,
                 repl: str = "",
                 *args, **kwargs):
Import
from data_juicer.ops.mapper.clean_links_mapper import CleanLinksMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| pattern | Optional[str] | No | Regular expression pattern to search for within text. Default: None, which selects the built-in comprehensive URL-matching pattern |
| repl | str | No | Replacement string for matched patterns. Default: "" (removes links) |
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Transformed samples with URLs and web links removed or replaced in text |
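
To make the contract above concrete, the following sketch mimics it with a stand-in class. Everything here is an assumption made for illustration: `LinkCleanerSketch`, the `text_key` argument, the `process_batched` method name, the column-oriented batch layout, and the fallback pattern are hypothetical and are not taken from the Data-Juicer source.

```python
import re
from typing import Dict, List, Optional

# Hypothetical fallback used only by this sketch; the real default pattern
# is described as more comprehensive.
_FALLBACK_PATTERN = r"(?i)\b(?:(?:https?|ftp)://|www\.)\S+"


class LinkCleanerSketch:
    """Stand-in mirroring the documented parameters; not the Data-Juicer class."""

    def __init__(self, pattern: Optional[str] = None, repl: str = "",
                 text_key: str = "text"):
        # pattern=None falls back to the sketch's own simplified pattern.
        self.pattern = re.compile(pattern if pattern is not None else _FALLBACK_PATTERN)
        self.repl = repl
        self.text_key = text_key  # assumed name of the text column

    def process_batched(self, samples: Dict[str, List]) -> Dict[str, List]:
        # Only the text column is rewritten; other columns pass through unchanged.
        samples[self.text_key] = [
            self.pattern.sub(self.repl, text) for text in samples[self.text_key]
        ]
        return samples


samples = {
    "text": ["Mirror: ftp://files.example.org/data.zip (updated weekly)",
             "Plain sentence."],
    "id": [1, 2],
}

# A custom pattern restricted to FTP links, with a placeholder replacement.
op = LinkCleanerSketch(pattern=r"ftp://\S+", repl="<URL>")
result = op.process_batched(samples)
print(result["text"])
```

Supplying a narrower `pattern` like this is one way to target a single link type while leaving other URLs untouched.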
Usage Examples
YAML Configuration
process:
  - clean_links_mapper:
      repl: "<URL>"