Implementation:Datajuicer Data juicer CleanHtmlMapper
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Mapping |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
A concrete tool, provided by Data-Juicer, for stripping HTML tags from text samples and converting them to plain text.
Description
CleanHtmlMapper is a mapper operator that removes HTML markup from text samples by converting HTML content to readable plain text. It first replaces list tags such as `<li>` with newline-and-bullet-point formatting for readability, then uses the selectolax HTML parser to extract the remaining text content from the HTML structure. It operates in batched mode for efficiency and extends the Mapper base class. The implementation was originally adapted from RedPajama-Data.
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/mapper/clean_html_mapper.py
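The two steps in the description (rewrite list items as bullets, then extract the text) can be sketched as follows. This is a minimal illustration, not the operator's code: it swaps in Python's standard-library `html.parser` for selectolax so that it runs without extra dependencies, and the names `_TextExtractor` and `clean_html_sketch` are hypothetical. Edge-case behavior (malformed markup, script or style content) may differ from the real operator.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects only the text nodes of an HTML string."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.chunks.append(data)


def clean_html_sketch(raw_html: str) -> str:
    # Step 1: turn list items into newline-plus-bullet markers.
    raw_html = raw_html.replace("<li>", "\n*").replace("</li>", "")
    # Step 2: drop the remaining markup and keep the text content.
    # (Assumption: stdlib parser used here in place of selectolax.)
    extractor = _TextExtractor()
    extractor.feed(raw_html)
    extractor.close()
    return "".join(extractor.chunks)


html_doc = "<p>Fruits:</p><ul><li>apple</li><li>pear</li></ul>"
print(clean_html_sketch(html_doc))
```

Rewriting the list tags before parsing is what keeps list structure visible in the output; a plain text extraction alone would run the items together.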
Usage
Import this operator when you need to remove HTML markup from web-scraped datasets and produce clean plain text.
Code Reference
Source Location
Signature
@OPERATORS.register_module("clean_html_mapper")
class CleanHtmlMapper(Mapper):
    def __init__(self, *args, **kwargs):
Import
from data_juicer.ops.mapper.clean_html_mapper import CleanHtmlMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| (no custom parameters) | -- | -- | Uses only base Mapper parameters (args, kwargs) |
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Transformed samples with HTML tags removed and plain text extracted |
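To make the batched contract concrete, the sketch below shows one plausible shape for it: a batch is a dict of columns, and the cleaner is applied to every entry of the text column. The function `process_batched_sketch`, its `text_key` parameter, and the regex-based `strip_tags_stand_in` cleaner are all hypothetical names introduced for illustration; the actual operator's method and cleaner may differ.

```python
import re
from typing import Callable, Dict, List


def process_batched_sketch(
    samples: Dict[str, List[str]],
    clean_fn: Callable[[str], str],
    text_key: str = "text",
) -> Dict[str, List[str]]:
    # Build a new batch so the caller's dict is left unchanged.
    out = dict(samples)
    # Apply the cleaner to each sample in the text column.
    out[text_key] = [clean_fn(text) for text in samples[text_key]]
    return out


def strip_tags_stand_in(text: str) -> str:
    # Crude regex stand-in for the real HTML cleaner; illustration only.
    return re.sub(r"<[^>]+>", "", text)


batch = {"text": ["<p>Hello</p>", "<b>World</b>"]}
cleaned = process_batched_sketch(batch, strip_tags_stand_in)
print(cleaned["text"])
```

Processing a whole column per call, rather than one sample at a time, is what the description refers to as batched mode.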
Usage Examples
YAML Configuration
process:
  - clean_html_mapper: