Implementation:Datajuicer Data juicer CleanHtmlMapper

Knowledge Sources
Domains Data_Processing, Mapping
Last Updated 2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for stripping HTML tags from text samples and converting them to plain text.

Description

CleanHtmlMapper is a mapper operator that cleans HTML code from text samples by converting HTML content to plain readable text. It replaces <li> and <ol> tags with newline-and-bullet-point formatting for readability, then uses the selectolax HTML parser to extract the remaining text content from the HTML structure. It operates in batched mode for efficiency. Originally adapted from RedPajama-Data. It extends the Mapper base class.
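The two-step cleaning described above can be sketched roughly as follows. This is a minimal illustration, not the operator's actual source: it swaps in Python's standard-library html.parser for selectolax so that it runs without extra dependencies, and the function name clean_html_sketch, the helper class _TextExtractor, and the exact bullet string "\n*" are assumptions made for the example.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text nodes, discards tags.

    Stand-in for the selectolax parser named in the description.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def clean_html_sketch(raw_html: str) -> str:
    """Convert an HTML string to plain text (illustrative only)."""
    # Step 1: turn list markup into newline-plus-bullet text so list
    # items stay visually separated once the tags are gone.
    # The "\n*" replacement string is an assumption.
    raw_html = raw_html.replace("<li>", "\n*")
    raw_html = raw_html.replace("<ol>", "\n*")

    # Step 2: parse what is left and keep only the text content.
    extractor = _TextExtractor()
    extractor.feed(raw_html)
    extractor.close()
    return extractor.text()


if __name__ == "__main__":
    print(clean_html_sketch("<p>Hello <b>world</b></p>"))
    print(clean_html_sketch("<ul><li>first</li><li>second</li></ul>"))
```

The string replacement runs before parsing because a text-only extraction would otherwise concatenate adjacent list items with no separator.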

      Usage

      Import this operator when you need to remove HTML markup from web-scraped datasets and produce clean plain text.

      Code Reference

      Source Location

      Signature

      @OPERATORS.register_module("clean_html_mapper")
      class CleanHtmlMapper(Mapper):
          def __init__(self, *args, **kwargs):
      

      Import

      from data_juicer.ops.mapper.clean_html_mapper import CleanHtmlMapper
      

      I/O Contract

      Inputs

      Name | Type | Required | Description
      (no custom parameters) | -- | -- | Uses only base Mapper parameters (args, kwargs)

      Outputs

      Name | Type | Description
      samples | Dict | Transformed samples with HTML tags removed and plain text extracted
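To make the batched I/O contract concrete, the sketch below shows one plausible shape for it: a dictionary mapping column names to lists, where only the text column is rewritten. The column name "text", the function map_text_column, and the regex-based strip_tags placeholder are all hypothetical and chosen for illustration; they are not taken from the Data-Juicer source.

```python
import re
from typing import Callable, Dict, List


def strip_tags(text: str) -> str:
    """Hypothetical placeholder cleaner: drops anything that looks like a tag."""
    return re.sub(r"<[^>]+>", "", text)


def map_text_column(
    samples: Dict[str, List],
    clean_fn: Callable[[str], str],
    text_key: str = "text",  # assumed column name
) -> Dict[str, List]:
    """Apply clean_fn to each entry of one column; other columns pass through."""
    out = dict(samples)  # shallow copy so the input batch is not mutated
    out[text_key] = [clean_fn(t) for t in samples[text_key]]
    return out


if __name__ == "__main__":
    batch = {"text": ["<p>a</p>", "plain"], "id": [1, 2]}
    print(map_text_column(batch, strip_tags))
```

Processing a whole column per call, rather than one sample at a time, is what lets a batched mapper amortize per-call overhead across many samples.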

      Usage Examples

      YAML Configuration

      process:
        - clean_html_mapper:
      
