Implementation:Datajuicer Data juicer ExtractTablesFromHtmlMapper

From Leeroopedia
Knowledge Sources
Domains Data_Processing, Mapping
Last Updated 2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for extracting table data from HTML content.

Description

ExtractTablesFromHtmlMapper is a mapper operator that processes HTML content to extract tables and stores them in the sample's metadata. It uses BeautifulSoup (bs4) to parse the HTML and find all table elements. Depending on configuration, it either retains raw HTML tags or extracts plain-text cell data by iterating over tr, th, and td elements. The include_header parameter controls whether table headers are included in text-only output. Extracted tables are stored in the metadata field specified by tables_field_name.
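
The text-only extraction path described above can be sketched as follows. This is a minimal illustration, not the operator's actual code: it swaps in the standard library's html.parser for BeautifulSoup so that it runs without extra dependencies, the names TableTextExtractor and extract_tables are hypothetical, and the way header cells are dropped when include_header is False is an assumption. Nested tables are not handled.

```python
from html.parser import HTMLParser


class TableTextExtractor(HTMLParser):
    """Collect each <table> as a list of rows; each row is a list of cell strings."""

    def __init__(self, include_header=True):
        super().__init__()
        self.include_header = include_header
        self.tables = []    # finished tables
        self._table = None  # rows of the table currently being parsed
        self._row = None    # cells of the current row
        self._cell = None   # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table = []
        elif tag == "tr" and self._table is not None:
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            # Assumption: header (th) cells are skipped when include_header is False.
            if tag == "td" or self.include_header:
                self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # drop rows left empty after header filtering
                self._table.append(self._row)
            self._row = None
            self._cell = None
        elif tag == "table" and self._table is not None:
            self.tables.append(self._table)
            self._table = None


def extract_tables(html, include_header=True):
    """Return plain-text tables found in `html` (hypothetical helper)."""
    parser = TableTextExtractor(include_header)
    parser.feed(html)
    parser.close()
    return parser.tables


html = "<table><tr><th>Name</th><th>Age</th></tr><tr><td>Ada</td><td>36</td></tr></table>"
print(extract_tables(html))                        # [[['Name', 'Age'], ['Ada', '36']]]
print(extract_tables(html, include_header=False))  # [[['Ada', '36']]]
```

Each table becomes a list of rows, which keeps the row and column structure that would otherwise be lost if the HTML were reduced to flat text.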

Usage

Use when processing web-scraped datasets containing HTML content where tables hold valuable structured data that would otherwise be lost during HTML cleaning.

Code Reference

Source Location

Signature

@OPERATORS.register_module("extract_tables_from_html_mapper")
class ExtractTablesFromHtmlMapper(Mapper):
    def __init__(self,
                 tables_field_name: str = MetaKeys.html_tables,
                 retain_html_tags: bool = False,
                 include_header: bool = True,
                 *args,
                 **kwargs):

Import

from data_juicer.ops.mapper.extract_tables_from_html_mapper import ExtractTablesFromHtmlMapper

I/O Contract

Inputs

Name | Type | Required | Description
tables_field_name | str | No | Field name under which the extracted tables are stored; defaults to MetaKeys.html_tables
retain_html_tags | bool | No | If True, retains HTML tags in the tables; otherwise removes them; defaults to False
include_header | bool | No | If True, includes table headers; effective only when retain_html_tags is False; defaults to True

Outputs

Name | Type | Description
samples | Dict | Transformed samples with the extracted tables stored in the meta field
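
To make the output contract concrete, the sketch below shows one plausible shape for a transformed sample. The key names "meta" and "html_tables" and the helper store_tables are hypothetical placeholders chosen for illustration; the real keys come from Data-Juicer's constants (such as MetaKeys.html_tables) and may differ.

```python
# Hypothetical key names, used only for illustration.
META_KEY = "meta"
TABLES_FIELD = "html_tables"


def store_tables(sample, tables, tables_field_name=TABLES_FIELD):
    """Return a copy of `sample` with `tables` stored in its metadata dict."""
    out = dict(sample)
    meta = dict(out.get(META_KEY, {}))
    meta[tables_field_name] = tables
    out[META_KEY] = meta
    return out


sample = {"text": "<table><tr><td>1</td></tr></table>"}
result = store_tables(sample, [[["1"]]])
print(result[META_KEY][TABLES_FIELD])  # [[['1']]]
```

The original text is left untouched here; only the metadata gains a new entry named by tables_field_name.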

Usage Examples

process:
  - extract_tables_from_html_mapper:
      retain_html_tags: false
      include_header: true
