Implementation:Datajuicer Data juicer ExtractTablesFromHtmlMapper
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Mapping |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for extracting table data from HTML content.
Description
ExtractTablesFromHtmlMapper is a mapper operator that parses a sample's HTML content, extracts its tables, and stores them in the sample's metadata. It uses BeautifulSoup (bs4) to parse the HTML and find all table elements. If retain_html_tags is True, each table is kept with its raw HTML tags; otherwise the operator extracts plain-text cell data by iterating over tr, th, and td elements. The include_header parameter controls whether table headers are included in the text-only output. Extracted tables are stored in the metadata field specified by tables_field_name.
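The text-only path described above can be illustrated with a minimal sketch. It is not the operator's code: the operator uses BeautifulSoup, whereas this sketch swaps in Python's standard-library html.parser so that it runs without extra dependencies. The names TableTextParser and extract_tables are hypothetical. The sketch assumes well-formed markup with explicit closing tags and no nested tables, and it assumes that excluding headers means skipping th cells.

```python
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Hypothetical helper: collect plain-text cells from every <table>.

    Assumes explicit closing tags and no nested tables.
    """

    def __init__(self, include_header=True):
        super().__init__()
        self.include_header = include_header
        self.tables = []   # each table is a list of rows; each row a list of str
        self._row = None   # cells of the <tr> currently open, if any
        self._cell = None  # text fragments of the cell currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            # Assumption: header cells (<th>) are skipped when include_header is False.
            if tag == "td" or self.include_header:
                self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # drop rows that ended up with no cells
                self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


def extract_tables(html, include_header=True):
    """Return a list of tables, each a list of rows of stripped cell text."""
    parser = TableTextParser(include_header=include_header)
    parser.feed(html)
    parser.close()
    return [table for table in parser.tables if table]


html = (
    "<table>"
    "<tr><th>Name</th><th>Qty</th></tr>"
    "<tr><td>apple</td><td> 3 </td></tr>"
    "</table>"
)
with_header = extract_tables(html, include_header=True)
without_header = extract_tables(html, include_header=False)
```

Here with_header holds one table with two rows, while without_header keeps only the data row, because the header row contains no td cells and is dropped as empty.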
Usage
Use when processing web-scraped datasets containing HTML content where tables hold valuable structured data that would otherwise be lost during HTML cleaning.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/mapper/extract_tables_from_html_mapper.py
Signature
@OPERATORS.register_module("extract_tables_from_html_mapper")
class ExtractTablesFromHtmlMapper(Mapper):
    def __init__(self,
                 tables_field_name: str = MetaKeys.html_tables,
                 retain_html_tags: bool = False,
                 include_header: bool = True,
                 *args,
                 **kwargs):
Import
from data_juicer.ops.mapper.extract_tables_from_html_mapper import ExtractTablesFromHtmlMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| tables_field_name | str | No | Field name to store the extracted tables, defaults to MetaKeys.html_tables |
| retain_html_tags | bool | No | If True, retains HTML tags in the tables; otherwise removes them, defaults to False |
| include_header | bool | No | If True, includes table headers; effective only when retain_html_tags is False, defaults to True |
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Transformed samples with the extracted tables stored in the metadata field named by tables_field_name |
Usage Examples
process:
  - extract_tables_from_html_mapper:
      retain_html_tags: false
      include_header: true