Implementation:Datajuicer Data juicer RemoveTableTextMapper

From Leeroopedia
Knowledge Sources
Domains Data_Processing, Mapping
Last Updated 2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for removing table text from text samples.

Description

RemoveTableTextMapper is a mapper operator that detects table-like structures in text samples and strips them. For each column count from min_col to max_col, it applies a regex that matches runs of at least two consecutive lines whose columns are separated by whitespace or tabs, and removes the matched rows. Both min_col and max_col must lie between 2 and 20. The operator runs in batched mode.
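The loop described above can be sketched in plain Python. This is a minimal illustration written for this page, not the Data-Juicer source: the function name remove_table_text and the exact regex are assumptions, and the real operator's pattern may differ in detail.

```python
import re


def remove_table_text(text: str, min_col: int = 2, max_col: int = 20) -> str:
    """Hypothetical helper: strip table-like runs of lines from text."""
    for n_cols in range(min_col, max_col + 1):
        # One row: n_cols non-whitespace tokens, each pair separated by a
        # single space or tab, followed by a newline.
        row = rf"\S+(?:[ \t]\S+){{{n_cols - 1}}}\n"
        # A table: two or more such rows in a row, starting at a line start.
        pattern = re.compile(rf"^(?:{row}){{2,}}", re.MULTILINE)
        text = pattern.sub("", text)
    return text


doc = (
    "Quarterly results are listed below.\n"
    "name\tage\n"
    "alice\t30\n"
    "bob\t25\n"
    "End of report.\n"
)
cleaned = remove_table_text(doc)
```

In this sketch, any two consecutive lines with the same small number of space-separated tokens also count as a table, so narrowing min_col and max_col is one way to limit false positives on short prose lines.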

Usage

Use when cleaning documents containing embedded tables (e.g., from web scraping or PDF extraction) that would introduce structured noise into free-text training data.

Code Reference

Source Location

Signature

@OPERATORS.register_module("remove_table_text_mapper")
class RemoveTableTextMapper(Mapper):
    def __init__(
        self,
        min_col: Annotated[int, Field(ge=2, le=20)] = 2,
        max_col: Annotated[int, Field(ge=2, le=20)] = 20,
        *args,
        **kwargs,
    ):

Import

from data_juicer.ops.mapper.remove_table_text_mapper import RemoveTableTextMapper

I/O Contract

Inputs

Name | Type | Required | Description
min_col | int | No | Minimum number of columns in tables to remove (default: 2, range: 2-20)
max_col | int | No | Maximum number of columns in tables to remove (default: 20, range: 2-20)

Outputs

Name | Type | Description
samples | Dict | Transformed samples with table text removed
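To make the batched contract concrete, the sketch below shows one plausible shape for the samples dict: a column-oriented batch in which each key maps to a list of values. The "text" key, the map_batched helper, and the stand-in cleaner are all assumptions for illustration, not part of the Data-Juicer API.

```python
from typing import Callable, Dict, List


def map_batched(
    samples: Dict[str, List],
    clean: Callable[[str], str],
    text_key: str = "text",  # assumed name of the text column
) -> Dict[str, List]:
    """Hypothetical helper: apply clean to every text in a batch."""
    cleaned = [clean(text) for text in samples[text_key]]
    # Build a new dict so the input batch is left untouched;
    # columns other than text_key pass through unchanged.
    return {**samples, text_key: cleaned}


batch = {"text": ["a b\nc d\n", "plain sentence here"], "id": [1, 2]}
# str.strip stands in for the table-removal step.
result = map_batched(batch, clean=str.strip)
```

Only the text column changes; the number of samples and all other columns are preserved.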

Usage Examples

process:
  - remove_table_text_mapper:
      min_col: 2
      max_col: 20
