Implementation:Datajuicer Data juicer RemoveTableTextMapper
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Mapping |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete Data-Juicer tool for removing table text from text samples.
Description
RemoveTableTextMapper is a mapper operator that removes tabular text from text samples by detecting and stripping table-like structures whose column count falls within a configurable range. For each column count from min_col to max_col, it applies a regex that matches runs of at least 2 consecutive lines consisting of whitespace- or tab-separated columns, and deletes every match. Both min_col and max_col must lie between 2 and 20. The operator runs in batched mode.
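The loop described above can be sketched as a standalone function. This is a minimal illustration under stated assumptions, not the operator's actual source: the function name `remove_table_text` and the exact regex are hypothetical, chosen only to show the "one pattern per column count, at least two consecutive rows" idea.

```python
import re


def remove_table_text(text: str, min_col: int = 2, max_col: int = 20) -> str:
    """Strip table-like runs of lines (hypothetical sketch, not library code)."""
    for cols in range(min_col, max_col + 1):
        # A row is `cols` non-space tokens joined by single spaces or tabs,
        # followed by one or more newlines. A table is 2+ such rows in a row.
        # The pattern below is an assumption made for illustration.
        pattern = re.compile(
            r"^(?:\S+(?:[ \t]\S+){%d}\n+){2,}" % (cols - 1),
            re.MULTILINE,
        )
        text = pattern.sub("", text)
    return text


doc = (
    "Quarterly results are summarized below for reference.\n"
    "name age\n"
    "alice 30\n"
    "bob 25\n"
    "End of the quarterly report text.\n"
)
cleaned = remove_table_text(doc)
```

Note that a heuristic of this kind also removes ordinary prose when two or more consecutive lines happen to contain the same number of tokens, so narrowing the min_col/max_col range can reduce false positives on short-line text.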
Usage
Use when cleaning documents containing embedded tables (e.g., from web scraping or PDF extraction) that would introduce structured noise into free-text training data.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/mapper/remove_table_text_mapper.py
Signature
@OPERATORS.register_module("remove_table_text_mapper")
class RemoveTableTextMapper(Mapper):
    def __init__(
        self,
        min_col: Annotated[int, Field(ge=2, le=20)] = 2,
        max_col: Annotated[int, Field(ge=2, le=20)] = 20,
        *args,
        **kwargs,
    ):
Import
from data_juicer.ops.mapper.remove_table_text_mapper import RemoveTableTextMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| min_col | int | No | Minimum number of columns in tables to remove (default: 2, range: 2-20) |
| max_col | int | No | Maximum number of columns in tables to remove (default: 20, range: 2-20) |
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Transformed samples with table text removed |
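As a rough illustration of the batched contract, the sketch below treats a batch as a dict of equal-length lists and rewrites only the text field. The key name `"text"`, the `meta` field, and the helper `process_batched_sketch` are assumptions made for this example; the stand-in cleaner handles 2-column tables only.

```python
import re
from typing import Callable, Dict, List


def process_batched_sketch(
    samples: Dict[str, List[str]],
    clean: Callable[[str], str],
    text_key: str = "text",  # assumed field name for this example
) -> Dict[str, List[str]]:
    """Apply a per-text cleaner to every entry of a batch (hypothetical helper)."""
    samples[text_key] = [clean(t) for t in samples[text_key]]
    return samples


# Stand-in cleaner: removes runs of 2+ consecutive two-column lines.
two_col = re.compile(r"^(?:\S+[ \t]\S+\n+){2,}", re.MULTILINE)

batch = {
    "text": [
        "Some heading text here\nid score\na1 0.9\nb2 0.7\n",
        "No tables in this sample at all.\n",
    ],
    "meta": ["doc-1", "doc-2"],
}
out = process_batched_sketch(batch, lambda t: two_col.sub("", t))
```

Other fields in the batch pass through unchanged; only the text entries are transformed.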
Usage Examples
process:
  - remove_table_text_mapper:
      min_col: 2
      max_col: 20