Implementation:Datajuicer Data juicer RemoveHeaderMapper
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Mapping |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for removing header content from the beginning of LaTeX documents.
Description
RemoveHeaderMapper is a mapper operator that removes the preamble and header content appearing before the first LaTeX sectioning command in each document sample. It uses a regex pattern to match the sectioning commands (chapter, part, section, subsection, subsubsection, paragraph, subparagraph) and strips everything before the first match, so the retained text starts at that command. If no sectioning command is found and drop_no_head is set to True, the sample's entire text is cleared. The operator runs in batched mode.
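The behavior described above can be sketched in plain Python. This is a minimal illustration, not the operator's actual code: the regex is a simplified stand-in for the real pattern, the helper names remove_header and process_batched are hypothetical, and both the "text" key and the choice to leave text unchanged when drop_no_head is False are assumptions.

```python
import re

# Sectioning commands named in the description above.
SECTION_CMDS = (
    "chapter", "part", "section", "subsection",
    "subsubsection", "paragraph", "subparagraph",
)

# Simplified stand-in pattern (an assumption, not the operator's regex):
# a backslash, a command name, an optional star, an optional [short title],
# then the opening brace of the title.
SECTION_RE = re.compile(
    r"\\(?:" + "|".join(SECTION_CMDS) + r")\*?(?:\[[^\]]*\])?\{"
)


def remove_header(text: str, drop_no_head: bool = True) -> str:
    """Return text starting at the first sectioning command."""
    match = SECTION_RE.search(text)
    if match is None:
        # No sectioning command: clear the text if requested.
        # Returning the text unchanged otherwise is an assumption.
        return "" if drop_no_head else text
    return text[match.start():]


def process_batched(samples: dict, text_key: str = "text",
                    drop_no_head: bool = True) -> dict:
    """Apply remove_header to every text in a column-oriented batch."""
    samples[text_key] = [
        remove_header(text, drop_no_head) for text in samples[text_key]
    ]
    return samples


doc = (
    "\\documentclass{article}\n"
    "\\usepackage{amsmath}\n"
    "\\section{Intro}\n"
    "Body text."
)
batch = {"text": [doc, "plain text with no sectioning command"]}
result = process_batched(batch)
print(result["text"])
```

In this sketch the first sample keeps everything from \section{Intro} onward, and the second sample, which has no sectioning command, is reduced to an empty string because drop_no_head defaults to True.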
Usage
Use when cleaning LaTeX documents to remove preamble boilerplate (package imports, document class declarations) that precedes actual content, improving data quality for NLP tasks.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/mapper/remove_header_mapper.py
Signature
@OPERATORS.register_module("remove_header_mapper")
class RemoveHeaderMapper(Mapper):
    def __init__(self, drop_no_head: bool = True, *args, **kwargs):
Import
from data_juicer.ops.mapper.remove_header_mapper import RemoveHeaderMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| drop_no_head | bool | No | Whether to clear the text of samples in which no header (sectioning command) is found (default: True) |
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Transformed samples with header content removed |
Usage Examples
process:
  - remove_header_mapper:
      drop_no_head: true