Implementation:Datajuicer Data juicer RemoveHeaderMapper

From Leeroopedia
Knowledge Sources
Domains Data_Processing, Mapping
Last Updated 2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for removing headers from the beginning of LaTeX documents.

Description

RemoveHeaderMapper is a mapper operator that removes the preamble and header content appearing before the first LaTeX sectioning command in each document sample. It uses a regex pattern to match the sectioning commands (\chapter, \part, \section, \subsection, \subsubsection, \paragraph, \subparagraph) and strips everything before the first match. If no sectioning command is found and drop_no_head is set to True, the sample's text is replaced with an empty string. The operator runs in batched mode.
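The behaviour described above can be illustrated with a minimal, standalone sketch. It is not the Data-Juicer source: the pattern `SECTIONING` and the helper `remove_header` are hypothetical names written for this illustration. The sketch also assumes that text without a sectioning command is returned unchanged when `drop_no_head` is False.

```python
import re

# Hypothetical pattern (an assumption, not the operator's actual regex):
# a backslash, one of the sectioning names, an optional star,
# an optional [short title], then a {title} group.
SECTIONING = re.compile(
    r"\\(?:chapter|part|section|subsection|subsubsection|paragraph|subparagraph)"
    r"\*?(?:\[[^\]]*\])?\{[^}]*\}"
)


def remove_header(text: str, drop_no_head: bool = True) -> str:
    """Return `text` starting at its first sectioning command."""
    match = SECTIONING.search(text)
    if match is None:
        # No sectioning command: clear the text or keep it as-is (assumed).
        return "" if drop_no_head else text
    # Drop everything before the first sectioning command.
    return text[match.start():]


doc = (
    "\\documentclass{article}\n"
    "\\usepackage{amsmath}\n"
    "\\begin{document}\n"
    "\\section{Intro}\n"
    "Hello."
)
cleaned = remove_header(doc)
```

Here `cleaned` begins at `\section{Intro}`, and the document class and package lines are gone. Slicing from `match.start()` keeps the sectioning command itself, which matches the description of stripping only what precedes the first match.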

Usage

Use when cleaning LaTeX documents to remove preamble boilerplate (package imports, document class declarations) that precedes actual content, improving data quality for NLP tasks.

Code Reference

Source Location

Signature

@OPERATORS.register_module("remove_header_mapper")
class RemoveHeaderMapper(Mapper):
    def __init__(self, drop_no_head: bool = True, *args, **kwargs):

Import

from data_juicer.ops.mapper.remove_header_mapper import RemoveHeaderMapper

I/O Contract

Inputs

Name | Type | Required | Description
drop_no_head | bool | No | Whether to drop sample texts without headers (default: True)

Outputs

Name | Type | Description
samples | Dict | Transformed samples with header content removed
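To make the batched contract concrete, the sketch below shows one plausible shape for it: a dict that maps field names to lists, with one list entry per sample. The function `process_batched_sketch`, the pattern `HEAD`, and the `"text"` key are assumptions made for this illustration and are not taken from the operator's source.

```python
import re
from typing import Any, Dict, List

# Hypothetical stand-in for the operator's pattern: matches the start of
# \chapter, \part, \section, \subsection, \subsubsection, \paragraph,
# or \subparagraph, with an optional star and optional [short title].
HEAD = re.compile(
    r"\\(?:chapter|part|(?:sub){0,2}section|(?:sub)?paragraph)"
    r"\*?(?:\[[^\]]*\])?\{"
)


def process_batched_sketch(
    samples: Dict[str, List[Any]],
    text_key: str = "text",  # assumed name of the text field
    drop_no_head: bool = True,
) -> Dict[str, List[Any]]:
    """Strip the header from every text in a column-oriented batch."""
    cleaned = []
    for text in samples[text_key]:
        match = HEAD.search(text)
        if match:
            cleaned.append(text[match.start():])
        else:
            # Assumed behaviour: clear the text, or keep it when not dropping.
            cleaned.append("" if drop_no_head else text)
    # Return a new dict so the input batch is left untouched.
    return {**samples, text_key: cleaned}


batch = {
    "text": ["\\usepackage{x}\n\\subsection*{A}\nbody", "plain"],
    "meta": [1, 2],
}
out = process_batched_sketch(batch)
```

In this sketch the first text is trimmed to start at `\subsection*{A}`, the second becomes an empty string because it has no sectioning command, and the other fields pass through unchanged.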

Usage Examples

process:
  - remove_header_mapper:
      drop_no_head: true
