Implementation:Datajuicer Data juicer RemoveBibliographyMapper

From Leeroopedia
Knowledge Sources
Domains Data_Processing, Mapping
Last Updated 2026-02-14 16:00 GMT

Overview

A concrete tool, provided by Data-Juicer, for removing bibliography sections from LaTeX documents.

Description

RemoveBibliographyMapper is a mapper operator that identifies and removes bibliography and reference sections from LaTeX documents. It uses a regular expression to match common LaTeX commands that mark the start of trailing material: \appendix, \begin{references}, \begin{thebibliography}, and \bibliography. Everything from the first match to the end of the document is removed; the re.DOTALL flag lets the match span newlines. Because \appendix is among the matched commands, any appendix is removed along with the bibliography. The operator runs in batched mode for efficiency.
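The removal step described above can be sketched in plain Python. This is a minimal illustration, not the operator's source: the names BIB_START and strip_bibliography are hypothetical, and the pattern is an assumption reconstructed from the command list in the description, so the operator's actual regular expression may differ in detail.

```python
import re

# Assumed pattern, reconstructed from the description; not copied from
# the Data-Juicer source.
BIB_START = re.compile(
    r"(\\appendix"
    r"|\\begin\{references\}"
    r"|\\begin\{thebibliography\}"
    r"|\\bibliography\{"
    r").*$",
    flags=re.DOTALL,  # let '.' match newlines so the match runs to the end
)


def strip_bibliography(text: str) -> str:
    """Drop everything from the first bibliography marker to the end."""
    return BIB_START.sub("", text)


doc = (
    "\\section{Results}\n"
    "We cite prior work~\\cite{knuth}.\n"
    "\\begin{thebibliography}{9}\n"
    "\\bibitem{knuth} D. Knuth, The TeXbook.\n"
    "\\end{thebibliography}\n"
    "\\end{document}\n"
)
cleaned = strip_bibliography(doc)
print(repr(cleaned))
```

In this sketch the closing \end{document} is removed as well, since the match always extends to the end of the text. Inline \cite commands in the body are left untouched.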

Usage

Use when cleaning academic or scientific LaTeX documents to ensure bibliography sections do not pollute training data with citation noise.

Code Reference

Source Location

Signature

@OPERATORS.register_module("remove_bibliography_mapper")
class RemoveBibliographyMapper(Mapper):
    def __init__(self, *args, **kwargs):

Import

from data_juicer.ops.mapper.remove_bibliography_mapper import RemoveBibliographyMapper

I/O Contract

Inputs

Name | Type | Required | Description
text | str | Yes | LaTeX document text containing bibliography sections

Outputs

Name | Type | Description
samples | Dict | Transformed samples with bibliography sections removed
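To make the contract concrete, the sketch below assumes a batch is a dictionary of columns in which each key maps to a list with one entry per sample. The function process_batch, the text_key parameter, and the meta column are hypothetical stand-ins for illustration and are not the operator's own code; the pattern is likewise an assumption based on the description.

```python
import re
from typing import Dict, List

# Assumed marker pattern; the operator's real pattern may differ.
_MARKER = re.compile(
    r"(\\appendix|\\begin\{references\}"
    r"|\\begin\{thebibliography\}|\\bibliography\{).*$",
    re.DOTALL,
)


def process_batch(samples: Dict[str, List], text_key: str = "text") -> Dict[str, List]:
    """Rewrite only the text column; other columns pass through unchanged."""
    samples[text_key] = [_MARKER.sub("", text) for text in samples[text_key]]
    return samples


batch = {
    "text": ["Intro.\n\\bibliography{refs}\n", "No references here."],
    "meta": [{"id": 1}, {"id": 2}],
}
result = process_batch(batch)
print(result["text"])
```

A sample with no matching command is returned unchanged, so the operator is safe to run on mixed collections where only some documents carry a bibliography.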

Usage Examples

process:
  - remove_bibliography_mapper:
