Implementation:Datajuicer Data juicer RemoveBibliographyMapper
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Mapping |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool provided by Data-Juicer for removing bibliography sections from LaTeX documents.
Description
RemoveBibliographyMapper is a mapper operator that identifies and removes bibliography and reference sections from LaTeX documents. It uses a regular expression to match common LaTeX bibliography commands such as \appendix, \begin{references}, \begin{thebibliography}, and \bibliography. The substitution runs with the re.DOTALL flag, so the dot also matches newlines and a single match extends from the first such command to the end of the document; that entire span is removed. Because \appendix is one of the triggers, any appendix that follows it is removed as well. The operator processes samples in batches for efficiency.
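The matching behavior described above can be sketched with the standard re module alone. This is a minimal illustration, not the operator's source: the pattern is assembled from the commands listed in the description, and the names BIB_PATTERN and strip_bibliography are hypothetical. The exact pattern in remove_bibliography_mapper.py may differ in detail.

```python
import re

# Assumed pattern, modeled on the commands named in the description.
# BIB_PATTERN and strip_bibliography are illustrative names, not Data-Juicer API.
BIB_PATTERN = re.compile(
    r"(\\appendix"
    r"|\\begin\{references\}"
    r"|\\begin\{thebibliography\}"
    r"|\\bibliography\{[^}]*\}"
    r").*$",
    flags=re.DOTALL,  # let "." match newlines so the match runs to the end
)


def strip_bibliography(text: str) -> str:
    """Drop everything from the first bibliography command to the end."""
    return BIB_PATTERN.sub("", text)


doc = (
    "\\section{Intro}\n"
    "Some body text.\n"
    "\\begin{thebibliography}{9}\n"
    "\\bibitem{a} A. Author, Title.\n"
    "\\end{thebibliography}\n"
    "\\end{document}\n"
)
cleaned = strip_bibliography(doc)
print(repr(cleaned))
```

Note that the trailing \end{document} is removed along with the bibliography, since the match always extends to the end of the text. Text without any of the trigger commands passes through unchanged.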
Usage
Use when cleaning academic or scientific LaTeX documents to ensure bibliography sections do not pollute training data with citation noise.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/mapper/remove_bibliography_mapper.py
Signature
@OPERATORS.register_module("remove_bibliography_mapper")
class RemoveBibliographyMapper(Mapper):
    def __init__(self, *args, **kwargs):
Import
from data_juicer.ops.mapper.remove_bibliography_mapper import RemoveBibliographyMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| text | str | Yes | LaTeX document text containing bibliography sections |
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Transformed samples with bibliography sections removed |
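To make the contract concrete, the following sketch mimics batched processing without importing Data-Juicer. It assumes a batch is a dict that maps field names to lists of values, with the documents under a "text" key. TEXT_KEY, PATTERN, and process_batched_sketch are hypothetical stand-ins, and the shortened pattern covers only two of the commands.

```python
import re
from typing import Dict, List

# Hypothetical stand-ins for illustration; not Data-Juicer's actual API.
TEXT_KEY = "text"
PATTERN = r"(\\begin\{thebibliography\}|\\bibliography\{[^}]*\}).*$"


def process_batched_sketch(samples: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Apply the removal to every text in the batch and return the batch."""
    samples[TEXT_KEY] = [
        re.sub(PATTERN, "", text, flags=re.DOTALL)
        for text in samples[TEXT_KEY]
    ]
    return samples


batch = {"text": ["Body A.\n\\bibliography{refs}\n", "Body B, no references."]}
result = process_batched_sketch(batch)
print(result["text"])
```

The batch keeps its shape: one output text per input text, with only the bibliography tail of each document removed.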
Usage Examples
process:
  - remove_bibliography_mapper: