Implementation:Datajuicer Data juicer ExpandMacroMapper

From Leeroopedia
Knowledge Sources
Domains Data_Processing, Mapping
Last Updated 2026-02-14 16:00 GMT

Overview

A concrete tool, provided by Data-Juicer, for expanding LaTeX macro definitions inline within document bodies.

Description

ExpandMacroMapper is a mapper operator that expands user-defined LaTeX macros (those defined with \newcommand and \def) inline within the document body of LaTeX text samples. It parses the text with two regex patterns to extract macro definitions that take no arguments, and builds a dictionary mapping each macro name to its value. It then iterates over that dictionary and substitutes each macro name with its value throughout the text. The replacement is word-boundary-aware, so a macro name that is only part of a longer word is not expanded. Macros with arguments are not currently supported. The operator runs in batched mode and extends the Mapper base class. It was originally adapted from RedPajama-Data's arXiv cleaner.
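
The steps described above can be illustrated with a minimal, self-contained sketch. This is not Data-Juicer's source code: the regex patterns, the helper names build_macro_dict and expand_macros, and the use of a negative lookahead for the word-boundary check are all assumptions chosen for illustration, and the sketch assumes each definition sits on its own line.

```python
import re

# Assumed patterns for argument-free definitions, one definition per line:
#   \newcommand{\name}{value}   and   \def\name{value}
NEWCOMMAND_RE = re.compile(
    r"\\newcommand\*?\{(\\[a-zA-Z0-9]+)\}\{(.*)\}$", re.MULTILINE)
DEF_RE = re.compile(
    r"\\def\s*(\\[a-zA-Z0-9]+)\s*\{(.*)\}$", re.MULTILINE)


def build_macro_dict(text):
    """Collect argument-free macros as a {name: value} dictionary."""
    macros = {}
    for pattern in (NEWCOMMAND_RE, DEF_RE):
        for match in pattern.finditer(text):
            macros[match.group(1)] = match.group(2)
    return macros


def expand_macros(text):
    """Substitute every collected macro name with its value."""
    for name, value in build_macro_dict(text).items():
        # The negative lookahead skips occurrences that are followed by an
        # alphanumeric character, so \R is not expanded inside \Rightarrow.
        pattern = re.compile(re.escape(name) + r"(?![a-zA-Z0-9])")
        # A function replacement keeps backslashes in the value literal.
        text = pattern.sub(lambda _m, v=value: v, text)
    return text


sample = "\n".join([
    r"\newcommand{\R}{\mathbb{R}}",
    r"\def\eps{\epsilon}",
    r"Let $x \in \R$ and $\eps > 0$. \Rightarrow stays.",
])
print(expand_macros(sample))
```

In this sketch the macro name inside each definition line is rewritten as well, because the substitution runs over the whole text. A pipeline that needs the definitions intact would have to strip or skip them separately.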

Usage

Import this operator when you need to expand LaTeX macros in academic datasets, so that downstream processing sees the actual content rather than opaque macro references.

Code Reference

Source Location

Signature

@OPERATORS.register_module("expand_macro_mapper")
class ExpandMacroMapper(Mapper):
    def __init__(self, *args, **kwargs):

Import

from data_juicer.ops.mapper.expand_macro_mapper import ExpandMacroMapper

I/O Contract

Inputs

Name | Type | Required | Description
(no custom parameters) | -- | -- | Uses only base Mapper parameters (args, kwargs)

Outputs

Name | Type | Description
samples | Dict | Transformed samples with LaTeX macros expanded inline in text
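
To make the batched contract more concrete, the sketch below shows one plausible shape for it: a column-oriented dictionary goes in, and a dictionary with the same keys and list lengths comes out, with only the text column changed. The "text" key, the extra "meta" column, and the helpers apply_batched and stand_in_expand are hypothetical and stand in for the operator's real processing, which is not reproduced here.

```python
from typing import Callable, Dict, List


def apply_batched(samples: Dict[str, List[str]],
                  transform: Callable[[str], str],
                  text_key: str = "text") -> Dict[str, List[str]]:
    """Apply `transform` to each text in a column-oriented batch."""
    out = dict(samples)  # shallow copy: other columns pass through unchanged
    out[text_key] = [transform(text) for text in samples[text_key]]
    return out


def stand_in_expand(text: str) -> str:
    # Placeholder for macro expansion: replaces one hard-coded macro only.
    return text.replace(r"\R", r"\mathbb{R}")


batch = {"text": [r"$x \in \R$", "no macros here"], "meta": ["a", "b"]}
result = apply_batched(batch, stand_in_expand)
print(result["text"])
```

Building a new list for the text column, rather than editing it in place, leaves the input batch untouched. That is a design choice of this sketch, not a documented property of the operator.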

Usage Examples

YAML Configuration

process:
  - expand_macro_mapper:
