Implementation:Datajuicer Data juicer ExpandMacroMapper
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Mapping |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool provided by Data-Juicer for expanding LaTeX macro definitions inline within document bodies.
Description
ExpandMacroMapper is a mapper operator that expands user-defined LaTeX macros (defined with \newcommand or \def) inline within the document body of LaTeX text samples. It parses the text with two regex patterns to extract macro definitions that take no arguments, builds a dictionary mapping each macro name to its value, and then substitutes each macro name with its value throughout the text, one macro at a time. Replacement is word-boundary-aware, so a macro name is not expanded when it appears as part of a longer word. Macros that take arguments are not currently supported. The operator runs in batched mode, extends the Mapper base class, and was originally adapted from the arXiv cleaner in RedPajama-Data.
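The steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the operator's actual source: the two regex patterns, the helper names `collect_macros` and `expand_macros`, and the use of a negative lookahead for the boundary check are all hypothetical choices made for this example.

```python
import re

# Hypothetical patterns for argument-free definitions written on one line.
# A definition such as \newcommand{\foo}[1]{...} does not match, because
# the "[1]" argument count sits between the name and the value.
NEWCOMMAND_RE = re.compile(
    r"\\newcommand\*?\{(\\[a-zA-Z0-9]+)\}\{(.*)\}$", re.MULTILINE
)
DEF_RE = re.compile(r"\\def\s*(\\[a-zA-Z0-9]+)\s*\{(.*)\}$", re.MULTILINE)


def collect_macros(text):
    """Build a macro-name-to-value dictionary from the definitions in text."""
    macros = {}
    for pattern in (NEWCOMMAND_RE, DEF_RE):
        for match in pattern.finditer(text):
            macros[match.group(1)] = match.group(2)
    return macros


def expand_macros(text):
    """Substitute each argument-free macro with its value, one at a time."""
    for name, value in collect_macros(text).items():
        # The lookahead rejects matches followed by a letter or digit,
        # so \R is expanded but \Real is left alone.
        pattern = re.compile(re.escape(name) + r"(?![a-zA-Z0-9])")
        # A function replacement keeps backslashes in the value literal.
        text = pattern.sub(lambda _match, v=value: v, text)
    return text


if __name__ == "__main__":
    sample = (
        "\\newcommand{\\R}{\\mathbb{R}}\n"
        "\\def\\eps{\\varepsilon}\n"
        "Let $x \\in \\R$, $\\eps > 0$, $\\Real$, $\\epsilon$."
    )
    print(expand_macros(sample))
```

In this sketch the definition lines are rewritten as well, because the macro name inside `\newcommand{...}` is also followed by a non-alphanumeric character; only the body line is of interest for the illustration.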
Usage
Import when you need to expand LaTeX macros in academic datasets so downstream processing sees actual content rather than opaque macro references.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/mapper/expand_macro_mapper.py
Signature
@OPERATORS.register_module("expand_macro_mapper")
class ExpandMacroMapper(Mapper):
    def __init__(self, *args, **kwargs):
Import
from data_juicer.ops.mapper.expand_macro_mapper import ExpandMacroMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| (no custom parameters) | -- | -- | Uses only base Mapper parameters (args, kwargs) |
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Transformed samples with LaTeX macros expanded inline in text |
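To make the shape of the batched contract concrete, the sketch below treats a batch as a dictionary of equal-length columns and rewrites only the text column. It is an illustration under assumptions: the column name `"text"`, the helper `apply_batched`, and the simple `str.replace` stand-in for the expansion step are hypothetical and are not taken from the operator's code.

```python
from typing import Callable, Dict, List


def apply_batched(
    samples: Dict[str, List[str]],
    transform: Callable[[str], str],
    text_key: str = "text",  # assumed column name for this sketch
) -> Dict[str, List[str]]:
    """Apply transform to every entry of the text column and return the batch."""
    samples[text_key] = [transform(text) for text in samples[text_key]]
    return samples


# Stand-in for macro expansion; unlike the real operator it does no
# boundary checking and handles a single hard-coded macro.
def toy_expand(text: str) -> str:
    return text.replace("\\R", "\\mathbb{R}")


batch = {"text": ["$x \\in \\R$", "no macros here"], "source": ["a", "b"]}
result = apply_batched(batch, toy_expand)
```

Other columns pass through unchanged, and the number of samples in the batch stays the same.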
Usage Examples
YAML Configuration
process:
  - expand_macro_mapper: