Principle:CarperAI Trlx Distributed Logging
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Distributed_Training |
| Last Updated | 2026-02-07 16:00 GMT |
Overview
Pattern for managing log output in multi-process distributed training environments to prevent duplicate messages and enable rank-specific filtering.
Description
In distributed training, every process (rank) executes the same code, so a job with N processes can emit N copies of every log message. Distributed logging addresses this by filtering messages on the process rank, emitting from rank 0 only by default. Related concerns include configurable verbosity levels, thread-safe logger initialization, and the ability to enable logging from specific ranks for debugging.
Usage
Use this principle in any distributed training framework where multiple processes run concurrently. It is essential for keeping log output readable and preventing log file bloat in multi-GPU or multi-node setups.
Theoretical Basis
The pattern is based on three mechanisms:
- Rank Filtering: Each log call checks the current process rank against an allow-list. Only matching ranks emit the message.
- Hierarchical Verbosity: Log levels (DEBUG, INFO, WARNING, ERROR, CRITICAL) are controlled globally through an environment variable or an API call, and the setting applies to all loggers in the library.
- Singleton Initialization: Thread-safe, lazy initialization of the root logger ensures consistent configuration across all modules.
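The second and third mechanisms in the list above can be sketched with Python's standard `logging` module. This is a minimal illustration under stated assumptions, not the implementation of any particular library: the names `LIBRARY_NAME`, `VERBOSITY_ENV_VAR`, `get_logger`, and `set_verbosity` are hypothetical and chosen only for this example.

```python
import logging
import os
import threading

# Hypothetical names for illustration; not taken from any specific library.
LIBRARY_NAME = "mylib"
VERBOSITY_ENV_VAR = "MYLIB_VERBOSITY"

_LEVELS = {
    "debug": logging.DEBUG,
    "info": logging.INFO,
    "warning": logging.WARNING,
    "error": logging.ERROR,
    "critical": logging.CRITICAL,
}

_lock = threading.Lock()
_handler = None  # created once, on first use


def _default_level():
    """Read the level name from the environment, falling back to INFO."""
    name = os.environ.get(VERBOSITY_ENV_VAR, "info").lower()
    return _LEVELS.get(name, logging.INFO)


def _configure_root_once():
    """Attach a single handler to the library root logger, exactly once."""
    global _handler
    with _lock:
        if _handler is not None:
            return  # already configured by an earlier caller or another thread
        _handler = logging.StreamHandler()
        root = logging.getLogger(LIBRARY_NAME)
        root.addHandler(_handler)
        root.setLevel(_default_level())
        root.propagate = False  # avoid duplicate output via the global root logger


def get_logger(name=None):
    """Return a logger under the library root, configuring the root lazily.

    Callers are assumed to pass dotted names such as "mylib.trainer" so that
    the returned logger inherits its level from the library root.
    """
    _configure_root_once()
    return logging.getLogger(name or LIBRARY_NAME)


def set_verbosity(level):
    """Change the level for every logger under the library root."""
    _configure_root_once()
    logging.getLogger(LIBRARY_NAME).setLevel(level)
```

Because child loggers are left at their default level, a single `set_verbosity(logging.ERROR)` call changes the effective level of `mylib.trainer`, `mylib.data`, and any other logger under the same root. The lock ensures that concurrent first calls to `get_logger` still attach only one handler.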
Pseudo-code Logic:
# Abstract algorithm (NOT a real implementation)
# get_distributed_rank() and emit() are placeholders for framework-specific calls.
def log(message, level, allowed_ranks=(0,)):
    current_rank = get_distributed_rank()
    if current_rank in allowed_ranks:
        emit(f"[Rank {current_rank}] {message}", level)
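One way to turn the pseudo-code above into something runnable is a `logging.Filter` attached to a handler. The sketch below assumes the rank is exposed through the `RANK` environment variable (as launchers such as `torchrun` do) and defaults to 0 when it is unset. The names `get_rank`, `RankFilter`, and `make_rank_logger` are hypothetical helpers for this example only.

```python
import logging
import os


def get_rank():
    """Read the process rank from the RANK environment variable (assumed; defaults to 0)."""
    return int(os.environ.get("RANK", "0"))


class RankFilter(logging.Filter):
    """Pass records only on allowed ranks and tag them with the current rank."""

    def __init__(self, allowed_ranks=(0,)):
        super().__init__()
        self.allowed_ranks = frozenset(allowed_ranks)

    def filter(self, record):
        rank = get_rank()
        if rank not in self.allowed_ranks:
            return False  # drop the record on every other rank
        record.rank = rank  # exposed to the formatter as %(rank)s
        return True


def make_rank_logger(name, allowed_ranks=(0,)):
    """Build a logger whose handler emits only on the allowed ranks."""
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("[Rank %(rank)s] %(levelname)s %(message)s")
    )
    handler.addFilter(RankFilter(allowed_ranks))
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep output on this handler only
    return logger
```

With the default `allowed_ranks=(0,)`, only rank 0 prints. Passing `allowed_ranks=(0, 3)` would additionally surface messages from rank 3, which is the selective per-rank debugging described earlier.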