Implementation:Datajuicer Data juicer Monitor
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Core |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
A concrete tool, provided by Data-Juicer, for monitoring CPU, memory, and GPU resource utilization during data processing.
Description
The Monitor class samples and records CPU, memory, and GPU utilization while a data processing operation runs, then supports analysis (min/max/avg) and visualization of the recorded metrics over time. Monitor.monitor_func spawns a separate process via multiprocessing that samples resource metrics (CPU utilization, memory usage, GPU memory and utilization) at a configurable interval. After the monitored function completes, the samples are aggregated into a resource utilization dict. Monitor.analyze_single_resource_util computes max/min/avg statistics for each field listed in DYNAMIC_FIELDS, and Monitor.draw_resource_util_graph generates time-series plots with matplotlib.
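The sampling flow described above can be sketched with the standard library alone. This is a minimal illustration and not Data-Juicer's implementation: the names monitor_func_sketch, _sampler, and _probe are hypothetical, the "fork" start method is an assumption (Unix-like platforms only), and the probe records only a timestamp where a real monitor would read CPU, memory, and GPU metrics.

```python
import multiprocessing as mp
import time


def _probe():
    # Stand-in for a real resource probe; it only records when the sample was taken.
    return {"timestamp": time.time()}


def _sampler(shared, sample_interval):
    # Runs in the child process: take one snapshot per interval until asked to stop.
    samples = []
    while True:
        samples.append(_probe())
        if shared["stop"]:
            break
        time.sleep(sample_interval)
    # Hand the collected snapshots back to the parent through the shared dict.
    shared["resource"] = samples


def monitor_func_sketch(func, args=(), sample_interval=0.5):
    # Assumption: a Unix-like platform where the "fork" start method is available.
    ctx = mp.get_context("fork")
    with ctx.Manager() as manager:
        shared = manager.dict()
        shared["stop"] = False
        proc = ctx.Process(target=_sampler, args=(shared, sample_interval))
        proc.start()
        start = time.time()
        try:
            ret = func(*args)
        finally:
            # Always stop the sampler, even if the monitored function raises.
            elapsed = time.time() - start
            shared["stop"] = True
            proc.join()
            samples = shared["resource"]
    info = {
        "time": elapsed,
        "sampling interval": sample_interval,
        "resource": samples,
    }
    return ret, info
```

The stop flag and the collected samples both travel through a manager-backed dict, so the parent never shares mutable state with the child directly. The sampler takes a snapshot before checking the flag, which guarantees at least one sample even for very short calls.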
Usage
Use when you need runtime observability for data processing pipeline operations, including profiling resource consumption for batch size tuning or performance optimization.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/core/monitor.py
Signature
class Monitor:
    DYNAMIC_FIELDS = {
        "CPU util.", "Used mem.", "Free mem.", "Available mem.",
        "Mem. util.", "GPU free mem.", "GPU used mem.", "GPU util.",
    }
    @staticmethod
    def monitor_current_resources():
    @staticmethod
    def draw_resource_util_graph(resource_util_list, store_dir):
    @staticmethod
    def analyze_resource_util_list(resource_util_list):
    @staticmethod
    def analyze_single_resource_util(resource_util_dict):
    @staticmethod
    def monitor_func(func, args=None, sample_interval=0.5):
Import
from data_juicer.core.monitor import Monitor
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| func | callable | Yes | The function to monitor during execution |
| args | dict, list, tuple, or Any | No | Arguments to pass to the monitored function. Default: None |
| sample_interval | float | No | Sampling interval in seconds for resource probing. Default: 0.5 |
| resource_util_list | list | Yes (for analysis/drawing) | List of resource utilization dicts to analyze or visualize |
| store_dir | str | Yes (for drawing) | Directory to save resource utilization graph images |
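The types accepted for args suggest a dispatch on the argument's type. The helper below, call_with_args, is hypothetical and shows one plausible mapping: a dict becomes keyword arguments, a list or tuple becomes positional arguments, and any other value is passed as a single argument. Treating None as "no arguments" is also an assumption; consult data_juicer/core/monitor.py for the exact behavior.

```python
def call_with_args(func, args=None):
    """Call func with args unpacked according to its type (illustrative only)."""
    if args is None:
        # Assumption: no arguments supplied means the function is called bare.
        args = {}
    if isinstance(args, dict):
        return func(**args)   # keyword arguments
    if isinstance(args, (list, tuple)):
        return func(*args)    # positional arguments
    return func(args)         # a single positional argument


def add(a, b=0):
    return a + b


by_keyword = call_with_args(add, {"a": 1, "b": 2})
by_position = call_with_args(add, [1, 2])
single_value = call_with_args(add, 5)
```

Under this mapping, a function that expects one list as its only parameter would need that list wrapped in a tuple or passed by keyword in a dict, because a bare list would be unpacked into positional arguments.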
Outputs
| Name | Type | Description |
|---|---|---|
| ret | Any | Return value of the monitored function |
| resource_util_dict | dict | Dictionary containing 'time', 'sampling interval', and 'resource' (list of sampled resource snapshots) |
| resource_analysis | dict | Nested dict with max/min/avg statistics for each dynamic field |
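To make the shape of resource_analysis concrete, the sketch below computes max/min/avg per field over a list of sampled snapshots. The function summarize_samples and the sample values are hypothetical and only illustrate the kind of aggregation described; they are not taken from the Data-Juicer source.

```python
def summarize_samples(samples, fields):
    """Return {field: {"max": ..., "min": ..., "avg": ...}} for fields present in samples."""
    stats = {}
    for field in fields:
        values = [snapshot[field] for snapshot in samples if field in snapshot]
        if not values:
            # Skip fields that were never sampled (for example, GPU metrics on a CPU-only host).
            continue
        stats[field] = {
            "max": max(values),
            "min": min(values),
            "avg": sum(values) / len(values),
        }
    return stats


# Made-up snapshots using two of the field names listed in DYNAMIC_FIELDS.
example_samples = [
    {"CPU util.": 25.0, "Used mem.": 100},
    {"CPU util.": 75.0, "Used mem.": 300},
]
example_stats = summarize_samples(example_samples, ["CPU util.", "Used mem.", "GPU util."])
```

Here example_stats contains entries for "CPU util." and "Used mem." only, since no snapshot carries a "GPU util." value.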
Usage Examples
from data_juicer.core.monitor import Monitor
# Monitor a function's resource usage
def my_processing_func(dataset):
    # ... heavy processing ...
    processed_dataset = dataset  # placeholder: returns the input unchanged
    return processed_dataset
my_dataset = [{"text": "example"}]  # placeholder input for illustration
result, resource_info = Monitor.monitor_func(
    my_processing_func,
    args={"dataset": my_dataset},
    sample_interval=0.5,
)
# Analyze collected resource data
analyzed = Monitor.analyze_single_resource_util(resource_info)
print(analyzed["resource_analysis"])
# Draw resource utilization graphs
Monitor.draw_resource_util_graph([resource_info], "./resource_plots")