Implementation:Datajuicer Data juicer Execute HPO 3Sigma
| Knowledge Sources | |
|---|---|
| Domains | Tooling |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
A concrete tool, provided by Data-Juicer, for automatic hyper-parameter optimization of data recipes using the 3-sigma rule.
Description
execute_hpo_3sigma implements automatic hyper-parameter optimization for data recipes using the k-sigma statistical principle (k = 3 by default): filter bounds are placed k standard deviations from the mean of each metric's distribution. The main function first runs the Analyzer to compute statistics (mean, std) for each metric in the dataset. modify_recipe_k_sigma then sets every min_* parameter in the recipe's filter operators to mean - k*std and every max_* parameter to mean + k*std. The refined recipe is optionally saved as YAML or JSON, after which the DefaultExecutor (or RayExecutor) processes the data with the optimized parameters.
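The bound adjustment described above can be sketched without Data-Juicer itself. This is a minimal illustration, not the library's implementation: the helper apply_k_sigma, the op_to_stat mapping, and the toy recipe and statistics are all hypothetical, and the sketch assumes each recipe entry is a single-key dict of the form {op_name: {arg: value}}.

```python
import copy


def apply_k_sigma(process, mean, std, op_to_stat, k=3):
    """Return a copy of `process` with min_*/max_* args set to mean -/+ k*std.

    Hypothetical helper for illustration; it is not part of Data-Juicer.
    """
    refined = copy.deepcopy(process)
    for op in refined:
        # Each recipe entry is assumed to be {op_name: {arg_name: value}}
        (op_name, args), = op.items()
        stat_key = op_to_stat.get(op_name)
        if stat_key is None or stat_key not in mean:
            continue  # operator has no analyzed statistic, leave it untouched
        for arg_name in args:
            if arg_name.startswith("min_"):
                args[arg_name] = mean[stat_key] - k * std[stat_key]
            elif arg_name.startswith("max_"):
                args[arg_name] = mean[stat_key] + k * std[stat_key]
    return refined


# Toy recipe and statistics (illustrative values only)
process = [
    {"clean_email_mapper": {}},
    {"text_length_filter": {"min_len": 10, "max_len": 10000}},
]
mean = {"text_len": 500.0}
std = {"text_len": 100.0}
op_to_stat = {"text_length_filter": "text_len"}

refined = apply_k_sigma(process, mean, std, op_to_stat, k=3)
print(refined[1])  # {'text_length_filter': {'min_len': 200.0, 'max_len': 800.0}}
```

The sketch returns a modified copy so the original recipe stays available for comparison; operators without a matching statistic, such as the mapper in the toy recipe, pass through unchanged.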
Usage
Use this tool to tune the filter thresholds in a data recipe automatically from the statistical distribution of the data. Values more than k standard deviations from the mean are treated as outliers and removed, which assumes that each metric roughly follows a normal distribution.
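To see what the normality assumption implies for the choice of k, the fraction of a normal distribution lying within k standard deviations of the mean is erf(k / sqrt(2)). The short sketch below evaluates it for k = 1, 2, 3; the function name normal_coverage is illustrative only and not part of Data-Juicer.

```python
import math


def normal_coverage(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"k={k}: {normal_coverage(k):.4f}")
# k=1: 0.6827
# k=2: 0.9545
# k=3: 0.9973
```

With k = 3, about 0.27% of samples would fall outside the bounds if the metric were exactly normal. For skewed or heavy-tailed metrics the removed fraction can differ noticeably, so it may be worth checking the Analyzer output before relying on the refined recipe.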
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/tools/hpo/execute_hpo_3sigma.py
Signature
@logger.catch(reraise=True)
def main():

def modify_recipe_k_sigma(cfg, df, path_k_sigma_recipe, k=3):
Import
from data_juicer.tools.hpo.execute_hpo_3sigma import main, modify_recipe_k_sigma
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| cfg | Namespace | Yes | Data-Juicer configuration with process list containing filter operators |
| df | pandas.DataFrame | Yes | Analysis results DataFrame with mean and std rows for each metric |
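| path_k_sigma_recipe | str | No | Path to save the refined recipe (YAML or JSON); the recipe is saved only when a path is provided. Set via the --path_k_sigma_recipe CLI argument. The function signature declares no default, so programmatic callers must pass this argument explicitly |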
| k | int | No | Number of standard deviations for the sigma rule. Default: 3 |
Outputs
| Name | Type | Description |
|---|---|---|
| refined_recipe | file (YAML/JSON) | The recipe file with adjusted min/max parameters (if path_k_sigma_recipe is provided) |
| processed_dataset | Dataset | The dataset processed with the optimized recipe parameters |
Usage Examples
# Run from command line with an initial recipe
# python -m data_juicer.tools.hpo.execute_hpo_3sigma \
# --config initial_recipe.yaml \
# --path_k_sigma_recipe refined_recipe.yaml
# Programmatic usage of modify_recipe_k_sigma
from data_juicer.tools.hpo.execute_hpo_3sigma import modify_recipe_k_sigma
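# After running the Analyzer to get statistics.
# my_config and analysis_results are placeholders for the loaded Data-Juicer
# config and the analysis DataFrame (with mean and std rows for each metric).
modify_recipe_k_sigma(
    cfg=my_config,
    df=analysis_results,
    path_k_sigma_recipe="refined_recipe.yaml",
    k=3,
)
# Each filter's min_* is now mean - 3*std and each max_* is mean + 3*std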