Implementation:Datajuicer Data juicer Execute HPO 3Sigma

From Leeroopedia
Knowledge Sources
Domains Tooling
Last Updated 2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for automatic hyper-parameter optimization of data recipes using the 3-sigma rule.

Description

execute_hpo_3sigma implements automatic hyper-parameter optimization for data recipes using the k-sigma statistical principle (3-sigma by default): each filter bound is placed k standard deviations from the mean of the corresponding metric's distribution. The main function first runs the Analyzer to compute statistics (mean, std) for each metric in the dataset. modify_recipe_k_sigma then adjusts all min_* and max_* parameters in the recipe's filter operators, setting each min_* to mean - k*std and each max_* to mean + k*std. The refined recipe is optionally saved to YAML/JSON, and the DefaultExecutor (or RayExecutor) then processes the data with the refined parameters.
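The k-sigma adjustment described above can be sketched in a few lines. This is a simplified stand-in, not Data-Juicer's implementation: it uses only the standard library, works on a plain list of metric values rather than the Analyzer's DataFrame, and the helper names (k_sigma_bounds, refine_filter_args) and the example arguments are hypothetical.

```python
from statistics import mean, stdev


def k_sigma_bounds(values, k=3):
    """Return (mean - k*std, mean + k*std) for a list of metric values."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 denominator)
    return mu - k * sigma, mu + k * sigma


def refine_filter_args(op_args, values, k=3):
    """Return a copy of op_args with min_*/max_* keys set to the k-sigma bounds."""
    lower, upper = k_sigma_bounds(values, k)
    refined = dict(op_args)
    for name in op_args:
        if name.startswith("min_"):
            refined[name] = lower
        elif name.startswith("max_"):
            refined[name] = upper
    return refined


# Hypothetical per-sample values of one metric (for example, a length statistic).
text_lengths = [90, 100, 110, 95, 105]

# Hypothetical filter arguments; only the min_*/max_* entries are touched.
args = {"min_len": 10, "max_len": 10000, "text_key": "text"}

refined = refine_filter_args(args, text_lengths, k=3)
print(refined)
```

Arguments that are not thresholds (here text_key) pass through unchanged, so the refined arguments can replace the originals directly.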

Usage

Use this tool when you want to tune the filter thresholds of a data recipe automatically from the statistical distribution of the data. It removes outliers on the assumption that each metric is approximately normally distributed, in which case about 99.7% of samples fall within 3 standard deviations of the mean.
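To get a feel for how the choice of k trades off strictness against retained data, the fraction of a normal distribution lying within k standard deviations of the mean can be computed as erf(k / sqrt(2)). The sketch below is illustrative only; normal_coverage is a hypothetical helper, and the figures hold only to the extent that a metric really is normally distributed.

```python
import math


def normal_coverage(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


# Coverage for the usual 1-, 2-, and 3-sigma bands.
for k in (1, 2, 3):
    print(f"k={k}: {normal_coverage(k):.4f}")
```

A smaller k gives tighter bounds and filters out more samples; skewed or heavy-tailed metrics will lose more data than these fractions suggest.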

Code Reference

Source Location

Signature

@logger.catch(reraise=True)
def main():

def modify_recipe_k_sigma(cfg, df, path_k_sigma_recipe, k=3):

Import

from data_juicer.tools.hpo.execute_hpo_3sigma import main, modify_recipe_k_sigma

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| cfg | Namespace | Yes | Data-Juicer configuration with process list containing filter operators |
| df | pandas.DataFrame | Yes | Analysis results DataFrame with mean and std rows for each metric |
| path_k_sigma_recipe | str | No | Path to save the refined recipe (YAML or JSON). Passed via --path_k_sigma_recipe CLI arg |
| k | int | No | Number of standard deviations for the sigma rule. Default: 3 |

Outputs

| Name | Type | Description |
|---|---|---|
| refined_recipe | file (YAML/JSON) | The recipe file with adjusted min/max parameters (if path_k_sigma_recipe is provided) |
| processed_dataset | Dataset | The dataset processed with the optimized recipe parameters |

Usage Examples

# Run from command line with an initial recipe
# python -m data_juicer.tools.hpo.execute_hpo_3sigma \
#   --config initial_recipe.yaml \
#   --path_k_sigma_recipe refined_recipe.yaml

# Programmatic usage of modify_recipe_k_sigma
from data_juicer.tools.hpo.execute_hpo_3sigma import modify_recipe_k_sigma

# After running Analyzer to get statistics.
# my_config and analysis_results are placeholders: the Data-Juicer config
# (Namespace) and the analysis results DataFrame, respectively.
modify_recipe_k_sigma(
    cfg=my_config,
    df=analysis_results,
    path_k_sigma_recipe="refined_recipe.yaml",
    k=3,
)
# All min_*/max_* parameters in filters are now set to mean -/+ 3*std
