Heuristic:Scikit learn Scikit learn Random State Management

From Leeroopedia




Knowledge Sources
Domains Machine_Learning, Reproducibility
Last Updated 2026-02-08 15:00 GMT

Overview

Guidelines for choosing between integer seeds and RandomState instances to balance reproducibility and robustness in cross-validation and model training.

Description

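Scikit-learn estimators and splitters accept `random_state` as an integer, a `numpy.random.RandomState` instance, or `None`. The choice has subtle but significant implications for reproducibility, cross-validation robustness, and model cloning behavior. An integer re-seeds the RNG on every call to `fit` or `split`, so each call repeats the same randomness (reproducible, but identical across folds). A RandomState instance is consumed from call to call, so each call sees different randomness. That makes cross-validation results for estimators more robust, but a splitter given an instance produces different folds on each `split` call, which makes comparing estimators across those calls invalid. `None` draws from NumPy's global RNG and is not reproducible across runs.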
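The splitter side of this behavior can be checked with a minimal sketch. The toy data and the helper name `fold_test_indices` are illustrative assumptions, not part of scikit-learn; only `KFold` and its `random_state` parameter come from the library.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(40).reshape(20, 2)  # toy data: 20 samples, 2 features


def fold_test_indices(cv):
    """Return the test indices of every fold from one call to cv.split (hypothetical helper)."""
    return [test.tolist() for _, test in cv.split(X)]


# Integer seed: the splitter re-seeds on every split() call,
# so two calls yield the same folds.
cv_int = KFold(n_splits=5, shuffle=True, random_state=0)
same_folds = fold_test_indices(cv_int) == fold_test_indices(cv_int)

# RandomState instance: each split() call consumes the same RNG object,
# so the second call shuffles differently.
cv_rng = KFold(n_splits=5, shuffle=True, random_state=np.random.RandomState(0))
changed_folds = fold_test_indices(cv_rng) != fold_test_indices(cv_rng)

print("int seed repeats folds:", same_folds)
print("RandomState instance changes folds:", changed_folds)
```

If two estimators were each evaluated with a separate `split` call on `cv_rng`, they would be scored on different folds, which is why an integer is the safer choice for splitters.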

Usage

Apply this heuristic when configuring any estimator with stochastic behavior (RandomForestClassifier, LogisticRegression with SAG/SAGA solver, GradientBoostingClassifier) or when setting up cross-validation (KFold, StratifiedKFold, train_test_split). It is critical for Train_Test_Split, KFold_Init, RandomForestClassifier_Init, and Cross_Validate implementations.

The Insight (Rule of Thumb)

  • Action: Pass an integer to CV splitters for consistent fold assignments. Pass a RandomState instance or `None` to estimators for robustness.
  • Value: `rng = np.random.RandomState(42)` at program start; pass `rng` to estimators, `42` to CV splitters.
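  • Trade-off: Integer seeds make every `fit` or `split` call repeat the same randomness, which is fully reproducible but means all folds share identical random behavior. A seeded RandomState instance varies the randomness from call to call (more realistic) and stays reproducible across runs only as long as the sequence of calls is unchanged. `None` is not reproducible across runs.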
  • Anti-pattern: Avoid `np.random.seed()` in application code. It sets NumPy's global RNG state, which any other code in the process can consume or reseed, so the resulting randomness is hard to control.
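The rule of thumb above can be sketched as follows. The dataset, the `SEED` constant, and the choice of `RandomForestClassifier` with 20 trees are illustrative assumptions; the point is only where the instance goes and where the integer goes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

SEED = 42  # illustrative value
rng = np.random.RandomState(SEED)  # created once, at program start

# Synthetic data, for demonstration only.
X, y = make_classification(n_samples=200, n_features=10, random_state=SEED)

# Estimator: receives the RandomState instance.
clf = RandomForestClassifier(n_estimators=20, random_state=rng)

# CV splitter: receives the integer, so fold assignments stay fixed.
cv = KFold(n_splits=5, shuffle=True, random_state=SEED)

scores = cross_val_score(clf, X, y, cv=cv)
print("mean accuracy: %.3f, std: %.3f" % (scores.mean(), scores.std()))
```

Because the splitter is seeded with an integer, a second estimator evaluated with the same `cv` object is scored on the same folds.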

Reasoning

When an integer is passed as `random_state`, the estimator creates a new RandomState from that integer on every `fit()` call, so every call repeats the same random sequence. All cross-validation folds therefore get exactly the same random behavior, which can hide instabilities. When a RandomState instance is passed, the same RNG object advances its state across calls, producing different random sequences on successive calls. Cloning is affected as well: per the scikit-learn documentation, when estimators are cloned (as in GridSearchCV), an estimator holding a RandomState instance and its clone are not exact copies, and estimators that share the same RNG object influence each other.
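The difference between the two modes on repeated `fit()` calls can be observed directly. This is a minimal sketch on synthetic data; the helper `importances_after_two_fits` is hypothetical, and feature importances are used only as a convenient fingerprint of the fitted forest.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, for demonstration only.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)


def importances_after_two_fits(random_state):
    """Fit the same estimator twice and return both importance vectors (hypothetical helper)."""
    clf = RandomForestClassifier(n_estimators=10, random_state=random_state)
    first = clf.fit(X, y).feature_importances_.copy()
    second = clf.fit(X, y).feature_importances_.copy()
    return first, second


# Integer: each fit() starts from a freshly seeded RNG.
int_first, int_second = importances_after_two_fits(0)
int_repeats = np.allclose(int_first, int_second)

# RandomState instance: the second fit() continues from the advanced RNG.
rng_first, rng_second = importances_after_two_fits(np.random.RandomState(0))
rng_repeats = np.allclose(rng_first, rng_second)

print("int seed, two fits identical:", int_repeats)
print("RandomState instance, two fits identical:", rng_repeats)
```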

From `doc/common_pitfalls.rst`:

"If an int, random_state is the seed used by the random number generator ... each call to fit resets it."

"If a RandomState instance, random_state is the random number generator ... each call to fit starts from a different state."
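Both behaviors follow from how a `random_state` argument is turned into an RNG. The sketch below uses the public helper `sklearn.utils.check_random_state` to illustrate this; that every estimator resolves `random_state` through this helper inside `fit` is an assumption based on scikit-learn's conventions, not something shown in this page.

```python
import numpy as np
from sklearn.utils import check_random_state

HIGH = 2**31 - 1  # upper bound for the illustrative draws

# Integer: a fresh RandomState seeded with that integer is built each time,
# so the first draw is always the same.
draw_a = check_random_state(42).randint(HIGH)
draw_b = check_random_state(42).randint(HIGH)

# RandomState instance: the very same object is returned,
# so successive draws advance its state.
rng = np.random.RandomState(42)
same_object = check_random_state(rng) is rng
draw_c = check_random_state(rng).randint(HIGH)
draw_d = check_random_state(rng).randint(HIGH)

print("int seed repeats:", draw_a == draw_b)
print("instance returned as-is:", same_object)
print("instance advances:", draw_c != draw_d)
```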

Code Evidence

SKLEARN_SEED environment variable from `sklearn/__init__.py:138-150`:

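# Module-level imports that the excerpt relies on.
import os
import random


def setup_module(module):
    """Fixture for the tests to assure globally controllable seeding of RNGs"""
    import numpy as np

    # Use the seed from the environment if set; otherwise draw a random one.
    _random_seed = os.environ.get("SKLEARN_SEED", None)
    if _random_seed is None:
        _random_seed = np.random.uniform() * np.iinfo(np.int32).max
    _random_seed = int(_random_seed)
    print("I: Seeding RNGs with %r" % _random_seed)
    # Seed both NumPy's global RNG and Python's built-in RNG.
    np.random.seed(_random_seed)
    random.seed(_random_seed)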

Global random seed test fixture from `sklearn/conftest.py:293`:

from os import environ  # import that the excerpt relies on

random_seed_var = environ.get("SKLEARN_TESTS_GLOBAL_RANDOM_SEED")
# Valid values: "42" (fixed), "40-42" (range), "all" (0-99 inclusive)
# Default: 42 for determinism
