Heuristic:Scikit learn Scikit learn Random State Management
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Reproducibility |
| Last Updated | 2026-02-08 15:00 GMT |
Overview
Guidelines for choosing between integer seeds and RandomState instances to balance reproducibility and robustness in cross-validation and model training.
Description
Scikit-learn estimators and splitters accept `random_state` as an integer, a `numpy.random.RandomState` instance, or `None`. The choice has subtle but critical implications for reproducibility, cross-validation robustness, and model cloning behavior. An integer re-creates the RNG at each call (reproducible, but identical randomness replayed across folds), while a RandomState instance advances its state on each call (more robust CV estimates, but fold-to-fold differences then mix data variation with RNG variation).
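The two behaviors can be sketched with plain NumPy; the functions below are illustrative stand-ins for an estimator's `fit()`, not scikit-learn API:

```python
import numpy as np

# Int seed: a fresh RandomState is built on every call, so the
# same random sequence is replayed each time (like fit() with an int).
def draw_with_int_seed(seed):
    rng = np.random.RandomState(seed)
    return rng.randint(0, 100, size=3)

# RandomState instance: the one shared object advances its state,
# so successive calls see different randomness.
shared_rng = np.random.RandomState(42)

def draw_with_instance(rng):
    return rng.randint(0, 100, size=3)

a = draw_with_int_seed(42)
b = draw_with_int_seed(42)
c = draw_with_instance(shared_rng)
d = draw_with_instance(shared_rng)

print(bool((a == b).all()))  # True: the int seed replays identical randomness
print(bool((c == d).all()))  # False: the shared instance has advanced
```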
Usage
Apply this heuristic when configuring any estimator with stochastic behavior (RandomForestClassifier, LogisticRegression with SAG/SAGA solver, GradientBoostingClassifier) or when setting up cross-validation (KFold, StratifiedKFold, train_test_split). It is critical for Train_Test_Split, KFold_Init, RandomForestClassifier_Init, and Cross_Validate implementations.
The Insight (Rule of Thumb)
- Action: Pass an integer to CV splitters for consistent fold assignments. Pass a RandomState instance or `None` to estimators for robustness.
- Value: `rng = np.random.RandomState(42)` at program start; pass `rng` to estimators, `42` to CV splitters.
- Trade-off: Integer seeds give exact reproducibility but replay the same randomness in every fold; RandomState instances give varying randomness (a more realistic robustness estimate), but results are then reproducible only if the overall order of draws from the instance is unchanged between runs.
- Anti-pattern: Never use `np.random.seed()` — it mutates NumPy's global RNG state, so any other code drawing from that global state silently changes your results (and vice versa).
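The recommended split can be sketched as follows; this is a minimal illustration assuming scikit-learn is installed, with a synthetic dataset and arbitrary hyperparameters:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# One RNG created at program start, shared by stochastic estimators.
rng = np.random.RandomState(42)

# Integer seed for the CV splitter: fold assignments stay reproducible.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# RandomState instance for the estimator: each fold's fit() draws
# fresh randomness instead of replaying the same stream.
clf = RandomForestClassifier(n_estimators=20, random_state=rng)

scores = cross_val_score(clf, X, y, cv=cv)
print(len(scores), scores.mean())
```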
Reasoning
When an integer is passed as `random_state`, the estimator creates a new RandomState from that integer on every `fit()` call, producing identical randomness. This means all cross-validation folds get exactly the same random behavior, which can hide instabilities. When a RandomState instance is passed, the same RNG object advances its state across calls, producing different random sequences for each fold. Additionally, when scikit-learn clones estimators (as in GridSearchCV), cloned objects with a RandomState instance share the same RNG object and influence each other.
From `doc/common_pitfalls.rst`:
"If an int, random_state is the seed used by the random number generator ... each call to fit resets it."
"If a RandomState instance, random_state is the random number generator ... each call to fit starts from a different state."
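The shared-state effect described above — estimators that hold the same RandomState instance consume each other's randomness — can be demonstrated with plain NumPy (the helper below is an illustrative stand-in for `fit()`):

```python
import numpy as np

rng = np.random.RandomState(0)

def fit_like_draw(random_state, n=3):
    # Stand-in for an estimator's fit() drawing from its random_state.
    return random_state.randint(0, 100, size=n)

first = fit_like_draw(rng)   # advances the shared state
second = fit_like_draw(rng)  # continues where the first call left off

# A fresh RNG seeded identically replays both draws as one sequence,
# showing the two calls consumed a single shared stream.
replay = np.random.RandomState(0).randint(0, 100, size=6)
print(np.array_equal(np.concatenate([first, second]), replay))  # True
```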
Code Evidence
SKLEARN_SEED environment variable from `sklearn/__init__.py:138-150`:
def setup_module(module):
    """Fixture for the tests to assure globally controllable seeding of RNGs"""
    import numpy as np

    # `os` and `random` are imported at the top of sklearn/__init__.py
    _random_seed = os.environ.get("SKLEARN_SEED", None)
    if _random_seed is None:
        _random_seed = np.random.uniform() * np.iinfo(np.int32).max
    _random_seed = int(_random_seed)
    print("I: Seeding RNGs with %r" % _random_seed)
    np.random.seed(_random_seed)
    random.seed(_random_seed)
Global random seed test fixture from `sklearn/conftest.py:293`:
random_seed_var = environ.get("SKLEARN_TESTS_GLOBAL_RANDOM_SEED")
# Valid values: "42" (fixed), "40-42" (range), "all" (0-99 inclusive)
# Default: 42 for determinism
Related Pages
- Implementation:Scikit_learn_Scikit_learn_Train_Test_Split
- Implementation:Scikit_learn_Scikit_learn_KFold_Init
- Implementation:Scikit_learn_Scikit_learn_RandomForestClassifier_Init
- Implementation:Scikit_learn_Scikit_learn_LogisticRegression_Init
- Implementation:Scikit_learn_Scikit_learn_Cross_Validate
- Principle:Scikit_learn_Scikit_learn_Train_Test_Splitting
- Principle:Scikit_learn_Scikit_learn_KFold_Splitting