
This rule raises an issue when a random number generator is created or used without a seed.
== Why is this an issue?
Data science and machine learning tasks make extensive use of random number generation. It may, for example, be used for:
* Model initialization
** Randomness is used to initialize the parameters of machine learning models. Initializing parameters with random values helps to break symmetry between them and reduces the risk of training getting stuck in a poor local optimum. By providing a random starting point, the model can explore different regions of the parameter space and potentially find better solutions.
* Regularization techniques
** Randomness is used to introduce noise into the learning process. Techniques like dropout and data augmentation randomly drop or modify features or samples during training. This helps to regularize the model, reduce overfitting, and improve generalization performance.
* Cross-validation and bootstrapping
** Randomness is often used in techniques like cross-validation, where data is split into multiple subsets. By using a predictable seed, the same data splits can be generated, allowing for fair and consistent model evaluation.
* Hyperparameter tuning
** Many machine learning algorithms have hyperparameters that need to be tuned for optimal performance. Randomness is often used in techniques like random search or Bayesian optimization to explore the hyperparameter space. By using a fixed seed, the same set of hyperparameters can be explored, making the tuning process more controlled and reproducible.
* Simulation and synthetic data generation
** Randomness is often used in techniques such as data augmentation and synthetic data generation to generate diverse and realistic datasets.
To ensure that results are reproducible, it is important to use a predictable seed in this context.
The preferred way to do this in `numpy` is by instantiating a `Generator` object, typically through `numpy.random.default_rng`, which should be provided with a seed parameter.
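
As a minimal sketch of the seeded `Generator` approach, the following creates two generators from the same seed and checks that they produce the same shuffle. The seed value `42` and the helper name `shuffled_indices` are illustrative assumptions, not part of the rule.

```python
import numpy as np

SEED = 42  # arbitrary illustrative value; any fixed integer works


def shuffled_indices(n_samples, seed):
    """Return a permutation of range(n_samples) drawn from a seeded Generator."""
    rng = np.random.default_rng(seed)  # local generator, no global state involved
    return rng.permutation(n_samples)


first = shuffled_indices(10, SEED)
second = shuffled_indices(10, SEED)

# Both calls start from the same seed, so the two permutations are identical.
same = bool(np.array_equal(first, second))
```

Because the generator is a local object, the seed only affects code that receives `rng` explicitly, which keeps the source of randomness visible at each call site.
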
Note that a global seed for `RandomState` can be set using `numpy.random.seed`. This sets the seed for the legacy module-level functions such as `numpy.random.randn`. This approach is, however, considered legacy, and `Generator` should be used instead. This is reported by rule S6711.
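
For comparison, here is a hedged sketch of the two styles side by side, assuming an arbitrary seed of `0`. Both are reproducible, but the first relies on hidden global state while the second passes the generator around explicitly.

```python
import numpy as np

# Legacy style: seeds the hidden global RandomState shared by numpy.random.* functions.
np.random.seed(0)
legacy_a = np.random.randn(3)
np.random.seed(0)
legacy_b = np.random.randn(3)

# Preferred style: each Generator carries its own seeded state.
rng_a = np.random.default_rng(0)
rng_b = np.random.default_rng(0)
modern_a = rng_a.standard_normal(3)
modern_b = rng_b.standard_normal(3)

# Each style reproduces its own sequence when reseeded with the same value.
legacy_reproducible = bool(np.array_equal(legacy_a, legacy_b))
modern_reproducible = bool(np.array_equal(modern_a, modern_b))
```

Note that the two styles use different underlying algorithms, so the same seed is not expected to yield the same numbers across them.
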
=== Exception
In contexts that are not related to data science and machine learning, having a predictable seed may not be the desired behavior. Therefore, this rule will only raise issues if machine learning and data science libraries are being used.
// How to fix it section
include::how-to-fix-it/numpy.adoc[]
include::how-to-fix-it/sklearn.adoc[]
== Resources
=== Documentation
* NumPy documentation - https://numpy.org/neps/nep-0019-rng-policy.html[NEP 19 RNG Policy]
* Scikit-learn documentation - https://scikit-learn.org/stable/glossary.html#term-random_state[Glossary random_state]
* Scikit-learn documentation - https://scikit-learn.org/stable/common_pitfalls.html#controlling-randomness[Controlling randomness]
=== Standards
* STIG Viewer - https://stigviewer.com/stig/application_security_and_development/2023-06-08/finding/V-222642[Application Security and Development: V-222642] - The application must not contain embedded authentication data.
=== Related rules
* S6711 - `numpy.random.Generator` should be preferred to `numpy.random.RandomState`