Stability selection

The documentation of the stability_selection module.

This module contains a scikit-learn compatible implementation of stability selection [R8a58b3c21514-1].

References

[R8a58b3c21514-1] Meinshausen, N. and Bühlmann, P., 2010. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), pp.417-473.
[R8a58b3c21514-2] Shah, R.D. and Samworth, R.J., 2013. Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1), pp.55-80.
class stability_selection.stability_selection.StabilitySelection(base_estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='warn', n_jobs=None, penalty='l1', random_state=None, solver='warn', tol=0.0001, verbose=0, warm_start=False), lambda_name='C', lambda_grid=array([1.00000000e-05, 1.33352143e-05, 1.77827941e-05, 2.37137371e-05, 3.16227766e-05, 4.21696503e-05, 5.62341325e-05, 7.49894209e-05, 1.00000000e-04, 1.33352143e-04, 1.77827941e-04, 2.37137371e-04, 3.16227766e-04, 4.21696503e-04, 5.62341325e-04, 7.49894209e-04, 1.00000000e-03, 1.33352143e-03, 1.77827941e-03, 2.37137371e-03, 3.16227766e-03, 4.21696503e-03, 5.62341325e-03, 7.49894209e-03, 1.00000000e-02]), n_bootstrap_iterations=100, sample_fraction=0.5, threshold=0.6, bootstrap_func=<function bootstrap_without_replacement>, bootstrap_threshold=None, verbose=0, n_jobs=1, pre_dispatch='2*n_jobs', random_state=None)[source]

Stability selection [1] fits the estimator base_estimator on bootstrap samples of the original data set, for different values of the regularization parameter for base_estimator. Variables that reliably get selected by the model in these bootstrap samples are considered to be stable variables.
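
The procedure described above can be sketched in plain Python. This is only an illustration of the idea, not the module's implementation: the correlation-based selector standing in for base_estimator, the helper names, and the toy data are all assumptions made for this example.

```python
import math
import random


def abs_correlation(x, y):
    """Absolute Pearson correlation of two equally long lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return abs(cov / (sx * sy))


def select_features(X, y, cutoff):
    """Toy stand-in for a base estimator: keep features with |corr| > cutoff."""
    n_features = len(X[0])
    return [abs_correlation([row[j] for row in X], y) > cutoff
            for j in range(n_features)]


def stability_scores(X, y, cutoffs, n_iterations=50, sample_fraction=0.5, seed=0):
    """Fraction of subsamples in which each feature is selected, per cutoff."""
    rng = random.Random(seed)
    n_samples, n_features = len(X), len(X[0])
    n_sub = int(sample_fraction * n_samples)
    counts = [[0] * len(cutoffs) for _ in range(n_features)]
    for _ in range(n_iterations):
        # Subsample without replacement, then run the selector on the subsample.
        idx = rng.sample(range(n_samples), n_sub)
        X_sub = [X[i] for i in idx]
        y_sub = [y[i] for i in idx]
        for k, cutoff in enumerate(cutoffs):
            for j, kept in enumerate(select_features(X_sub, y_sub, cutoff)):
                counts[j][k] += kept
    return [[c / n_iterations for c in row] for row in counts]


# Feature 0 equals the target; feature 1 is an alternating pattern.
y = [float(i) for i in range(20)]
X = [[y[i], (-1.0) ** i] for i in range(20)]
scores = stability_scores(X, y, cutoffs=[0.5, 0.9])
```

The result has one row per feature and one column per grid value, matching the shape documented for stability_scores_. The feature that equals the target is selected in every subsample, so its scores are all 1.0.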

Parameters:
base_estimator : object.

The base estimator used for stability selection. The estimator must have either a feature_importances_ or coef_ attribute after fitting.

lambda_name : str.

The name of the penalization parameter for the estimator base_estimator.

lambda_grid : array-like.

Grid of values of the penalization parameter to iterate over.

n_bootstrap_iterations : integer.

Number of bootstrap samples to create.

sample_fraction : float, optional

The fraction of samples to be used in each bootstrap sample. Should be between 0 and 1. If 1, all samples are used.

threshold : float.

Threshold defining the minimum cutoff value for the stability scores.

bootstrap_func : str or callable (default=bootstrap_without_replacement)
The function used to subsample the data. This parameter can be:
  • A string, which must be one of
    • ‘subsample’: For subsampling without replacement.
    • ‘complementary_pairs’: For complementary pairs subsampling [2] .
    • ‘stratified’: For stratified bootstrapping in imbalanced
      classification.
  • A function that takes y and a random state as inputs and returns a list of sample indices, each between 0 and len(y)-1 inclusive. By default, indices are uniformly subsampled.
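
As a rough illustration of the two sampling schemes named above, the sketch below draws indices without replacement and builds a complementary pair of disjoint halves. It uses the standard-library random.Random as a stand-in for the random state, and the function names and the n_subsamples argument are assumptions made for this example; a callable intended for this module would have to follow the signature the module actually expects.

```python
import random


def subsample_without_replacement(y, n_subsamples, rng):
    """Return n_subsamples distinct indices into y, drawn uniformly."""
    return sorted(rng.sample(range(len(y)), n_subsamples))


def complementary_pair(y, rng):
    """Split the indices of y into two disjoint subsets of size len(y) // 2."""
    indices = list(range(len(y)))
    rng.shuffle(indices)
    half = len(indices) // 2
    return sorted(indices[:half]), sorted(indices[half:2 * half])


y = [0, 1] * 5                      # ten hypothetical labels
rng = random.Random(0)              # stand-in for a random state
idx = subsample_without_replacement(y, 5, rng)
first, second = complementary_pair(y, rng)
```

Every index returned lies between 0 and len(y)-1, and the two halves of a complementary pair never share an index.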
bootstrap_threshold : string or float, optional (default=None)

The threshold value to use for feature selection. Features whose importance is greater than or equal to the threshold are kept, while the others are discarded. If “median” (resp. “mean”), the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., “1.25*mean”) may also be used. If None and the estimator has a parameter penalty set to l1, either explicitly or implicitly (e.g., Lasso), the threshold used is 1e-5. Otherwise, “mean” is used by default.
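
The string forms of this threshold can be read as follows. The helper below is a hypothetical sketch written for this page, not code from the module; it only covers the “mean”, “median”, scaled, and numeric cases and leaves out the None default.

```python
from statistics import mean, median


def resolve_threshold(importances, threshold):
    """Turn 'mean', 'median', '1.25*mean' or a number into a float cutoff."""
    if isinstance(threshold, str):
        if "*" in threshold:
            scale_text, reference = threshold.split("*")
            scale = float(scale_text.strip())
        else:
            scale, reference = 1.0, threshold
        reference = reference.strip()
        if reference == "mean":
            base = mean(importances)
        elif reference == "median":
            base = median(importances)
        else:
            raise ValueError("expected 'mean' or 'median', got %r" % reference)
        return scale * base
    return float(threshold)


importances = [0.0, 0.25, 2.0]
cutoff = resolve_threshold(importances, "1.25*mean")
# Features whose importance is greater than or equal to the cutoff are kept.
kept = [value >= cutoff for value in importances]
```

With these importances the mean is 0.75, so “1.25*mean” resolves to 0.9375 and only the last feature is kept.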

verbose : integer.

Controls the verbosity: the higher, the more messages.

n_jobs : int, default=1

Number of jobs to run in parallel.

pre_dispatch : int, or string, optional

Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:

  • None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs
  • An int, giving the exact number of total jobs that are spawned
  • A string, giving an expression as a function of n_jobs, as in ‘2*n_jobs’
random_state : int, RandomState instance or None, optional, default=None

Pseudo-random number generator state used to draw the bootstrap samples. If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

References

[1] Meinshausen, N. and Bühlmann, P., 2010. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), pp.417-473.
[2] Shah, R.D. and Samworth, R.J., 2013. Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1), pp.55-80.
Attributes:
stability_scores_ : array, shape = [n_features, n_alphas]

Array of stability scores, one for each feature at each value of the penalization parameter in lambda_grid.

fit(X, y)[source]

Fit the stability selection model on the given data.

Parameters:
X : {array-like, sparse matrix}, shape = [n_samples, n_features]

The training input samples.

y : array-like, shape = [n_samples]

The target values.

get_support(indices=False, threshold=None)[source]

Get a mask, or integer index, of the selected features.

Parameters:
indices : boolean (default False)

If True, the return value will be an array of integers, rather than a boolean mask.

threshold : float

Threshold defining the minimum cutoff value for the stability scores.

Returns:
support : array

An index that selects the retained features from a feature vector. If indices is False, this is a boolean array of shape [# input features], in which an element is True iff its corresponding feature is selected for retention. If indices is True, this is an integer array of shape [# output features] whose values are indices into the input feature vector.
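
The two return forms can be illustrated with a small stand-alone function. This sketch assumes a feature is retained when its highest stability score across the grid exceeds the threshold; that rule, the strict comparison, and the example scores are assumptions for illustration rather than a statement of how the method is implemented.

```python
def get_support(stability_scores, threshold, indices=False):
    """Return a boolean mask or integer indices of the retained features."""
    # Assumption: compare each feature's best score over the grid with the
    # threshold, using a strict inequality.
    mask = [max(row) > threshold for row in stability_scores]
    if indices:
        return [j for j, kept in enumerate(mask) if kept]
    return mask


# Rows are features, columns are values of the penalization parameter.
scores = [
    [0.10, 0.70, 0.90],
    [0.20, 0.30, 0.40],
    [0.00, 0.65, 0.50],
]
mask = get_support(scores, threshold=0.6)
selected = get_support(scores, threshold=0.6, indices=True)
```

Here the mask has one entry per input feature, while the index form has one entry per retained feature.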

transform(X, threshold=None)[source]

Reduce X to the selected features.

Parameters:
X : array of shape [n_samples, n_features]

The input samples.

threshold : float

Threshold defining the minimum cutoff value for the stability scores.

Returns:
X_r : array of shape [n_samples, n_selected_features]

The input samples with only the selected features.
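
Reducing X amounts to keeping the columns marked in the support mask. A minimal plain-Python sketch, with a hypothetical helper name and made-up data:

```python
def reduce_columns(X, mask):
    """Keep only the columns of X whose mask entry is True."""
    return [[value for value, kept in zip(row, mask) if kept] for row in X]


X = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]
mask = [True, False, True]          # e.g. a support mask for three features
X_r = reduce_columns(X, mask)
```

The number of rows is unchanged, and the number of columns equals the number of selected features.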

stability_selection.stability_selection.plot_stability_path(stability_selection, threshold_highlight=None, **kwargs)[source]

Plots the stability path of a fitted StabilitySelection instance.

Parameters:
stability_selection : StabilitySelection

Fitted instance of StabilitySelection.

threshold_highlight : float

Threshold defining the cutoff for the stability scores for the variables that need to be highlighted.

kwargs : dict

Arguments passed to matplotlib plot function.