# Mutual information-based feature selection

Although model selection plays an important role in learning a signal from some input data, it is arguably even more important to give the algorithm the right input data. When building a model, the first step for a data scientist is typically to construct relevant features by doing appropriate feature engineering. The resulting data set, which is typically high-dimensional, can then be used as input for a statistical learner.

Although we’d like to think of these learners as smart, and sophisticated, algorithms, they can be fooled by all the weird dependencies present in your data. A data scientist has to make the signal as easily identifiable as possible for the model to learn it. In practice, this means that feature selection is an important preprocessing step. Feature selection helps to zone in on the relevant variables in a data set, and can also help to eliminate collinear variables. It helps reduce the noise in the data set, and it helps the model pick up the relevant signals.

## Filter methods

In the above setting, we typically have a high dimensional data matrix $X \in \mathbb{R}^{n \times p}$, and a target variable $y$ (discrete or continuous). A feature selection algorithm will select a subset of $% $ columns, $X_S \in \mathbb{R}^{n \times k}$, that are most relevant to the target variable $y$.

In general, we can divide feature selection algorithms as belonging to one of three classes:

1. Wrapper methods use learning algorithms on the original data $X$, and selects relevant features based on the (out-of-sample) performance of the learning algorithm. Training a random forest on the data $(X, y)$, and selecting relevant features based on the feature importances would be an example of a wrapper model.
2. Filter methods do not use a learning algorithm on the original data $X$, but only consider statistical characteristics of the input data. For example, we can select the features for which the correlation between the feature and the target variable exceeds a correlation threshold.
3. Embedded methods are a catch-all group of techniques which perform feature selection as part of the model construction process. The LASSO is an example of an embedded method.

In this blog post I will focus on filter methods, and in particular I’ll look at filter methods that use an entropy measure called mutual information to assess which features should be included in the reduced data set $X_S$. The resulting criterion results in an NP-hard optimisation problem, and I’ll discuss several ways in which we can try to find optimal solutions to the problem.

# Pydata London 2017 and hyperopt

Last week I attended the PyData London conference, where I gave a talk about Bayesian optimization. The talk was based on my previous post on using scikit-learn to implement these kind of algorithms. The main points I wanted to get across in my talk were

1. How the Bayesian optimization algorithm works; and
2. How the algorithm can be used in your day-to-day work.

I have uploaded the slides to GitHub, and you can find them here.

It seems that there is interest in using these optimization methods, but that there are still a lot of difficulties in properly applying these algorithms. Especially the fact that you need to tune the optimization algorithm itself makes it non-trivial to apply these successfully in practice.

## Sequential model-based algorithms (SMBOs) using hyperopt

In my talk, there was not a lot of time to dive into some of the more production-ready software packages for sequential model-based optimization algorithms. Lately, I spent some time working with the package hyperopt1, and its API is actually easy to use. It is also straightforward to make hyperopt work with scikit-learn estimators. By treating the model type as a hyperparameter, we can even build an optimization that not only optimizes the hyperparameters of a model, but also the type of model itself.

1. J. Bergstra, D. Yamins, and D. D. Cox. Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures., ICML (1) 28 (2013): 115-123., https://www.jmlr.org/proceedings/papers/v28/bergstra13.pdf.

# Bayesian optimization with scikit-learn

Choosing the right parameters for a machine learning model is almost more of an art than a science. Kaggle competitors spend considerable time on tuning their model in the hopes of winning competitions, and proper model selection plays a huge part in that. It is remarkable then, that the industry standard algorithm for selecting hyperparameters, is something as simple as random search.

The strength of random search lies in its simplicity. Given a learner $\mathcal{M}$, with parameters $\mathbf{x}$ and a loss function $f$, random search tries to find $\mathbf{x}$ such that $f$ is maximized, or minimized, by evaluating $f$ for randomly sampled values of $\mathbf{x}$. This is an embarrassingly parallel algorithm: to parallelize it, we simply start a grid search on each machine separately.

This algorithm works well enough, if we can get samples from $f$ cheaply. However, when you are training sophisticated models on large data sets, it can sometimes take on the order of hours, or maybe even days, to get a single sample from $f$. In those cases, can we do any better than random search? It seems that we should be able to use past samples of $f$, to determine for which values of $\mathbf{x}$ we are going to sample $f$ next.

## Bayesian optimization

There is actually a whole field dedicated to this problem, and in this blog post I’ll discuss a Bayesian algorithm for this problem. I’ll go through some of the fundamentals, whilst keeping it light on the maths, and try to build up some intuition around this framework. Finally, we’ll apply this algorithm on a real classification problem using the popular Python machine learning toolkit scikit-learn. If you’re not interested in the theory behind the algorithm, you can skip straight to the code, and example, by clicking here.