# Mutual information-based feature selection

07 Oct 2017

Although model selection plays an important role in learning a signal from some input data, it is arguably even more important to give the algorithm the *right* input data. When building a model, the first step for a data scientist is typically to construct relevant features by doing appropriate feature engineering. The resulting data set, which is typically high-dimensional, can then be used as input for a statistical learner.

Although we’d like to think of these learners as smart and sophisticated algorithms, they can be fooled by all the weird dependencies present in your data. A data scientist has to make the signal as easily identifiable as possible for the model to learn it. In practice, this means that **feature selection** is an important preprocessing step. Feature selection helps to home in on the relevant variables in a data set, and can also help to eliminate collinear variables. It reduces the noise in the data set, which makes it easier for the model to pick up the relevant signals.

## Filter methods

In the above setting, we typically have a high-dimensional data matrix and a target variable (discrete or continuous). A feature selection algorithm selects the subset of columns of the data matrix that are most relevant to the target variable.

In general, feature selection algorithms fall into one of three classes:

- **Wrapper methods** use learning algorithms on the original data, and select relevant features based on the (out-of-sample) performance of the learning algorithm. Training a random forest on the data, and selecting relevant features based on the feature importances, would be an example of a wrapper model.
- **Filter methods** do not use a learning algorithm on the original data, but only consider statistical characteristics of the input data. For example, we can select the features for which the correlation between the feature and the target variable exceeds a correlation threshold.
- **Embedded methods** are a catch-all group of techniques which perform feature selection as part of the model construction process. The LASSO is an example of an embedded method.
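To make the filter idea concrete, here is a minimal sketch of the correlation-threshold example in plain Python. The helper names (`pearson`, `correlation_filter`), the threshold of 0.5, and the toy data are all assumptions made up for illustration; they are not taken from any particular library.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # convention chosen here: a constant column scores zero
    return cov / math.sqrt(var_x * var_y)

def correlation_filter(columns, target, threshold=0.5):
    """Return the names of columns whose |correlation| with target exceeds threshold."""
    return [
        name
        for name, values in columns.items()
        if abs(pearson(values, target)) > threshold
    ]

# Toy data, invented for this example.
columns = {
    "signal": [1.0, 2.0, 3.0, 4.0, 5.0],
    "noise": [2.0, -1.0, 2.0, -1.0, 2.0],
    "constant": [1.0, 1.0, 1.0, 1.0, 1.0],
}
target = [1.1, 1.9, 3.2, 3.9, 5.1]

selected = correlation_filter(columns, target)
print(selected)
```

Note that no model is trained anywhere in this sketch: each feature is scored directly against the target, which is what distinguishes a filter from a wrapper. The price is that a linear correlation score can miss non-linear dependencies.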

In this blog post I will focus on filter methods, and in particular I’ll look at filter methods that use an entropy measure called **mutual information** to assess which features should be included in the reduced data set. The resulting criterion leads to an NP-hard optimisation problem, and I’ll discuss several ways in which we can try to find good solutions to it.
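As a rough preview of what mutual information measures, the sketch below computes the standard plug-in (empirical frequency) estimate of $I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}$ for two discrete variables, and ranks features by it. This is only an illustration under simplifying assumptions: the function name and toy data are made up, the features are assumed to be discrete, and each feature is scored on its own, which is simpler than a criterion over whole subsets of features.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y), in nats, for two discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written in terms of raw counts
        mi += (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
    return mi

# Toy data, invented for this example.
target = [0, 0, 1, 1, 0, 1, 0, 1]
features = {
    "independent": [0, 1, 0, 1, 0, 0, 1, 1],     # empirically independent of target
    "copy_of_target": [0, 0, 1, 1, 0, 1, 0, 1],  # identical to target
}

scores = {name: mutual_information(values, target) for name, values in features.items()}

# Rank features from most to least informative about the target.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

A feature that determines the target gets a score equal to the target's entropy, while an empirically independent feature scores zero. Unlike correlation, this score also picks up non-linear dependencies.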