Mutual information-based feature selection

Although model selection plays an important role in learning a signal from input data, it is arguably even more important to give the algorithm the right input data to begin with. When building a model, the first step for a data scientist is typically to construct relevant features through appropriate feature engineering. The resulting data set, which is often high-dimensional, can then be used as input for a statistical learner.

Although we’d like to think of these learners as smart and sophisticated algorithms, they can be fooled by spurious dependencies in the data. A data scientist has to make the signal as easy as possible for the model to identify. In practice, this means that feature selection is an important preprocessing step: it helps to zero in on the relevant variables in a data set, to eliminate collinear variables, and to reduce noise, so that the model can pick up the relevant signals.
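
To make that concrete, here is a minimal sketch of mutual information-based feature selection with scikit-learn's mutual_info_classif scorer. The synthetic data set and the choice of keeping 10 features are illustrative only, not the setup from the full post.

```python
# A minimal sketch of mutual information-based feature selection with
# scikit-learn; the data set and the choice of k=10 are illustrative only.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic data: 20 features, of which only 5 carry signal.
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

# Score each feature by its estimated mutual information with the target,
# and keep the 10 highest-scoring features.
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)   # mutual information estimate per feature
print(X_selected.shape)   # (500, 10)
```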

Read more

Pydata London 2017 and hyperopt

Last week I attended the PyData London conference, where I gave a talk about Bayesian optimization. The talk was based on my previous post on using scikit-learn to implement this kind of algorithm. The main points I wanted to get across in my talk were:

  1. How the Bayesian optimization algorithm works; and
  2. How the algorithm can be used in your day-to-day work.

I have uploaded the slides to GitHub, and you can find them here.

It seems that there is interest in using these optimization methods, but that there are still a lot of difficulties in applying them properly. In particular, the fact that the optimization algorithm itself needs to be tuned makes it non-trivial to use these methods successfully in practice.

Sequential model-based optimization (SMBO) using hyperopt

In my talk, there was not a lot of time to dive into some of the more production-ready software packages for sequential model-based optimization. Lately, I have spent some time working with the package hyperopt [1], and its API is actually easy to use. It is also straightforward to make hyperopt work with scikit-learn estimators. By treating the model type as a hyperparameter, we can even build a search that optimizes not only the hyperparameters of a model, but also the type of model itself.

  1. J. Bergstra, D. Yamins, and D. D. Cox. "Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures." ICML (1) 28 (2013): 115–123. https://www.jmlr.org/proceedings/papers/v28/bergstra13.pdf
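
As a rough illustration of that idea, here is a hypothetical sketch of such a search using hyperopt's hp.choice to put the estimator class itself into the search space. The data set, search space, and evaluation budget are illustrative assumptions, not the exact setup from the talk or the post.

```python
# Hypothetical sketch: searching over both the model type and its
# hyperparameters with hyperopt. Space, data set, and budget are illustrative.
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# hp.choice makes the estimator class itself part of the search space.
space = hp.choice("model", [
    {
        "type": "svm",
        "C": hp.loguniform("svm_C", -3, 3),
        "gamma": hp.loguniform("svm_gamma", -3, 3),
    },
    {
        "type": "rf",
        "n_estimators": hp.choice("rf_n_estimators", [50, 100, 200]),
        "max_depth": hp.choice("rf_max_depth", [3, 5, 10, None]),
    },
])

def objective(params):
    if params["type"] == "svm":
        clf = SVC(C=params["C"], gamma=params["gamma"])
    else:
        clf = RandomForestClassifier(n_estimators=params["n_estimators"],
                                     max_depth=params["max_depth"])
    # hyperopt minimizes, so return the negative cross-validated accuracy.
    score = cross_val_score(clf, X, y, cv=3).mean()
    return {"loss": -score, "status": STATUS_OK}

best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=50, trials=Trials())
print(best)
```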

Read more

Bayesian optimization with scikit-learn

Choosing the right parameters for a machine learning model is almost more of an art than a science. Kaggle competitors spend considerable time tuning their models in the hope of winning competitions, and proper model selection plays a huge part in that. It is remarkable, then, that the industry-standard algorithm for selecting hyperparameters is something as simple as random search.

The strength of random search lies in its simplicity. Given a learner \(\mathcal{M}\) with parameters \(\mathbf{x}\) and a loss function \(f\), random search tries to find \(\mathbf{x}\) such that \(f\) is maximized (or minimized) by evaluating \(f\) for randomly sampled values of \(\mathbf{x}\). This is an embarrassingly parallel algorithm: to parallelize it, we simply start a random search on each machine separately.
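
In code, random search is little more than a loop. The sketch below uses an illustrative SVC and cross-validated accuracy as \(f\); the digits data set, the sampling ranges, and the budget of 25 evaluations are assumptions for the example, not part of the original post.

```python
# A minimal sketch of random search: sample x at random, evaluate f(x),
# and keep the best. Here f is the cross-validated accuracy of an SVC.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
rng = np.random.RandomState(0)

best_score, best_params = -np.inf, None
for _ in range(25):
    # Randomly sample a candidate x = (C, gamma) on a log scale.
    params = {"C": 10 ** rng.uniform(-3, 3), "gamma": 10 ** rng.uniform(-6, 0)}
    score = cross_val_score(SVC(**params), X, y, cv=3).mean()
    if score > best_score:
        best_score, best_params = score, params

print(best_params, best_score)
```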

This algorithm works well enough if we can get samples from \(f\) cheaply. However, when we are training sophisticated models on large data sets, it can take on the order of hours, or even days, to obtain a single sample from \(f\). In those cases, can we do better than random search? It seems that we should be able to use past samples of \(f\) to decide for which values of \(\mathbf{x}\) we are going to sample \(f\) next.

Bayesian optimization

There is actually a whole field dedicated to this problem, and in this blog post I’ll discuss one Bayesian algorithm for it. I’ll go through some of the fundamentals, whilst keeping it light on the maths, and try to build up some intuition around this framework. Finally, we’ll apply the algorithm to a real classification problem using the popular Python machine learning toolkit scikit-learn. If you’re not interested in the theory behind the algorithm, you can skip straight to the code and example by clicking here.
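
As a taste of what the full post builds up to, here is a compressed sketch of the Bayesian optimization loop: fit a Gaussian process surrogate to the samples of \(f\) collected so far, then pick the next \(\mathbf{x}\) by maximizing expected improvement over a candidate grid. The one-dimensional toy function, kernel choice, and ten-iteration budget are assumptions for illustration, not the code from the post.

```python
# A compressed sketch of the Bayesian optimization loop, assuming a toy
# 1-D objective: fit a GP surrogate to past samples of f and choose the
# next point by maximizing expected improvement over a candidate grid.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def f(x):
    # The (expensive) function we want to maximize; illustrative only.
    return -np.sin(3 * x) - x ** 2 + 0.7 * x

candidates = np.linspace(-2, 2, 500).reshape(-1, 1)
X_sample = np.array([[-1.5], [0.0], [1.5]])   # initial evaluations of f
y_sample = f(X_sample).ravel()

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

for _ in range(10):
    gp.fit(X_sample, y_sample)
    mu, sigma = gp.predict(candidates, return_std=True)

    # Expected improvement over the best observation so far.
    best = y_sample.max()
    with np.errstate(divide="ignore", invalid="ignore"):
        z = (mu - best) / sigma
        ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)
        ei[sigma == 0.0] = 0.0

    # Evaluate f at the most promising candidate and add it to the data.
    x_next = candidates[np.argmax(ei)].reshape(1, -1)
    X_sample = np.vstack([X_sample, x_next])
    y_sample = np.append(y_sample, f(x_next).ravel())

print(X_sample[np.argmax(y_sample)], y_sample.max())
```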

Read more