# Pydata London 2017 and hyperopt

Last week I attended the PyData London conference, where I gave a talk about Bayesian optimization. The talk was based on my previous post on using scikit-learn to implement these kind of algorithms. The main points I wanted to get across in my talk were

1. How the Bayesian optimization algorithm works; and
2. How the algorithm can be used in your day-to-day work.

I have uploaded the slides to GitHub, and you can find them here.

It seems that there is interest in using these optimization methods, but that there are still a lot of difficulties in properly applying these algorithms. Especially the fact that you need to tune the optimization algorithm itself makes it non-trivial to apply these successfully in practice.

## Sequential model-based algorithms (SMBOs) using hyperopt

In my talk, there was not a lot of time to dive into some of the more production-ready software packages for sequential model-based optimization algorithms. Lately, I spent some time working with the package hyperopt1, and its API is actually easy to use. It is also straightforward to make hyperopt work with scikit-learn estimators. By treating the model type as a hyperparameter, we can even build an optimization that not only optimizes the hyperparameters of a model, but also the type of model itself.

1. J. Bergstra, D. Yamins, and D. D. Cox. Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures., ICML (1) 28 (2013): 115-123., https://www.jmlr.org/proceedings/papers/v28/bergstra13.pdf.

# Bayesian optimization with scikit-learn

Choosing the right parameters for a machine learning model is almost more of an art than a science. Kaggle competitors spend considerable time on tuning their model in the hopes of winning competitions, and proper model selection plays a huge part in that. It is remarkable then, that the industry standard algorithm for selecting hyperparameters, is something as simple as random search.

The strength of random search lies in its simplicity. Given a learner $\mathcal{M}$, with parameters $\mathbf{x}$ and a loss function $f$, random search tries to find $\mathbf{x}$ such that $f$ is maximized, or minimized, by evaluating $f$ for randomly sampled values of $\mathbf{x}$. This is an embarrassingly parallel algorithm: to parallelize it, we simply start a grid search on each machine separately.

This algorithm works well enough, if we can get samples from $f$ cheaply. However, when you are training sophisticated models on large data sets, it can sometimes take on the order of hours, or maybe even days, to get a single sample from $f$. In those cases, can we do any better than random search? It seems that we should be able to use past samples of $f$, to determine for which values of $\mathbf{x}$ we are going to sample $f$ next.

## Bayesian optimization

There is actually a whole field dedicated to this problem, and in this blog post I’ll discuss a Bayesian algorithm for this problem. I’ll go through some of the fundamentals, whilst keeping it light on the maths, and try to build up some intuition around this framework. Finally, we’ll apply this algorithm on a real classification problem using the popular Python machine learning toolkit scikit-learn. If you’re not interested in the theory behind the algorithm, you can skip straight to the code, and example, by clicking here.

# The fundamental problem of causal analysis

“Correlation does not imply causation” is one of those principles every person that works with data should know. It is one of the first concepts taught in any introduction to statistics class. There is a good reason for this, as most of the work of a data scientist, or a statistician, does actually revolve around questions of causation:

• Did customers buy into product X or service Y because of last weeks email campaign, or would they have converted regardless of whether we did or did not run the campaign?
• Was there any effect of in-store promotion Z on the spending behavior of customers four weeks after the promotion?
• Did people with disease X got better because they took treatment Y, or would they have gotten better anyways?

Being able to distinguish between spurious correlations, and true causal effects, means a data scientist can truly add value to the company.

This is where traditional statistics, like experimental design, comes into play. Although it is perhaps not commonly associated with the field of data science, more and more data scientists are using principles from experimental design. Data scientists at Twitter use these principles to correct for hidden bias in their A/B tests, engineers from Google have developed a whole R package1 around causal analysis, and at Tesco we use these principles to attribute changes in customer spending behavior to promotions customers participated in.

In this post we will have a look at some of the frequently used methods in causal analysis. First, we will go through a little bit of theory, and talk about why we need causal analysis in the first place (the fundamental problem of causal analysis). I will then introduce you to propensity score matching methods, which are one way of dealing with observational data sets. We will wrap up with a discussion about other methods, and I have also put up an IPython notebook that walks you through an example data set.

1. Brodersen, K. H., Gallusser, F., Koehler, J., Remy, N., & Scott, S. L. 2015. Inferring causal impact using Bayesian structural time-series models. The Annals of Applied Statistics, 9(1), 247-274. https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41854.pdf

# Long-term forecasting with machine learning models

Time series analysis has been around for ages. Even though it sometimes does not receive the attention it deserves in the current data science and big data hype, it is one of those problems almost every data scientist will encounter at some point in their career. Time series problems can actually be quite hard to solve, as you deal with a relatively small sample size most of the time. This usually means an increase in the uncertainty of your parameter estimates or model predictions.

A common problem in time series analysis is to make a forecast for the time series at hand. An extensive theory around on the different types of models you can use for calculating a forecast of your time series is already available in the literature. Seasonal ARIMA models and state-space models are quite standard methods for these kinds of problems. I recently had to provide some forecasts and in this blog post I’ll discuss some of the different approaches I considered.

The difference with my previous encounters with time series analyses was that now I had to provide longer term forecasts (which in itself is an ambiguous term, as it depends on the context) for a large number of time series (~500K). This prevented me from using some of the classical methods mentioned before, because

1. classical ARIMA models are typically well-suited for short-term forecasts, but not for longer term forecasts due to the convergence of the autoregressive part of the model to the mean of the time series; and
2. the MCMC sampling algorithms for some of the Bayesian state-space models can be computationally heavy. Since I needed forecasts for a lot of time series quickly this ruled out these type of algorithms.

Instead, I opted for a more algorithmic point of view, as opposed to a statistical one, and decided to try out some machine learning methods. However, most of these methods are designed for independent and identically distributed (IID) data, so it is interesting to see how we can apply these models to non-IID time series data.