The fundamental problem of causal analysis

“Correlation does not imply causation” is one of those principles everyone who works with data should know. It is one of the first concepts taught in any introductory statistics class, and for good reason: much of the work of a data scientist or statistician revolves around questions of causation, such as the following.

  • Did customers buy product X or service Y because of last week's email campaign, or would they have converted regardless of whether we ran the campaign?
  • Did in-store promotion Z have any effect on the spending behavior of customers four weeks after the promotion?
  • Did people with disease X get better because they took treatment Y, or would they have gotten better anyway?
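
The "would they have converted regardless" part of these questions is what makes them hard: for each customer we only ever observe one of the two outcomes. The sketch below illustrates this with a tiny hand-made data set. All of the numbers and names (customers, y0, y1) are assumptions made up for illustration; in real data the two potential outcomes are never both visible. Here loyal customers are more likely to receive the email, so a naive comparison overstates its effect.

```python
# Toy, hand-made data (NOT real): one tuple per customer.
# loyal:     whether the customer is a loyal shopper (the confounder)
# got_email: whether the customer received the campaign email
# y0, y1:    purchase (1) or not (0) WITHOUT and WITH the email.
#            In practice only one of y0 / y1 is ever observed.
customers = [
    (True, True, 1, 1),
    (True, True, 1, 1),
    (True, True, 1, 1),
    (True, False, 1, 1),
    (False, True, 0, 1),
    (False, False, 0, 0),
    (False, False, 0, 0),
    (False, False, 0, 0),
]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# True average effect: needs both potential outcomes for everyone.
true_effect = mean(y1 - y0 for _, _, y0, y1 in customers)

# Naive estimate: compare observed outcomes of emailed vs. non-emailed customers.
treated = [y1 for _, got_email, _, y1 in customers if got_email]
control = [y0 for _, got_email, y0, _ in customers if not got_email]
naive_effect = mean(treated) - mean(control)

print(f"true effect:  {true_effect}")
print(f"naive effect: {naive_effect}")
```

In this toy example the email changes the behavior of only one customer in eight, yet the naive comparison suggests a much larger effect, because loyal customers would have bought anyway and were emailed more often.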

Being able to distinguish between spurious correlations and true causal effects means a data scientist can truly add value to the company.

This is where traditional statistical methods, such as experimental design, come into play. Although experimental design is perhaps not commonly associated with the field of data science, more and more data scientists are using its principles. Data scientists at Twitter use these principles to correct for hidden bias in their A/B tests, engineers from Google have developed a whole R package [1] around causal analysis, and at Tesco we use these principles to attribute changes in customer spending behavior to the promotions customers participated in.

In this post we will have a look at some of the frequently used methods in causal analysis. First, we will go through a little bit of theory, and talk about why we need causal analysis in the first place (the fundamental problem of causal analysis). I will then introduce you to propensity score matching methods, which are one way of dealing with observational data sets. We will wrap up with a discussion about other methods, and I have also put up an IPython notebook that walks you through an example data set.

  1. Brodersen, K. H., Gallusser, F., Koehler, J., Remy, N., & Scott, S. L. 2015. Inferring causal impact using Bayesian structural time-series models. The Annals of Applied Statistics, 9(1), 247-274. https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41854.pdf 


Long-term forecasting with machine learning models

Time series analysis has been around for ages. Even though it sometimes does not receive the attention it deserves in the current data science and big data hype, it is one of those problems almost every data scientist will encounter at some point in their career. Time series problems can actually be quite hard to solve, as you deal with a relatively small sample size most of the time. This usually means an increase in the uncertainty of your parameter estimates or model predictions.

A common problem in time series analysis is to make a forecast for the time series at hand. An extensive body of theory on the different types of models you can use for forecasting a time series is already available in the literature. Seasonal ARIMA models and state-space models are quite standard methods for these kinds of problems. I recently had to provide some forecasts, and in this blog post I’ll discuss some of the different approaches I considered.

The difference from my previous encounters with time series analysis was that I now had to provide longer-term forecasts (which in itself is an ambiguous term, as it depends on the context) for a large number of time series (~500K). This prevented me from using some of the classical methods mentioned before, because

  1. classical ARIMA models are typically well-suited for short-term forecasts, but not for longer-term forecasts, because the autoregressive part of the model converges to the mean of the time series; and
  2. the MCMC sampling algorithms for some of the Bayesian state-space models can be computationally heavy. Since I needed forecasts for a lot of time series quickly, this ruled out this type of algorithm.
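
The first point can be seen in the simplest autoregressive case. The sketch below assumes a stationary AR(1) process with known parameters; the function name and the numbers are made up for illustration. The h-step-ahead point forecast is mean + phi^h * (last_value - mean), so with |phi| < 1 the forecast decays geometrically toward the mean as the horizon grows.

```python
def ar1_forecast(last_value, mean, phi, horizon):
    """Point forecasts for steps 1..horizon of a stationary AR(1) process.

    Assumes the model y_t - mean = phi * (y_{t-1} - mean) + noise with
    |phi| < 1, and that mean and phi are known rather than estimated.
    """
    return [mean + phi ** h * (last_value - mean) for h in range(1, horizon + 1)]


# Illustrative values only: the series currently sits well above its mean.
forecasts = ar1_forecast(last_value=20.0, mean=10.0, phi=0.5, horizon=12)

for step, value in enumerate(forecasts, start=1):
    print(f"h={step:2d}  forecast={value:.4f}")
```

After a handful of steps the forecast is practically indistinguishable from the mean, which is why such a model carries little information at long horizons.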

Instead, I opted for a more algorithmic point of view, as opposed to a statistical one, and decided to try out some machine learning methods. However, most of these methods are designed for independent and identically distributed (IID) data, so it is interesting to see how we can apply these models to non-IID time series data.
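
One common way to hand a time series to a standard machine learning model is to turn it into a table of lagged values, where each row holds the previous few observations and the target is the next one. The sketch below shows that reframing; it is an assumption about how this could be done, not necessarily the approach taken in this post, and the function name make_lagged is hypothetical.

```python
def make_lagged(series, n_lags):
    """Turn a sequence into (features, targets) for supervised learning.

    Each feature row contains the n_lags observations preceding the target.
    """
    features, targets = [], []
    for t in range(n_lags, len(series)):
        features.append(list(series[t - n_lags:t]))
        targets.append(series[t])
    return features, targets


# Illustrative series only.
X, y = make_lagged([1, 2, 3, 4, 5, 6], n_lags=3)

for row, target in zip(X, y):
    print(row, "->", target)
```

Note that the resulting rows overlap and are therefore still not independent, so the non-IID nature of the data does not disappear. It mainly affects how the model should be validated, for example by splitting on time rather than at random.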
