Don't fear the rise of automated machine learning

There is a wealth of high-quality research tools available in the machine learning open-source community. However, as an industry we still lack standardised tooling for putting models into production. Much of the code we write is repetitive, and there are no industry-wide standards for storing experiment results, building and versioning models, or tracking model performance over time in production.

In the last few years, a number of start-ups and initiatives have set out to address these issues. In particular, there has been a noticeable rise in the number of commercial offerings (Google’s AutoML, DataRobot, SparkBeyond, SigOpt) and open-source solutions (AutoWEKA, auto-sklearn, SMAC) that provide automated machine learning, in some cases all the way into production.


Data scientists, the only useful code is production code

The responsibilities of a data scientist can be very diverse, and people have written in the past about the different types of data scientists that exist in the industry. These range from analyst-like roles to roles focused on software engineering. Partly because of the different responsibilities these jobs carry, and the diverse backgrounds data scientists come from, data scientists sometimes have a bad reputation amongst their peers when it comes to writing good-quality code. Not everybody comes to data science with a software engineering background.

Whatever the responsibilities of a data scientist are, code is a main (by)product of their work. Whether the scientist is producing ad-hoc analyses for a business stakeholder or building a machine learning model that sits behind a RESTful API, the main output is always code. Since most data scientists don’t come from a software engineering background, the quality of that code can vary a lot, causing issues with reproducibility and maintainability further down the line.


Filtering the noise with stability selection

In the previous blog post, I discussed different types of feature selection methods and focussed on methods based on mutual information. I’ve since given a broader talk on feature selection at PyData London. In the talk, I discussed an example of an embedded feature selection method called stability selection, a method that tends to work well in high-dimensional, sparse problems.

Embedded methods are a catch-all group of techniques that perform feature selection as part of the model construction process. They typically take the interaction between feature subset search and the learning algorithm into account, at the cost of extra computational time.
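
To make the idea concrete, here is a minimal sketch of one common formulation of stability selection: fit a sparse model (here a Lasso, solved by plain coordinate descent) on many random subsamples of the data and record how often each feature ends up with a non-zero coefficient. This is an illustration under stated assumptions, not the implementation from the talk. The choice of the Lasso as base learner, the toy data, and every name and parameter value (`lasso_coordinate_descent`, `stability_selection`, `alpha=0.5`, 30 rounds, half-sized subsamples) are hypothetical choices made for the example.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; this is what produces exact zeros."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=50):
    """Minimise (1 / 2n) * ||y - Xw||^2 + alpha * ||w||_1 by cyclic coordinate descent.

    No intercept is fitted, so the data is assumed to be roughly centred.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw, because w starts at zero
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of feature j with the residual that leaves feature j out.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_wj = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_wj - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_wj
    return w


def stability_selection(X, y, alpha, n_rounds=30, sample_fraction=0.5, seed=0):
    """Return, per feature, the fraction of subsamples in which the Lasso selected it."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    subsample_size = max(1, int(n * sample_fraction))
    counts = [0] * p
    for _ in range(n_rounds):
        idx = rng.sample(range(n), subsample_size)  # subsample without replacement
        w = lasso_coordinate_descent([X[i] for i in idx], [y[i] for i in idx], alpha)
        for j, wj in enumerate(w):
            if wj != 0.0:
                counts[j] += 1
    return [count / n_rounds for count in counts]


def make_toy_data(n_samples=100, n_features=10, seed=0):
    """Toy regression problem: only features 0 and 1 influence the target."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [3.0 * row[0] - 2.0 * row[1] + rng.gauss(0.0, 0.5) for row in X]
    return X, y


X, y = make_toy_data()
frequencies = stability_selection(X, y, alpha=0.5)
for j, freq in enumerate(frequencies):
    print(f"feature {j}: selected in {freq:.0%} of subsamples")
```

Features whose selection frequency exceeds a chosen cut-off are kept. Both that cut-off and the regularisation strength `alpha` are tuning choices, and in practice the procedure is often repeated over a range of `alpha` values rather than a single one. The repeated refitting is also where the extra computational cost of the method comes from.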
