I try to keep an eye out for articles written by data scientists in other countries, especially those we don’t hear from all that often. What I’m looking for is any difference in perspective about our field. Are the approaches to data problem solving colored by some cultural context? So far I’m happy, and a little relieved, to report that they are not.
In a recent article, “Three Things About Data Science You Won’t Find In the Books”, Mikio Braun, a postdoc data scientist at TU Berlin, shows that his frustrations and observations are the same as those of us practicing in the US. Here are some highlights.
1. Evaluation Is Key
The main goal in data analysis/machine learning/data science (or whatever you want to call it) is to build a system that will perform well on future data. You want to be sure that the method works well and produces the same kind of results you have seen on your original data set.
A mistake beginners often make is to look only at the performance on the available data and then assume the method will work just as well on future data. Unfortunately, that is seldom the case.
So the proper way to evaluate is to simulate having future data by splitting the data, training on one part and then predicting on the other. Usually the training part is larger, and the procedure is iterated several times to get a few numbers showing how stable the method is. The resulting procedure is called cross-validation.
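As a minimal sketch of that procedure (plain Python, my own illustration rather than the author's code): the data is shuffled into k folds, each fold serves as the test set exactly once, and the remaining folds provide the training data.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle n sample indices and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(X, y, train_and_score, k=5):
    """k-fold cross-validation: each fold is the test set once, the rest
    is training data.  Returns one score per fold, so you can look at the
    spread and see how stable the method is."""
    folds = k_fold_indices(len(X), k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        scores.append(train_and_score(
            [X[j] for j in train], [y[j] for j in train],
            [X[j] for j in test], [y[j] for j in test]))
    return scores
```

Here `train_and_score` stands in for whatever fits your model on the training part and returns a score on the test part.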
Still, a lot can go wrong, especially when the data is non-stationary, that is, the underlying distribution of the data is changing over time.
Or there is a lot of correlation between the data points, meaning that if you know one data point you already know a lot about another data point. For example, if you take stock prices, they usually don’t jump around a lot from one day to the other, so that doing the training/test split randomly by day leads to training and test data sets which are highly correlated.
Whenever that happens, you will get performance numbers which are overly optimistic, and your method will not work well on true future data. In the worst case, you’ve finally convinced people to try out your method in the wild, and then it stops working, so learning how to properly evaluate is key!
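One common way to avoid that trap with time-correlated data (my illustration, not from the article) is to split by time instead of at random: always train on the past and test on the block that follows, so correlated neighbours never leak across the split.

```python
def time_ordered_splits(n, n_splits=3):
    """Forward-chaining splits for time-ordered data: each split trains on
    everything before a cut-off and tests on the block right after it,
    never on a random shuffle that would mix correlated neighbours."""
    fold = n // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train = list(range(0, i * fold))
        test = list(range(i * fold, min((i + 1) * fold, n)))
        yield train, test
```

For stock prices, for example, this means the test days always lie strictly after the training days.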
2. It’s All In The Feature Extraction
Learning about a new method is exciting and all, but the truth is that most complex methods essentially perform the same, and that the real difference is made by the way in which raw data is turned into features used in learning.
Modern learning methods are pretty powerful, easily dealing with tens of thousands of features and hundreds of thousands of data points, but the truth is that in the end, these methods are pretty dumb. Especially methods that learn a linear model (like logistic regression, or linear support vector machines) are essentially as dumb as your calculator.
They are really good at identifying the informative features given enough data, but if the information isn’t in there, or not representable by a linear combination of input features, there is little they can do. They are also not able to do this kind of data reduction themselves by having “insights” about the data.
Put differently, you can massively reduce the amount of data you need by finding the right features.
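A toy illustration of that point (my example, not the article's): points inside versus outside a circle cannot be separated by any straight line in the raw coordinates, so a linear model on `(x, y)` is stuck, but one well-chosen feature makes the problem trivial.

```python
import random

random.seed(0)
# Toy data: points inside the unit circle are class 0, outside class 1.
# No straight line in (x, y) separates the classes, so a linear model
# on the raw coordinates hovers near chance level.
points = [(random.uniform(-2, 2), random.uniform(-2, 2)) for _ in range(200)]
labels = [1 if x * x + y * y > 1.0 else 0 for (x, y) in points]

# One informative feature -- the squared distance from the origin --
# makes the problem trivially linear: a single threshold separates it.
radius_sq = [x * x + y * y for (x, y) in points]
accuracy = sum((r > 1.0) == bool(l)
               for r, l in zip(radius_sq, labels)) / len(labels)
```

The feature encodes exactly the insight the model cannot find on its own, which is the whole point of feature engineering.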
This means two things: First, you should make sure that you master one of those nearly equivalent methods, and then you can stick with it. You don’t really need both logistic regression and linear SVMs; you can just pick one.
Second, you should learn all about feature engineering. Unfortunately, this is more of an art, and it is barely covered in any textbook because there is so little theory to it.
I know that textbooks often sell methods as being so powerful that you can just throw data at them and they will do the rest. That may even be true from a theoretical viewpoint, given an infinite supply of data. But in reality, data and our time are finite, so finding informative features is absolutely essential.
3. Model Selection Burns Most Cycles, Not Data Set Sizes
Now this is something you don’t want to say too loudly in the age of Big Data, but most data sets will perfectly fit into your main memory. And your methods will probably also not take too long to run on the data. But you will spend a lot of time extracting features from the raw data and running cross-validation to compare different feature extraction pipelines and parameters for your learning method.
For model selection, you go through a large number of parameter combinations, evaluating the performance on identical copies of the data.
The problem is all in the combinatorial explosion. Let’s say you have just two parameters, and it takes about a minute to train your model and get a performance estimate on the hold-out data set (properly evaluated as explained above). If you have five candidate values for each of the parameters, and you perform 5-fold cross-validation (splitting the data set into five parts and running the test five times, using a different part for testing in each iteration), this means that you will already do 5 × 5 × 5 = 125 runs to find out which method works well, and instead of one minute you wait about two hours.
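The arithmetic behind that explosion is easy to sketch (the parameter names `C` and `gamma` are just hypothetical stand-ins):

```python
def n_cv_runs(param_grid, n_folds):
    """Total model fits for an exhaustive search over a parameter grid
    with k-fold cross-validation: product of the grid sizes, times k."""
    n_combos = 1
    for values in param_grid.values():
        n_combos *= len(values)
    return n_combos * n_folds

# Two parameters with five candidate values each, under 5-fold CV:
grid = {"C": [0.01, 0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10, 100]}
runs = n_cv_runs(grid, n_folds=5)  # 5 * 5 * 5 = 125 fits
minutes = runs * 1                 # at one minute per fit: about two hours
```

Add a third parameter with five values and you are at 625 runs, which is why model selection, not raw data size, burns most of the cycles.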
The good news here is that this is easily parallelizable, because the different runs are entirely independent of one another. The same holds for feature extraction, where you usually apply the same operation (parsing, extraction, conversion, etc.) to each data point independently, leading to something called “embarrassingly parallel” (yes, that’s a technical term).
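A minimal sketch of that parallelism (toy scoring function and hypothetical parameters; in practice each call would train and score a real model): because the runs share no state, a plain parallel map is all the coordination you need.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def evaluate(params):
    """Stand-in for one independent train-and-score run.  The fake score
    peaks at C=1, gamma=0.1; a real version would fit a model here."""
    c, gamma = params
    return params, 1.0 / (1.0 + abs(c - 1) + abs(gamma - 0.1))

grid = list(product([0.01, 0.1, 1, 10], [0.01, 0.1, 1]))

# The runs are entirely independent, so farm them out with a plain map;
# for CPU-bound training you would use processes rather than threads.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(evaluate, grid))

best_params, best_score = max(results, key=lambda r: r[1])
```

No distributed framework is required; the workers never need to talk to each other.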
The bad news is mostly for the Big Data guys: all of this means there is seldom a need for scalable implementations of complex methods; simply running the same undistributed algorithm in parallel on in-memory data would already be very helpful in most cases.
Finally, having lots of data by itself does not mean that you really need all the data, either. The question is much more about the complexity of the underlying learning problem. If the problem can be solved by a simple model, you don’t need that much data to infer the parameters of your model. In that case, taking a random subset of the data might already help a lot. And as I said above, sometimes, the right feature representation can also help tremendously in bringing down the number of data points needed.
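A toy illustration of the subset point (my example): when the underlying problem is simple, say estimating a single mean, a modest random sample gives essentially the same answer as the full data set.

```python
import random

random.seed(0)
# A "simple underlying problem": samples scattered around one true mean.
# A small random subset estimates the parameter almost as well as all
# 100,000 points, so you rarely need every point just because it exists.
data = [5.0 + random.gauss(0, 1) for _ in range(100_000)]
subset = random.sample(data, 1_000)

full_mean = sum(data) / len(data)
sub_mean = sum(subset) / len(subset)
```

The two estimates differ only by sampling noise on the order of `1/sqrt(1000)`, a fraction of a percent of the value being estimated.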
In summary, knowing how to evaluate properly can greatly reduce the risk that a method won’t perform on future data. Getting the feature extraction right is maybe the most effective lever to pull for good results. And finally, it doesn’t always take Big Data, although distributed computation can help bring down training times.