Over the years, working with different clients and applications, I have found a set of general data mining principles that also hold true in the context of big data. These are all listed in my book; here I enumerate them using the terminology used there:
- Use of “all the data” is not equivalent to building the deepest Analytics Dataset (ADS) in terms of the number of rows.
- Given a choice between more data and a fancier algorithm, "more data" at the top of the data preparation funnel is always preferred.
- For most problems, when using the same learning algorithm, smart variable creation on a proper sample of data (a sampled ADS) outperforms use of the deepest ADS built only from primary and/or rudimentary (not well-thought-out) attributes.
- For most problems, smart variable creation combined with a simple algorithm outperforms a mediocre variable creation exercise combined with the fanciest algorithms, regardless of the ADS row size.
- For some classes of algorithms, a smart ADS combined with a smart presentation of its variables to the learning algorithm on sampled data outperforms the same smart ADS with the largest number of rows but without proper presentation of its variables.
- In many problems, one has to deal with transactional data that requires creating time-based variables, giving the learner a short- and long-term memory of the past behavior of the entities being modeled. Depending on the problem, such variables must be computed and updated from each entity's transaction history, either for every transaction in real time or at specific time intervals.
- For a fixed model complexity, as the number of rows (observations) in the ADS increases, the training and test errors converge.
- For some problems, the entire population must be represented in the ADS (e.g., social network analysis, long-tail problems, high-cardinality recommenders, search). For all other problems, sampling remains valid. For a subset of these problems, sampling is mandatory, e.g., highly unbalanced datasets, segmented modeling, micro-modeling, and campaign groups. For the remainder, it is optional and no longer a limiting factor; historically, sampling was done for these problems to speed up processing or to reduce storage cost.
- For big data, it is desirable to use the same platform and interface for data understanding, preparation, and model development, with minimal data movement and the fewest passes through the data needed to reach a result.
- In the transition from model development to deployment, automatic code generation for computation of variables and models is of high importance to ensure quality control. Automatic code generation is mandatory in applications that require a large number of models.
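To make the time-based memory principle concrete, here is a minimal sketch (plain Python; all names and the sample transaction history are hypothetical, not from the book) of short- and long-term memory features computed as exponentially decayed event counts over an entity's transaction history:

```python
from math import exp, log

def decayed_count(timestamps, now, half_life):
    """Exponentially decayed event count: an event happening now weighs ~1,
    an event one half-life ago weighs 0.5, two half-lives ago 0.25, etc."""
    lam = log(2) / half_life
    return sum(exp(-lam * (now - t)) for t in timestamps)

# Hypothetical transaction history for one entity (times in days).
txn_times = [0, 1, 2, 30, 31, 60]
now = 61

short_memory = decayed_count(txn_times, now, half_life=7)   # recent activity
long_memory = decayed_count(txn_times, now, half_life=90)   # lifetime trend
```

A practical property of this form is that it can be updated incrementally per transaction (multiply the stored count by the decay for the elapsed time, then add 1), which fits the real-time update requirement without re-reading the full history.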
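The convergence of training and test errors for a fixed model complexity can be illustrated with a toy experiment (a sketch on synthetic data, not an example from the book): a fixed-complexity histogram classifier fit on noisy labels, evaluated at growing training sizes.

```python
import random

def make_data(n, rng, noise=0.2):
    """x ~ U(0,1); true label is x > 0.5, flipped with probability `noise`."""
    return [(x, (x > 0.5) != (rng.random() < noise))
            for x in (rng.random() for _ in range(n))]

def fit_histogram(data, bins=20):
    """Fixed-complexity model: majority label within each of `bins` buckets."""
    votes = [[0, 0] for _ in range(bins)]
    for x, y in data:
        votes[min(int(x * bins), bins - 1)][y] += 1
    return [v[1] >= v[0] for v in votes]

def error_rate(data, model, bins=20):
    wrong = sum(model[min(int(x * bins), bins - 1)] != y for x, y in data)
    return wrong / len(data)

rng = random.Random(0)
gaps = {}
for n in (100, 1000, 20000):
    train, test = make_data(n, rng), make_data(5000, rng)
    model = fit_histogram(train)
    gaps[n] = abs(error_rate(train, model) - error_rate(test, model))
```

With the label noise fixed at 20%, both errors approach the irreducible 0.2 as rows are added, and the train/test gap shrinks toward zero.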
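As an illustration of the mandatory-sampling case for highly unbalanced datasets, here is a sketch of majority-class downsampling (the function and field names are hypothetical):

```python
import random

def downsample_majority(rows, label_key, rng, ratio=1.0):
    """Keep every minority-class row; sample the majority class down to
    ratio * (minority size) -- a common fix for highly unbalanced ADSs."""
    by_label = {}
    for r in rows:
        by_label.setdefault(r[label_key], []).append(r)
    minority = min(by_label.values(), key=len)
    majority = max(by_label.values(), key=len)
    k = min(len(majority), int(ratio * len(minority)))
    return minority + rng.sample(majority, k)

rng = random.Random(42)
rows = [{"y": 1}] * 50 + [{"y": 0}] * 5000   # 1% positives
balanced = downsample_majority(rows, "y", rng)
# len(balanced) == 100: all 50 positives plus 50 sampled negatives
```

The model is then trained on the balanced sample; the resulting scores can be recalibrated afterwards to reflect the true class prior.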