Ten General Principles in Data Mining/Science

Over the years, working with different clients and applications, I have found a set of general data mining principles that also hold true in the context of big data. These are all listed in my book. Here I enumerate them using the terminology from the book:
  1. Use of “all the data” is not equivalent to building the deepest Analytics Dataset (ADS) in terms of the number of rows.
  2. Given a choice between more of the data or a fancier algorithm, choosing “more data” at the top of the data preparation funnel is always preferred.
  3. For most problems, when using the same learning algorithm, smart variable creation on a proper sample of data (a sampled ADS) outperforms the use of the deepest ADS with only primary and/or rudimentary (not well thought out) attributes.
  4. For most problems, smart variable creation combined with a simple algorithm outperforms a mediocre variable creation exercise combined with the fanciest algorithms, whatever the ADS row size.
  5. For some classes of algorithms, a smart ADS combined with a smart presentation of its variables to the learning algorithm on sampled data outperforms a smart ADS with the largest number of rows without proper presentation of variables.
  6. In many problems, one has to deal with transactional data, which requires creating time-based variables that give the learner a short-term and long-term memory of the past behavior of the entities being modeled. Depending on the problem, such variables need to be computed and updated from the transaction history of each entity, either for every transaction in real time or at specific time intervals.
  7. For a fixed model complexity, as the number of rows (observations) in the ADS increases, the training and test errors converge.
  8. For some problems, the entire population must be represented in an ADS (e.g., social network analysis, long-tail problems, high-cardinality recommenders, search). For all other problems, sampling continues to be valid. For a subset of these problems, sampling is mandatory, e.g., highly unbalanced datasets, segmented modeling, micro-modeling, and campaign groups. For the remainder, it is optional: historically, sampling had to be done for these problems to speed up processing or to reduce storage cost, but that is no longer a limiting factor.
  9. For big data, it is desirable to use the same platform and interface for data understanding, preparation, and model development, with minimal data movement and the fewest passes through the data to get to the result.
  10. In the transition from model development to deployment, automatic code generation for computation of variables and models is of high importance to ensure quality control. Automatic code generation is mandatory in applications that require a large number of models.
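
To make principle 6 a bit more concrete, here is a minimal sketch of one common way to give a learner short-term and long-term memory: an exponentially decayed running sum per entity, updated on every transaction. This is an illustration under my own assumptions, not the specific method from the book. The names `DecayedSum` and `add_memory_features`, the half-life values, and the `(entity_id, time_in_days, amount)` tuple layout are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass
class DecayedSum:
    """Exponentially decayed running sum of transaction amounts (hypothetical helper)."""
    half_life_days: float
    value: float = 0.0
    last_time: Optional[float] = None  # time of the previous update, in days

    def update(self, t: float, amount: float) -> float:
        # Decay the stored value by the time elapsed since the last update,
        # then add the new amount. Assumes updates arrive in time order.
        if self.last_time is not None:
            elapsed = t - self.last_time
            self.value *= 0.5 ** (elapsed / self.half_life_days)
        self.value += amount
        self.last_time = t
        return self.value


def add_memory_features(
    transactions: Iterable[Tuple[str, float, float]],
    short_half_life: float = 1.0,   # assumed: "short-term" means about a day
    long_half_life: float = 30.0,   # assumed: "long-term" means about a month
) -> List[dict]:
    """Return one row per transaction with short- and long-memory variables.

    `transactions` holds (entity_id, time_in_days, amount) tuples sorted by time.
    Each variable includes the current transaction.
    """
    short: Dict[str, DecayedSum] = {}
    long_: Dict[str, DecayedSum] = {}
    rows = []
    for entity, t, amount in transactions:
        s = short.setdefault(entity, DecayedSum(short_half_life))
        l = long_.setdefault(entity, DecayedSum(long_half_life))
        rows.append({
            "entity": entity,
            "time": t,
            "amount": amount,
            "short_sum": s.update(t, amount),
            "long_sum": l.update(t, amount),
        })
    return rows


if __name__ == "__main__":
    history = [("a", 0.0, 100.0), ("a", 1.0, 100.0), ("b", 1.0, 50.0), ("a", 31.0, 100.0)]
    for row in add_memory_features(history):
        print(row)
```

Because each variable only needs its previous value and the time of the last update, it can be refreshed for every transaction without rescanning the entity's full history. The same state could instead be refreshed at fixed intervals if the problem does not need per-transaction updates.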



Comment by Adrian Walker on July 18, 2015 at 2:51am

The following knowledge meta-app may be of interest.

It's a platform on the Web that's intended to support people socially writing their own big data apps, by typing _executable_ English knowledge into browsers.

For example, here's the "source code" of an app written in executable English:


Anyone on the web can view and run the app by pointing a browser to the site. They can also edit the app, Wikipedia style (but without those severe human editors). And of course, you can also write and run new apps.

Since the apps are written in English, they are findable via Google.  For example, searching Google for

                  imported oil energy some-source

finds the app mentioned above.

Shared use of the platform is free, by pointing a browser to www.reengineeringllc.com, and there are no advertisements.

Apologies if you have seen this before, and thanks for comments.

                                                    -- Adrian
Internet Business Logic
A Wiki and SOA Endpoint for Executable Open Vocabulary English Q/A
Online at www.reengineeringllc.com   
Shared use is free, and there are no advertisements

