Top Data Mining Algorithms Identified by IEEE & Related Python Resources

In 2006, the IEEE International Conference on Data Mining identified 10 top data mining algorithms, based on a survey of past award winners followed by a vote. This is a list of those algorithms, each with a short description and related Python resources. The detailed paper is given here.



C4.5

C4.5 is an algorithm, developed by Ross Quinlan, that is used to generate a decision tree. C4.5 is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason C4.5 is often referred to as a statistical classifier.

C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy. The training data is a set S = {s_1, s_2, ...} of already classified samples. Each sample s_i consists of a p-dimensional vector (x_{1,i}, x_{2,i}, ..., x_{p,i}), where each x_{j,i} is the value of attribute (feature) j for that sample, together with the class to which s_i belongs.

At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. The splitting criterion is the normalized information gain (the reduction in entropy achieved by the split, divided by the entropy of the split itself), also known as the gain ratio. The attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then recurses on the resulting smaller subsets.
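The splitting criterion described above can be sketched in a few lines of plain Python. This is only an illustration for categorical attributes: the function names (`entropy`, `gain_ratio`, `best_attribute`) and the toy weather data are assumptions made for this example, and the sketch leaves out the parts of C4.5 that deal with continuous attributes, missing values and pruning.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(rows, labels, attr):
    """Information gain of splitting on `attr`, divided by the split information."""
    total = len(rows)
    # Group the class labels by the value each row takes for this attribute.
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = -sum(len(g) / total * log2(len(g) / total) for g in groups.values())
    return gain / split_info if split_info > 0 else 0.0


def best_attribute(rows, labels, attrs):
    """Pick the attribute with the highest gain ratio."""
    return max(attrs, key=lambda a: gain_ratio(rows, labels, a))


# Hypothetical toy data: "outlook" separates the classes, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]
chosen = best_attribute(rows, labels, ["outlook", "windy"])
print(chosen)
```

A full tree builder would call `best_attribute` at each node, split the rows by the chosen attribute's values, and repeat on each subset.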

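scikit-learn's decision trees module implements an optimized version of the related CART algorithm rather than C4.5 itself; its decision tree classifier can be configured to use an entropy-based splitting criterion.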


k-means

k-means clustering aims to partition n observations into k clusters, in which each observation belongs to the cluster with the nearest mean. That mean serves as a prototype of the cluster.
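One common way to compute such a partition is Lloyd's iteration, which alternates between assigning points to the nearest mean and recomputing the means. The sketch below is a minimal pure-Python version; the function name `kmeans`, the starting centroids and the sample points are assumptions chosen for illustration, and real libraries add smarter initialization and stopping rules.

```python
def kmeans(points, centroids, max_iterations=100):
    """Lloyd's algorithm: alternate assignment and update steps until stable."""
    centroids = list(centroids)
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two well-separated groups of 2D points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

The result depends on the starting centroids, which is why practical implementations usually run the procedure several times from different starting points.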

Python resources:

k-means clustering is available in SciPy and scikit-learn. There is also a Python wrapper for a basic C implementation.

Support vector machines

Support vector machines (SVMs) are supervised learning models, with associated learning algorithms, that analyze data and recognize patterns. They are used for classification and regression analysis. Given a set of labeled training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other.
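To make the idea concrete, here is a rough sketch of a linear SVM trained by subgradient descent on the regularized hinge loss. It is not how production SVM libraries work (they typically use dedicated solvers and support kernels); the function names, the hyperparameter values and the toy data are all assumptions for this example.

```python
def train_linear_svm(samples, labels, lam=0.01, epochs=200, lr=0.1):
    """Per-sample subgradient descent on lam/2 * ||w||^2 + hinge loss."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):  # each y must be +1 or -1
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The regularization term shrinks w a little on every step.
            w = [wi - lr * lam * wi for wi in w]
            if margin < 1:
                # Sample is misclassified or inside the margin: push it out.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Classify x by the side of the hyperplane it falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Hypothetical, linearly separable training data with labels +1 and -1.
samples = [(-2.0, -1.0), (-1.0, -2.0), (-2.0, -2.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
labels = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(samples, labels)
print(predict(w, b, (3.0, 3.0)), predict(w, b, (-3.0, -3.0)))
```

The condition `margin < 1` is what distinguishes this from a plain perceptron: points that are classified correctly but lie too close to the boundary still trigger an update.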

Python resources:

SVMs are available in scikit-learn and PyML.


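Apriori

The Apriori algorithm finds itemsets that occur frequently together in transaction data and is used for discovering interesting relations (association rules) between variables in large databases.

Python resources:

A Python implementation is available.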
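The frequent-itemset part of Apriori can be sketched as a level-wise search: itemsets of size k are only considered if all of their subsets of size k-1 were already found to be frequent. The code below is a small illustrative version; the function name `apriori`, the `min_support` count and the shopping-basket data are assumptions for this example, and rule generation is left out.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for itemsets seen in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = count({frozenset([i]) for i in items})
    frequent = dict(current)
    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = count(candidates)
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]
frequent = apriori(baskets, min_support=3)
print(frequent[frozenset({"diapers", "beer"})])
```

The prune step is where the name comes from: prior knowledge that a smaller itemset is infrequent rules out every larger itemset containing it.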


Expectation Maximization

An expectation–maximization (EM) algorithm is an iterative method for finding maximum likelihood estimates of parameters in statistical models. It is used in cases where the equations cannot be solved directly.
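As an illustration, the sketch below applies EM to a mixture of two one-dimensional Gaussians, a standard case where the maximum likelihood equations have no closed-form solution. The function names, the fixed number of iterations, the starting means and the sample data are assumptions for this example only.

```python
from math import exp, pi, sqrt


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return exp(-((x - mean) ** 2) / (2 * var)) / sqrt(2 * pi * var)


def em_two_gaussians(data, initial_means, iterations=50):
    """Fit a two-component 1D Gaussian mixture by expectation-maximization."""
    means = list(initial_means)
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            p = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([value / total for value in p])
        # M-step: re-estimate the parameters from the weighted data.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            spread = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(spread, 1e-6)  # guard against a collapsing component
    return weights, means, variances


# Hypothetical data drawn around two centres, roughly 1 and 5.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances = em_two_gaussians(data, initial_means=[0.0, 6.0])
print(means)
```

Each iteration cannot decrease the likelihood of the data, but the procedure may settle in a local optimum, so the result can depend on the starting values.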

Python resources:



PageRank

PageRank is perhaps the most popular algorithm on this list. It is the best-known algorithm used by Google to rank websites in its search engine results.

It is a link analysis algorithm that assigns a numerical weight, called the PageRank, to each element of a hyperlinked set of documents, with the purpose of measuring its relative importance within the set.
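The weighting can be approximated by power iteration over the link graph. The sketch below uses the commonly quoted damping factor of 0.85; the function name `pagerank`, the iteration count and the four-page graph are assumptions for illustration, not a description of any search engine's production system.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power iteration over a dict mapping each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page passes its rank on equally to the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: page D has no inbound links.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = pagerank(links)
print(max(rank, key=rank.get))
```

In this graph, page C collects links from every other page and ends up with the largest weight, while D, which nobody links to, keeps only the baseline share.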


Python resources:

An implementation in Python is available.



AdaBoost

AdaBoost is used in conjunction with many other types of learning algorithms to improve their performance. The outputs of the other learning algorithms, called weak learners, are combined into a weighted sum that gives the final output of the boosted classifier.
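A small sketch may help show how the weighted sum is built. Here the weak learners are one-dimensional decision stumps (a single threshold test), and labels are +1 or -1. The function names, the number of rounds and the toy data are assumptions for this example.

```python
from math import exp, log


def best_stump(xs, ys, weights):
    """Find the (error, threshold, polarity) stump with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x >= threshold else -polarity for x in xs]
            error = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost(xs, ys, rounds=3):
    """Train `rounds` stumps, reweighting the samples after each one."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = max(error, 1e-10)  # avoid division by zero for a perfect stump
        alpha = 0.5 * log((1 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # Raise the weight of misclassified samples, lower the rest, renormalize.
        preds = [polarity if x >= threshold else -polarity for x in xs]
        weights = [w * exp(-alpha * y * p) for w, y, p in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Sign of the alpha-weighted sum of the stump votes."""
    score = sum(a * (p if x >= t else -p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical data that no single threshold can separate.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

On this data the best single stump still misclassifies three of the ten points, while the weighted combination of three stumps labels all of them correctly.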


Python resources:

AdaBoost is available in scikit-learn.


k-Nearest Neighbors

The k-nearest neighbours (k-NN) algorithm finds the k training examples closest to a query point in the feature space. When used for classification, it outputs a class membership decided by a majority vote of those k neighbours. When used for regression, it outputs the average of the values of the k nearest neighbours.
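Both uses fit in a few lines of plain Python. The sketch below uses Euclidean distance via `math.dist` (available from Python 3.8); the function names `knn_classify` and `knn_regress` and the sample data are assumptions for illustration, and the brute-force sort would be replaced by a spatial index for large datasets.

```python
from collections import Counter
from math import dist


def knn_classify(train, query, k=3):
    """Majority vote among the k training points closest to `query`."""
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]


def knn_regress(train, query, k=3):
    """Average of the target values of the k nearest training points."""
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    return sum(value for _, value in neighbours) / k


# Hypothetical labelled points for classification.
labelled = [
    ((1, 1), "red"), ((1, 2), "red"), ((2, 1), "red"),
    ((6, 6), "blue"), ((7, 6), "blue"), ((6, 7), "blue"),
]
# Hypothetical (feature, target) pairs for regression.
measured = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]

print(knn_classify(labelled, (2, 2)), knn_regress(measured, (2.0,)))
```

There is no training step: all the work happens at query time, which is why k-NN is often described as a lazy learner.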


Python resources:

k-nearest neighbours is available in scikit-learn.


Naïve Bayes

Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong independence assumptions between the features.
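For categorical features, the idea reduces to counting. The sketch below multiplies (in log space) the class prior by one conditional probability per feature, which is exactly where the independence assumption enters. The function names, the add-one (Laplace) smoothing and the toy weather data are assumptions for this example.

```python
from collections import Counter, defaultdict
from math import log


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(label, feature_index)][value] = number of occurrences
    value_counts = defaultdict(Counter)
    vocab = defaultdict(set)  # distinct values seen for each feature
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
            vocab[i].add(value)
    return class_counts, value_counts, vocab


def predict(model, row):
    """Return the class with the highest posterior score for `row`."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        # log P(class) + sum of log P(value | class), with add-one smoothing.
        score = log(count / total)
        for i, value in enumerate(row):
            seen = value_counts[(label, i)][value]
            score += log((seen + 1) / (count + len(vocab[i])))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training data: (outlook, temperature) -> play outside?
rows = [
    ("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
    ("rainy", "cool"), ("sunny", "hot"), ("rainy", "cool"),
]
labels = ["no", "no", "yes", "yes", "no", "yes"]
model = train_naive_bayes(rows, labels)
print(predict(model, ("sunny", "hot")), predict(model, ("rainy", "mild")))
```

Working with logarithms avoids numerical underflow when many small probabilities are multiplied together.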


Python resources:




Classification and regression trees

Python resources:


A continuously updated list of Python resources for these algorithms is available on Pansop.
