.

For any machine learning problem, say a classifier in this case, it’s always handy to create quickly a base line classifier against which we can compare our new models. You don’t want to spend a lot of time creating these base line classifiers; you would rather spend that time in building and validating new features for your final model. In this post we will see how we can rapidly create base line classifier using scikit learn package for any dataset.

Let us use **iris** dataset for demonstration purpose.

`# Load Libraries import numpy as np from sklearn import datasets # Let us use Iris dataset iris = datasets.load_iris() x = iris.data y = iris.target `

Scikit provides the class *DummyClassifier* to help us create our base line model rapidly. Module **sklearn.dummy** has the DummyClassifier class. Its api interfaces are very similar to any other model in scikit learn, use the *fit* function to build the model and *predict* function to perform classification.

`from sklearn.dummy import DummyClassifier dummy = DummyClassifier(strategy='stratified', random_state = 100, constant = None) dummy.fit(x, y) `

Let us look at the parameters while initializing *DummyClassifier*. The first parameter *strategy* is used to define the modus operandi of our Dummy Classifier. In the example above we have selected *stratified* as the strategy. According to this strategy, the classifier looks at the class distribution in our target variable to make its predictions.

In Iris data set case, there are 150 total records and three classes. The classes are uniformly distributed, there are 50 records per each class. 33.33% is the class distribution.

`## Number of classes print "Number of classes :{}".format(dummy.n_classes_) ## Number of classes assigned to each tuple print "Number of classes assigned to each tuple :{}".format(dummy.n_outputs_) ### Prior distribution of the classes. print "Prior distribution of the classes {}".format(dummy.class_prior_) Number of classes :3 Number of classes assigned to each tuple :1 Prior distribution of the classes [ 0.33333333 0.33333333 0.33333333] `

Under the hood scikit learn uses *multinomial* distribution leveraging the class distribution values for performing the predictions.

The multinomial distribution is a multivariate generalisation of the binomial distribution. Take an experiment with one of p possible outcomes, in our case 3 possible outcomes.

Let us use *multinomial* distribution from numpy to show how scikit learn does it internally.

`output = np.random.multinomial(1, [.33,.33,.33], size = x.shape[0]) predictions = output.argmax(axis=1) print output[0] print predictions[1] `

Oh la.. our predictions are ready. Our *output* variable is a matrix of size (150,3), one dimension for each class. In the next line we use the argmax function to get the index which is set to 1. This is our predicted class. Now that we know what is happening under the hood, let us call the *predict* function and print some accuracy metrics for our dummy classifier.

`y_predicted = dummy.predict(x) print y_predicted # Find model accuracy print "Model accuracy = %0.2f"%(accuracy_score(y,y_predicted) * 100) + "%\n" # Confusion matrix print "Confusion Matrix" print confusion_matrix(y, y_predicted, labels=list(set(y))) `

In the case of *most_frequent*, the dummy classifier will always predict the label which occurs most frequently in the training set.

Will always predict the label provided by the user. While initializing the *DummyClassifier* class, we need to provide this label as a parameter, parameter name : constant.

Let us look a the models generated when our dataset is imbalanced. The most_frequent strategy we discussed will return a biased classifier, as they will tend to pick up the majority class. The accuracy score in this case will be proportional to the majority class ratio. Let us simulate an imbalanced dataset and create our dummy classifiers with different strategies.

`from sklearn.datasets import make_classification x, y = make_classification(n_samples = 100, n_features = 10, weights = [0.3, 0.7]) # Print the class distribution print np.bincount(y) `

As you can see the final print output is **[31 69]**, we have an imbalanced dataset. Let us proceed to build our dummy models with *most_frequent* and *stratified* strategies.

` dummy1 = DummyClassifier(strategy='stratified', random_state = 100, constant = None) dummy1.fit(x, y) y_p1 = dummy1.predict(x) dummy2 = DummyClassifier(strategy='most_frequent', random_state = 100, constant = None) dummy2.fit(x, y) y_p2 = dummy2.predict(x) `

And finally some metrics for our models.

`from sklearn.metrics import precision_score, recall_score, f1_score print print "########## Metrics #################" print " Dummy Model 1, strategy: stratified, accuracy {0:.2f}, precision {0:.2f}, recall {0:.2f}, f1-score {0:.2f}"\ .format(accuracy_score(y, y_p1), precision_score(y, y_p1), recall_score(y, y_p1), f1_score(y, y_p1)) print " Dummy Model 2, strategy: most_frequent, accuracy {0:.2f}, precision {0:.2f}, recall {0:.2f}, f1-score {0:.2f}"\ .format(accuracy_score(y, y_p2), precision_score(y, y_p2), recall_score(y, y_p2), f1_score(y, y_p2)) print ########## Metrics ################# Dummy Model 1, strategy: stratified, accuracy 0.58, precision 0.58, recall 0.58, f1-score 0.58 Dummy Model 2, strategy: most_frequent, accuracy 0.69, precision 0.69, recall 0.69, f1-score 0.69 `

Dummy Model 2, as expected is biased towards the majority class. Dummy model 1 may serve as a realistic baseline to begin with.

Strategy *prior* is very similar to stratified as it uses the prior distribution of classes.

`dummy3 = DummyClassifier(strategy='prior', random_state = 100, constant = None) dummy3.fit(x, y) y_p3 = dummy3.predict(x) print " Dummy Model 3, strategy: prior, accuracy {0:.2f}, precision {0:.2f}, recall {0:.2f}, f1-score {0:.2f}"\ .format(accuracy_score(y, y_p3), precision_score(y, y_p3), recall_score(y, y_p3), f1_score(y, y_p3)) `

Strategy *uniform* generates predictions uniformly at random.

`dummy4 = DummyClassifier(strategy='uniform', random_state = 100, constant = None) dummy4.fit(x, y) y_p4 = dummy4.predict(x) print " Dummy Model 4, strategy: uniform, accuracy {0:.2f}, precision {0:.2f}, recall {0:.2f}, f1-score {0:.2f}"\ .format(accuracy_score(y, y_p4), precision_score(y, y_p4), recall_score(y, y_p4), f1_score(y, y_p4)) `

That is all for this post. Will continue with another post explaining *DummyRegressor*.

For original post, click here

**I**

© 2021 TechTarget, Inc. Powered by

Badges | Report an Issue | Privacy Policy | Terms of Service

**Most Popular Content on DSC**

To not miss this type of content in the future, subscribe to our newsletter.

- Book: Applied Stochastic Processes
- Long-range Correlations in Time Series: Modeling, Testing, Case Study
- How to Automatically Determine the Number of Clusters in your Data
- New Machine Learning Cheat Sheet | Old one
- Confidence Intervals Without Pain - With Resampling
- Advanced Machine Learning with Basic Excel
- New Perspectives on Statistical Distributions and Deep Learning
- Fascinating New Results in the Theory of Randomness
- Fast Combinatorial Feature Selection

**Other popular resources**

- Comprehensive Repository of Data Science and ML Resources
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- 100 Data Science Interview Questions and Answers
- Cheat Sheets | Curated Articles | Search | Jobs | Courses
- Post a Blog | Forum Questions | Books | Salaries | News

**Archives:** 2008-2014 |
2015-2016 |
2017-2019 |
Book 1 |
Book 2 |
More

**Most popular articles**

- Free Book and Resources for DSC Members
- New Perspectives on Statistical Distributions and Deep Learning
- Time series, Growth Modeling and Data Science Wizardy
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- Comprehensive Repository of Data Science and ML Resources
- Advanced Machine Learning with Basic Excel
- Difference between ML, Data Science, AI, Deep Learning, and Statistics
- Selected Business Analytics, Data Science and ML articles
- How to Automatically Determine the Number of Clusters in your Data
- Fascinating New Results in the Theory of Randomness
- Hire a Data Scientist | Search DSC | Find a Job
- Post a Blog | Forum Questions

## You need to be a member of Data Science Central to add comments!

Join Data Science Central