**Some of the rarely shared trade secrets in machine learning: Original post: on linkedin**

**1. Bootstrap sampling & the magic number 0.63**

**2. Random search or Grid search**

**3. k -fold Cross validation: ***K=10 the mysterious optimal number of folds*

**4****. Selecting Kernels for Support Vector Machines: ***When do linear kernels work even better than the complex polynomial kernels*

**5. Testing ML systems : ***Don't ship it, just because it compiles *

6. **Bias Variance Tradeoff & Debugging models: ***The learning curve - Art of figuring out if you need more instances or more dimensions*

7. **Choosing 'mtry' in a random forest model**

8**. Learning Rates **

9**. Choosing Loss Functions**

==============================================================================================================

1. **Bootstrapping & the magic number 0.63** Bootstrap sampling is performed by randomly sampling data with replacement; which can be easily modeled as binomial trials where an instance from the data being selected be treated as "success". If there are n number of instances in data, probability of 'success' is 1/n and for the failure, its (n-1)/n.

for sample size b, the probability of exactly x number of successes would be,

In the specific case of a bootstrap sample, the sample size b equals the number of instances n. Letting n approach infinity, we get:

for x = 0, the probability is roughly 1/e ~ 0.368 . Thus the probability of the instance being selected atleast once is 1-1/e ~ 0.632

2. **Grid Search or random search :**

Grid search and random search are the methods to tune hyper parameters and look for the optimal one. Grid search is computationally expensive as it checks for all the possible combinations of the parameters specified and evaluates on the same. Lets say if two parameters are A and B , and the possible ranges specified are 0-2 and 0-3 respectively ; The possible combinations in the parameter space in case of grid search would be (0,0) (0,1) (0,2) (0,3) ...........(2,2) (2,3).

Although grid search can be made to run in parallel, still the technique is not computationally very efficient . Whereas, randoms search has a stark difference from grid search; instead of searching over all the possible combinations, it just searches over a random set of points on the grid. Random search was not taken very seriously at the start, it was considered sub-optimal as it does not search over all the possible grid points.

* 60 randomly sampled points from the grid can be just as good as the computationally expensive 'grid search' !*

Lets consider the 5 % interval around the true maximum, now we sample points and check if it lands inside that maximum. Each random draw would have a probability of 0.05 to land in that interval and 0.95 to miss. If we draw n points independently, then the probability that none of them misses would be 1-(0.95^n )

If we look for a minimum of 0.95 probability of success,

1- 0.95^n > 0.95

=> n >=60

**3. k -fold Cross validation: ***K=10 the mysterious optimal number of folds*

Usually a larger value of k is preferred, as it ensures a larger number of performance estimates and training size in case of large k fold would be closer to the complete data set. However, with the increase in k, the overlap between training sets also increases. For example, with 5-fold cross-validation, each training set shares only 3∕4 of its instances with each of the other four training sets whereas with 10-fold cross- validation, each training set shares 8 ∕ 9 of its instances with each of the other 9 training sets.

Also , with the increasing K, the test data size decreases, which causes the performance metrics to be lesser precise. e.g if there are 10 instances in test data, one can only measure accuracy to the nearest 10% , but if the test data has 20 instances, the accuracy can be measured up to the nearest 5%.

*Considering all these factors, k=10 is considered to be the ideal number of folds used for cross validation, something which is observed to **perform well in most of the scenarios . Can this hypothesis be mathematically proved as well?*

**4. How to apply SVM on a real world example?**

a.Convert categorical variables into numeric data by using binary dummies.

For example, if there is a variable gender. There are two possible values for this categorical attribute. We can create binary dummies Gender_male and gender_female as two variables replacing the variable gender. Thus each of the instance where the “Gender” was male , in place of that we will Gender_male as 1 and gender_female as 0 .

b. *Scale the data appropriately:*

it is important to scale or normalize the data before applying SVM to counter any computational challenges.Also, it may lead to issues if there is huge variance in data.

*Selecting the kernel function:*

You can randomly select a kernel function and estimate the best parameters for the chosen kernel. Test the performance and further tune or select a different kernel function. Normally, radial basis function RBF is the first choice of kernel. It performs reasonably well when the decision boundary is non-linear. As it can effectively map the samples into the high dimensional space, which linear classifier can not do. Also, the number of hyper parameters that we need to optimize in order to get the best set of parameters to apply in case of RBF is lesser than the other polynomial kernel functions.

It is also important to note that, when the dimensions are very high compared to the number of instances available to train on, linear kernel would work better than RBF.

**When to apply a linear kernel?** 1. When number of dimensions is significantly larger than number of instances: In such a case , an explicit mapping to high dimensional space does not add any value. Thus even a linear kernel performs better.

2. When the dimensions as well as instances are large in number.

3. Number of instances is significantly larger than the number of dimensions.

**5. Testing ML systems : **Don't ship it, just because it compiles

*It is important to perform tests on ML systems before putting into production:*

*For a generative learner*,write a generator and generate training/test data from the assumed distribution

– The learner should (usually) recover the actual parameters of the generator, given enough data

– Test it on the abnormal cases(eg, uniform class priors, highly skewed multinomials)

*For a discriminative learner*, apply the same as above, and also check for :

Does taking one gradient step (on a sample task) lower the loss on the training data?

Does it lower the loss as expected?

(f(x)-f(x+d))/d should approximate f’(x)

Does regularization work as expected?

Record training set/test set loss – with and without regularization

6. **Bias Variance Tradeoff & Debugging models**

**** to be continued....**

** **

© 2019 Data Science Central ® Powered by

Badges | Report an Issue | Privacy Policy | Terms of Service

**Most Popular Content on DSC**

To not miss this type of content in the future, subscribe to our newsletter.

**Technical**

- Free Books and Resources for DSC Members
- Learn Machine Learning Coding Basics in a weekend
- New Machine Learning Cheat Sheet | Old one
- Advanced Machine Learning with Basic Excel
- 12 Algorithms Every Data Scientist Should Know
- Hitchhiker's Guide to Data Science, Machine Learning, R, Python
- Visualizations: Comparing Tableau, SPSS, R, Excel, Matlab, JS, Pyth...
- How to Automatically Determine the Number of Clusters in your Data
- New Perspectives on Statistical Distributions and Deep Learning
- Fascinating New Results in the Theory of Randomness
- Long-range Correlations in Time Series: Modeling, Testing, Case Study
- Fast Combinatorial Feature Selection with New Definition of Predict...
- 10 types of regressions. Which one to use?
- 40 Techniques Used by Data Scientists
- 15 Deep Learning Tutorials
- R: a survival guide to data science with R

**Non Technical**

- Advanced Analytic Platforms - Incumbents Fall - Challengers Rise
- Difference between ML, Data Science, AI, Deep Learning, and Statistics
- How to Become a Data Scientist - On your own
- 16 analytic disciplines compared to data science
- Six categories of Data Scientists
- 21 data science systems used by Amazon to operate its business
- 24 Uses of Statistical Modeling
- 33 unusual problems that can be solved with data science
- 22 Differences Between Junior and Senior Data Scientists
- Why You Should be a Data Science Generalist - and How to Become One
- Becoming a Billionaire Data Scientist vs Struggling to Get a $100k Job
- Why do people with no experience want to become data scientists?

**Articles from top bloggers**

- Kirk Borne | Stephanie Glen | Vincent Granville
- Ajit Jaokar | Ronald van Loon | Bernard Marr
- Steve Miller | Bill Schmarzo | Bill Vorhies

**Other popular resources**

- Comprehensive Repository of Data Science and ML Resources
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- 100 Data Science Interview Questions and Answers
- Cheat Sheets | Curated Articles | Search | Jobs | Courses
- Post a Blog | Forum Questions | Books | Salaries | News

**Archives**: 2008-2014 | 2015-2016 | 2017-2019 | Book 1 | Book 2 | More

**Most popular articles**

- Free Book and Resources for DSC Members
- New Perspectives on Statistical Distributions and Deep Learning
- Time series, Growth Modeling and Data Science Wizardy
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- Comprehensive Repository of Data Science and ML Resources
- Advanced Machine Learning with Basic Excel
- Difference between ML, Data Science, AI, Deep Learning, and Statistics
- Selected Business Analytics, Data Science and ML articles
- How to Automatically Determine the Number of Clusters in your Data
- Fascinating New Results in the Theory of Randomness
- Hire a Data Scientist | Search DSC | Find a Job
- Post a Blog | Forum Questions

## You need to be a member of Data Science Central to add comments!

Join Data Science Central