A myriad of options exist for classification. In general, there isn't a single "best" option for every situation. That said, three popular classification methods (Decision Trees, k-NN, and Naive Bayes) can be tweaked for practically every situation.
Naive Bayes and k-NN are both examples of supervised learning (where the…Continue
Added by Stephanie Glen on June 19, 2019 at 6:49am — No Comments
Added by Stephanie Glen on June 15, 2019 at 7:53am — No Comments
R-squared can help you answer the question "How does my model perform, compared to a naive model?" However, R-squared is far from a perfect tool. Probably the main issue is that every data set contains a certain amount of unexplainable variation. R-squared can't tell the difference between the explainable and the…Continue
Added by Stephanie Glen on June 10, 2019 at 5:30am — No Comments
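As a rough illustration of the "compared to a naive model" idea in the R-squared post, here is a minimal sketch. It assumes the naive model is one that always predicts the mean of the observed values; the function name and the sample numbers are made up for the example.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, where SS_tot is the squared error
    of a naive model that always predicts the mean of y_true."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical observations and two sets of predictions.
y = [3.0, 5.0, 7.0, 9.0]
naive_predictions = [6.0, 6.0, 6.0, 6.0]   # always predict the mean
model_predictions = [3.5, 4.5, 7.5, 8.5]   # a model that tracks the data

r2_naive = r_squared(y, naive_predictions)
r2_model = r_squared(y, model_predictions)
print(r2_naive, r2_model)
```

A score near 0 means the model does no better than the naive mean; a score near 1 means it leaves very little squared error behind.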
Added by Stephanie Glen on May 31, 2019 at 8:00am — No Comments
Cross Validation explained in one simple picture. The method shown here is k-fold cross validation, where data is split into k folds (in this example, 5 folds). Blue balls represent training data; 1/k (i.e. 1/5) balls are held back for model testing.
Monte Carlo cross validation works in a similar way, except that the test balls are drawn at random afresh for each split instead of coming from fixed folds. In other words, it would be possible for a ball to appear in more than one test sample (or in none).…Continue
Added by Stephanie Glen on May 25, 2019 at 8:30am — No Comments
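The k-fold split described in the cross validation post can be sketched in a few lines. This is only an illustration under simple assumptions: the "balls" are represented by index numbers, and the helper name `k_fold_splits` is hypothetical.

```python
import random


def k_fold_splits(n_items, k, seed=0):
    """Yield (train, test) index lists; each item is tested exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds, like dealing cards.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 10 balls, 5 folds: each round holds back 1/5 of the balls for testing.
splits = list(k_fold_splits(10, 5))
for train, test in splits:
    print("train:", sorted(train), "test:", sorted(test))
```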
Confidence intervals (CIs) tell you how much uncertainty a statistic has. The intervals are connected to confidence levels and the two terms are easily confused, especially if you're new to statistics. Confidence Intervals in One Picture is an intro to CIs, and explains how each part interacts with margins of error and where the different components come…Continue
Added by Stephanie Glen on May 17, 2019 at 10:00am — No Comments
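To make the link between confidence level, margin of error, and interval concrete, here is a small sketch for the mean of a sample. It assumes a normal approximation (a z-score rather than a t-score, which would be the better choice for small samples), and the data are invented.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Return (low, high, margin) for the sample mean, normal approximation."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    margin = z * stdev(sample) / sqrt(len(sample))   # margin of error
    center = mean(sample)
    return center - margin, center + margin, margin


sample = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 5.1, 4.9]   # hypothetical measurements
low95, high95, margin95 = mean_confidence_interval(sample, 0.95)
low99, high99, margin99 = mean_confidence_interval(sample, 0.99)
print(low95, high95)
print(low99, high99)
```

Raising the confidence level widens the interval, because the margin of error grows with the z-score.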
The lifecycle of data travels through six phases:
The lifecycle "wheel" isn't set in stone. While it's common to move through the phases in order, it's possible to move in either direction (i.e. forward, backward) at any stage in the cycle. Work can also happen in several phases at the same time, or you can skip over…Continue
If you want to determine the optimal number of clusters in your analysis, you're faced with an overwhelming number of (mostly subjective) choices. Note that there's no "best" method, no "correct" k, and there isn't even a consensus as to the definition of what a "cluster" is. With that said, this picture focuses on three popular methods that should fit almost every need: Silhouette, Elbow, and Gap Statistic.…Continue
Added by Stephanie Glen on April 28, 2019 at 12:30am — No Comments
Naive Bayes is a deceptively simple way to find answers to probability questions that involve many inputs. For example, if you're a website owner, you might be interested to know the probability that a visitor will make a purchase. That question has a lot of "what-ifs", including time on page, pages visited, and prior visits. Naive Bayes essentially allows you to take the raw inputs (i.e. historical data), sort the data into more meaningful chunks, and input them into a formula. …Continue
Added by Stephanie Glen on April 25, 2019 at 10:00am — No Comments
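The website-purchase example from the Naive Bayes post might look something like the sketch below. The visitor records, the category names, and the use of add-one (Laplace) smoothing are all assumptions made for the illustration.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count how often each feature value occurs within each class."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)   # (class, feature index) -> value counts
    values = defaultdict(set)               # feature index -> values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(label, i)][value] += 1
            values[i].add(value)
    return class_counts, feature_counts, values


def predict_proba(model, row):
    """Prior times per-feature likelihoods, normalized across classes."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total
        for i, value in enumerate(row):
            # Add-one smoothing keeps unseen values from zeroing the product.
            score *= (feature_counts[(label, i)][value] + 1) / (count + len(values[i]))
        scores[label] = score
    norm = sum(scores.values())
    return {label: score / norm for label, score in scores.items()}


# Hypothetical history: (time on page, prior visit) -> outcome.
rows = [("long", "yes"), ("long", "yes"), ("long", "no"), ("short", "yes"),
        ("short", "no"), ("short", "no"), ("short", "no"), ("long", "no")]
labels = ["buy", "buy", "buy", "no_buy", "no_buy", "no_buy", "no_buy", "no_buy"]

model = train_naive_bayes(rows, labels)
probs = predict_proba(model, ("long", "yes"))
print(probs)
```

The "naive" part is the multiplication inside the loop: each input is treated as if it were independent of the others given the outcome.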
Bayes’ Theorem is a way to calculate conditional probability. The formula is very simple to calculate, but it can be challenging to fit the right pieces into the puzzle. The first challenge comes from defining your event (A) and test (B); the second challenge is rephrasing your question so that you can work backwards: turning P(A|B) into P(B|A). The following image shows a…Continue
Added by Stephanie Glen on April 12, 2019 at 6:30am — No Comments
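Once the event (A) and test (B) are defined, the calculation itself is short. The sketch below works backwards from P(B|A) to P(A|B); the prevalence and test rates are invented numbers, and the function name is hypothetical.

```python
def bayes_posterior(p_a, p_b_given_a, p_b_given_not_a):
    """P(A|B) = P(B|A) * P(A) / P(B), with P(B) from the law of total probability."""
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    return p_b_given_a * p_a / p_b


# Hypothetical numbers: A = "has the condition", B = "test is positive".
posterior = bayes_posterior(p_a=0.01, p_b_given_a=0.90, p_b_given_not_a=0.05)
print(round(posterior, 4))
```

With these made-up inputs, a positive test still leaves the probability of A well below one half, because A is rare to begin with.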
A non-technical look at A/B testing, based on Dan Siroker & Pete Koomen's book, A/B Testing: The Most Powerful Way to Turn Clicks Into Customers.
Perhaps the two most important points:
Added by Stephanie Glen on April 3, 2019 at 4:30pm — No Comments
Ensemble methods take several machine learning techniques and combine them into one predictive model. It is a two-step process:
Added by Stephanie Glen on March 27, 2019 at 3:30pm — No Comments
SVMs (Support Vector Machines) are a way to classify data by finding the optimal line, plane, or hyperplane that separates the data. In 2D, the separation is a line; in 3D, it's a plane; in higher dimensions, it's a hyperplane. For simplicity, the following picture shows how SVM works for a two-dimensional set.
Regression means fitting data to a line or curve (i.e. finding an average of sorts) so you can fit data to a particular equation and make predictions for your data. Logistic regression is a good choice when modeling binary variables, which occur frequently in real life (e.g. work or don't work, marry or don't marry, buy a house or rent...). The logistic regression model is…Continue
Added by Stephanie Glen on March 22, 2019 at 11:30am — No Comments
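For a binary outcome, the equation being fitted is usually written with the logistic (sigmoid) function, which squeezes any number into a probability between 0 and 1. The sketch below only shows the prediction step; the intercept and slope are made-up values, not fitted ones.

```python
import math


def predict_probability(x, intercept, slope):
    """Logistic function applied to a linear score: 1 / (1 + e^-(b0 + b1*x))."""
    return 1 / (1 + math.exp(-(intercept + slope * x)))


# Hypothetical coefficients, e.g. x = years of saving, outcome = buys a house.
for x in (0, 2, 4, 6, 8):
    print(x, round(predict_probability(x, intercept=-4.0, slope=1.0), 3))
```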
This is a simple overview of the k-NN process. Perhaps the most challenging step is finding a k that's "just right". The square root of n can put you in the ballpark, but ideally you should use a training set (i.e. a nicely categorized set) to find a "k" that works for your data. Remove a few categorized data points and make them "unknowns", testing a few values for k to see what works.…Continue
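The "hold back a few categorized points and try a few values of k" idea can be sketched as follows. The points, labels, and helper names are hypothetical, and plain Euclidean distance with a majority vote is assumed.

```python
from collections import Counter
from math import dist


def knn_predict(train, point, k):
    """Label a point by majority vote among its k nearest training points."""
    neighbors = sorted(train, key=lambda item: dist(item[0], point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def accuracy_for_k(train, held_out, k):
    """Fraction of held-out ("unknown") points that k-NN labels correctly."""
    hits = sum(knn_predict(train, point, k) == label for point, label in held_out)
    return hits / len(held_out)


# Hypothetical categorized data: two groups in two dimensions.
train = [((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"), ((2, 2), "a"),
         ((6, 6), "b"), ((6, 7), "b"), ((7, 6), "b"), ((7, 7), "b")]
held_out = [((1.5, 1.5), "a"), ((6.5, 6.5), "b")]   # pretend these are unknowns

# The square root of n (here n = 8) suggests k near 3; odd values avoid ties.
results = {k: accuracy_for_k(train, held_out, k) for k in (1, 3, 5)}
print(results)
```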
Determining sample sizes is a challenging undertaking. For simplicity, I've limited this picture to one of the most common testing situations: testing for differences in means. Some assumptions have been made (for example, normality and…Continue
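One common textbook calculation for a difference in means is sketched below. It is not necessarily the calculation used in the picture: it assumes a two-sided test, two groups of equal size, normally distributed data, and a known common standard deviation, and all of the input numbers are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(sigma, delta, alpha=0.05, power=0.80):
    """n per group = 2 * (sigma * (z_(1-alpha/2) + z_power) / delta)^2, rounded up."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)            # desired power
    return ceil(2 * (sigma * (z_alpha + z_beta) / delta) ** 2)


# Hypothetical: detect a difference of 0.5 when the standard deviation is 1.
n = sample_size_per_group(sigma=1.0, delta=0.5)
print(n)
```

Halving the difference you want to detect roughly quadruples the sample size, since delta is squared in the formula.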
The EM algorithm finds maximum-likelihood estimates for model parameters when you have incomplete data. The "E-Step" finds probabilities for the assignment of data points, based on a set of hypothesized probability…Continue
Added by Stephanie Glen on March 9, 2019 at 9:00am — No Comments
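Here is a small sketch of the E-step and M-step loop for one specific case: a mixture of two one-dimensional normal distributions, where the "incomplete" part of the data is which distribution each point came from. The data, starting guesses, and function names are all assumptions for the example.

```python
from math import exp, pi, sqrt


def normal_pdf(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))


def em_two_gaussians(data, mu=(0.0, 6.0), sigma=(1.0, 1.0), weight=0.5, n_iter=50):
    """Estimate means, standard deviations, and mixing weight by EM."""
    mu1, mu2 = mu
    s1, s2 = sigma
    w = weight
    for _ in range(n_iter):
        # E-step: probability that each point belongs to component 1,
        # given the current (hypothesized) parameters.
        resp = []
        for x in data:
            a = w * normal_pdf(x, mu1, s1)
            b = (1 - w) * normal_pdf(x, mu2, s2)
            resp.append(a / (a + b))
        # M-step: re-estimate the parameters using those probabilities as weights.
        n1 = sum(resp)
        n2 = len(data) - n1
        mu1 = sum(r * x for r, x in zip(resp, data)) / n1
        mu2 = sum((1 - r) * x for r, x in zip(resp, data)) / n2
        s1 = sqrt(sum(r * (x - mu1) ** 2 for r, x in zip(resp, data)) / n1)
        s2 = sqrt(sum((1 - r) * (x - mu2) ** 2 for r, x in zip(resp, data)) / n2)
        w = n1 / len(data)
    return (mu1, mu2), (s1, s2), w


# Hypothetical data drawn from two well-separated groups.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, sigmas, mix = em_two_gaussians(data)
print(means, sigmas, mix)
```

The sketch has no safeguards for badly chosen starting values or collapsing variances, which a real implementation would need.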
There are dozens of different hypothesis tests, so choosing one can be a little overwhelming. The good news is that one of the more popular tests will usually do the trick, unless you have unusual data or are working within very specific guidelines (e.g. in medical research). The following picture shows several tests for a single population, and what…Continue
Added by Stephanie Glen on March 7, 2019 at 7:30am — No Comments
In the nascent field of Data Science, myths abound. Here are my top 10, scoured from the internet (where better to find a myth or two?).
This one is only part myth. Historically, women have been discouraged from entering the computing sciences for many reasons unrelated to talent (see my previous post,…Continue