A myriad of options exist for classification, and in general there isn't a single "best" option for every situation. That said, three popular classification methods (Decision Trees, k-NN, and Naive Bayes) can be tweaked for practically every situation.

Overview

Naive Bayes and k-NN are both examples of supervised learning (where the data comes already labeled). Decision trees are easy to use for small numbers of classes. If you're trying to decide between the three, your best option is to take all three for a test drive on your data and see which produces the best results.

If you're new to classification, a decision tree is probably your best starting point. It will give you a clear visual, and it's ideal for getting a grasp on what classification is actually doing. k-NN comes in a close second; although the math behind it is a little daunting, you can still create a visual of the nearest-neighbor process to understand what's happening. Finally, you'll want to dig into Naive Bayes. The math is complex, but the result is a process that's highly accurate and fast, especially when you're dealing with Big Data.

Where Bayes Excels

1. Naive Bayes is a linear classifier while k-NN is not, and it tends to be faster when applied to big data. In comparison, k-NN is usually slower for large amounts of data, because of the calculations required for each new step in the process. If speed is important, choose Naive Bayes over k-NN.

2. In general, Naive Bayes is highly accurate when applied to big data. Don't discount k-NN when it comes to accuracy, though; as the value of k in k-NN increases, the error rate decreases until it approaches that of the ideal Bayes classifier (for k→∞).

3. Naive Bayes offers you two hyperparameters to tune for smoothing: alpha and beta. A hyperparameter is a prior parameter that is tuned on the training set to optimize the model. In comparison, k-NN only has one option for tuning: "k", the number of neighbors.

4. Naive Bayes is not affected by the curse of dimensionality and large feature sets, while k-NN has problems with both.

5. For tasks like robotics and computer vision, Bayes outperforms decision trees.

Where k-NN Excels

1. Naive Bayes assumes that features are conditionally independent given the class; if that assumption is badly violated, classification accuracy will suffer, and you'll want to choose k-NN over Naive Bayes. Naive Bayes can also suffer from the zero-probability problem: when a particular attribute's conditional probability equals zero, Naive Bayes will completely fail to produce a valid prediction. This can be fixed with a Laplacian estimator, but k-NN could end up being the easier choice.

2. Naive Bayes will only work well if the decision boundary is linear, elliptic, or parabolic. Otherwise, choose k-NN.

3. Naive Bayes requires that you know the underlying probability distributions for categories. The algorithm compares all other classifiers against this ideal; therefore, unless you know the probabilities and pdfs, use of the ideal Bayes is unrealistic. In comparison, k-NN doesn't require that you know anything about the underlying probability distributions.

4. k-NN doesn't require any training; you just load the dataset and off it runs. On the other hand, Naive Bayes does require training.

5. k-NN (and Naive Bayes) outperform decision trees when it comes to rare occurrences. For example, if you're classifying types of cancer in the general population, many cancers are quite rare. A decision tree will almost certainly prune those important classes out of your model. If you have any rare occurrences, avoid decision trees.

Where Decision Trees Excel

Image: Decision tree for a mortgage lender.

1. Of the three methods, decision trees are the easiest to explain and understand. Most people understand hierarchical trees, and the availability of a clear diagram can help you communicate your results. Conversely, the underlying mathematics behind Bayes' Theorem can be very challenging for the layperson to understand. k-NN sits somewhere in the middle; theoretically, you could reduce the k-NN process to an intuitive graphic, even if the underlying mechanism is probably beyond a layperson's level of understanding.

2. Decision trees have easy-to-use features to identify the most significant dimensions, handle missing values, and deal with outliers.

3. Although over-fitting is a major problem with decision trees, the issue can (at least in theory) be avoided by using boosted trees or random forests. In many situations, boosting or random forests can result in trees outperforming either Bayes or k-NN. The downside is that those add-ons inject a layer of complexity and detract from the method's major advantage: its simplicity. More branches on a tree mean more chance of over-fitting, so decision trees work best for a small number of classes. For example, the above image only results in two classes: proceed, or do not proceed.

4. Unlike Bayes and k-NN, decision trees can work directly from a table of data, without any prior design work.

5. If you don't know your classifiers, a decision tree will choose those classifiers for you from a data table. Naive Bayes requires you to know your classifiers in advance.
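The zero-probability problem and its Laplacian (alpha-smoothing) fix are easy to see in miniature. Below is a minimal sketch in plain Python; the tiny weather/play dataset and the `likelihood` helper are made up purely for illustration, not taken from any particular library.

```python
from collections import Counter

# Hypothetical toy training set: one categorical feature (weather)
# and a class label (whether to play). Data is invented for illustration.
features = ["sunny", "sunny", "rainy", "rainy", "overcast"]
labels   = ["yes",   "no",    "no",    "no",    "yes"]

def likelihood(value, label, alpha=0.0):
    """Estimate P(feature = value | class = label).

    alpha is the Laplace smoothing hyperparameter discussed above;
    alpha = 0 gives the raw, unsmoothed estimate.
    """
    in_class = [f for f, l in zip(features, labels) if l == label]
    counts = Counter(in_class)
    n_values = len(set(features))  # number of distinct feature values (3 here)
    return (counts[value] + alpha) / (len(in_class) + alpha * n_values)

# "overcast" never occurs alongside class "no", so the raw estimate is
# exactly zero -- any product of likelihoods using it collapses to zero:
print(likelihood("overcast", "no"))              # 0.0
# Laplace smoothing keeps the estimate small but nonzero: (0 + 1)/(3 + 3) = 1/6
print(likelihood("overcast", "no", alpha=1.0))
```

With alpha > 0, an attribute value that never co-occurred with a class in training still gets a small positive probability, so a single unseen combination no longer zeroes out the whole prediction.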

The lifecycle of data travels through six phases. The lifecycle "wheel" isn't set in stone: while it's common to move through the phases in order, it's possible to move in either direction (forward or backward) at any stage in the cycle. Work can also happen in several phases at the same time, or you can skip over entire phases. In addition, if new information is uncovered, work can return to an earlier phase to start the cycle over again.

Note that the term "data lifecycle" is itself up for debate, as data doesn't really evolve and grow the way a seed or an egg would. Some authors add different stages, for example, purging. That addition might not be accurate, as it's not common for data to be deleted out of existence; it's much more likely to be stored or archived (the equivalent of suspended animation). To add to the confusion, different people may call different parts of the wheel something slightly different. For example, "data prep" might be called "data capture". This simplified model gives you a starting point from which to build a data lifecycle that works for your organization.

1. Discovery

In this initial phase, you'll develop clear goals and a plan for how to achieve them. You'll want to identify where your data is coming from and what story you want your data to tell. If you plan on hypothesis testing your data, this is the stage where you'll develop a clear hypothesis and decide which hypothesis tests you'll use (for an overview, see: hypothesis tests in one picture). One way to think about this phase is that you're focusing on the business requirements rather than the data itself. Data can be collected in this stage, but you won't be working with it yet; you'll just identify rough or vague areas of data that might be applicable to your goals.

I like to think of data collection here as doing research in a public library. At the initial stages of research, you simply grab every book in sight that has some connection to your topic. Then you sit down and sort through the books, casting aside those that aren't particularly relevant. That "sit down" stage is the next step: data prep.

2. Data Prep

In this second stage, the focus shifts from business requirements to data requirements. Data prep covers every task involved with collecting, processing, and cleansing data. Perhaps one of the most important parts of this step is making sure that the data you need is actually available. Raw data is preferable to aggregate data, although both types may be useful for comparison purposes. You may need to adjust the amount or type of data you need, depending on what data is available. In this early phase, data is collected but not analyzed. Data is captured in three main ways:

Data acquisition: obtaining existing data from outside sources.
Data entry: creating new data values from data inputted within the organization.
Signal reception: capturing data created by devices.

A distribution and range may be obtained for the data, which forms a natural bridge to the next step.

3. Plan Model (Explore/Transform Data)

You've collected your data in Step 2, and it may be structured (clearly defined, with patterns), unstructured, or semi-structured. Now it's time to load and explore the data at hand. Many techniques are available for loading data. A few examples:

ETL (Extract, Transform, and Load) transforms data using a set of business rules before loading it into a sandbox.
ELT (Extract, Load, and Transform) loads raw data into the sandbox, then transforms it.
ETLT (Extract, Transform, Load, Transform) has two levels of transformation. The first transformation is often used to eliminate noise.

If the data is skewed, looking at a logarithmic distribution (assuming all the data is positive) can help make sense of the data's underlying patterns. Take note of how many modes (peaks) your data has in this phase, as they can give you clues about the underlying populations. A unimodal (single-peaked) distribution may indicate a single population, while a multimodal (many-peaked) distribution indicates multiple sources. Dirty data can be filtered in this phase, or simply removed. In this stage, you might also use tools and techniques like aggregation, integration, and data scrubbing.

4. Build the Model

Building a model involves two phases:

Design the model: identify a suitable model (e.g. a normal distribution). This step can involve a number of different modeling techniques, including decision trees, regression techniques (like logistic regression), and neural networks.
Execute the model: run the model against the data to ensure that the model fits.

5. Communicate Results / Publish Insights

Typically, "communicating results" means sharing results within an organization, while "publishing" refers to making your results available to entities outside the organization. By publishing your insights, you're effectively making your results impossible to recall. For example, you might send your results out to the public in a market report, or you might just send them to one newspaper editor. Either way, it's impossible at the publishing stage to recall your findings. Data breaches or hacks also, unfortunately, fall under the umbrella of "publishing."

6. Operationalize / Measure Effectiveness

This final phase moves data from the sandbox into a live environment, where it is monitored and analyzed to see whether the model is producing the expected results. If the results aren't as expected, you can return to any of the preceding phases to tweak the data.
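The log-transform tip from phase 3 is easy to demonstrate in miniature. The sketch below (plain Python; the lognormal sample is simulated stand-in data, not from any real source) shows how taking logs of a strongly right-skewed, all-positive variable pulls its skewness back toward zero:

```python
import math
import random

random.seed(42)
# Simulated right-skewed measurements: lognormal data is a classic example
# of an all-positive, heavily skewed variable.
data = [random.lognormvariate(0, 1) for _ in range(1000)]

def skewness(xs):
    """Standardized third moment: ~0 for symmetric data, >0 for right skew."""
    n = len(xs)
    mean = sum(xs) / n
    sd = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return sum(((x - mean) / sd) ** 3 for x in xs) / n

raw_skew = skewness(data)
log_skew = skewness([math.log(x) for x in data])
print(f"skewness before log transform: {raw_skew:.2f}")
print(f"skewness after  log transform: {log_skew:.2f}")
```

After the transform the distribution is roughly symmetric, so features like modes and spread are much easier to read off a histogram.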


R-squared can help you answer the question "How does my model perform, compared to a naive model?" However, R-squared is far from a perfect tool. Probably the main issue is that every data set contains a certain amount of unexplainable variation. R-squared can't tell the difference between the explainable and the unexplainable, so if you keep "perfecting" your model by adding predictors to push R-squared up, you'll end up with misleading results and reduced precision.

Some of the other issues include:

R-squared is heavily dependent on the dataset you're feeding into it. While in some cases it can be useful to take a look at what your data is doing, in many real-world situations you want to know more than just what's in your data; you want predictions, and R-squared is not a predictive tool.
While R-squared can be an excellent tool for comparing against a naive model, sometimes you might want to know how your model compares to a true model instead, and R-squared isn't able to tell you that.
R-squared cuts both ways: a high value might be preferred in some cases, while a low value might be preferred in others. This can obviously get confusing.
Overfitting can be a huge issue. Overfitting is where too many predictors and higher-order polynomials in a model lead to random noise being modeled instead of the real trend.
R-squared for a model can go down even when the model is becoming a better approximation to the true model.

Pluses and Minuses of R-Squared

In general, R-squared is often the "go to" statistic because it's easy to use and understand. Practically all statistical software includes it, even basic tools like Excel's data analysis. You simply check a box, et voilà! The software gives you the percent of variation in y that your model explains. One of the main drawbacks is that you can keep adding terms to your model to increase it. Model not quite up to par? Add a few more terms and it will be. Add a few more, and any model, even a bad one, could hit 99%. Basically, if you don't know your data inside out (and for large sets of data, you probably won't), it can be challenging to know when to stop adding terms.

Alternatives

A perfect alternative to R-squared doesn't exist: every choice has its pluses and minuses.

1. Adjusted R-Squared

Adjusted R-squared is a correction for adding too many terms to the model. It will always be lower than R-squared and tends to be a better alternative. However, it suffers from many of the same pitfalls as plain old R-squared. Perhaps the most important drawback is that it isn't predictive; it simply deals with the data you feed into it.

2. Predicted R-Squared

Predicted R-squared (based on PRESS, the predicted residual sum of squares) avoids the "it only deals with the data at hand" problem. It gauges how well a model accounts for new observations. As well as its predictive capabilities, a key plus is that it can help prevent overfitting: if there is a large difference between your R-squared and predicted R-squared values, that's an indication you have too many terms in your model. A big minus: it is not widely available. At the time of writing, Excel doesn't include it, nor does SPSS (although they have published a workaround). A few options:

Minitab added predicted R-squared in Minitab 17.
In R, some packages (like DAAG) include predicted R-squared. PRESS can also be implemented as part of the leave-one-out cross-validation process (see Theophano Mitsa's post for more details).

3. Formula Tweaks

There are more than a few formulas for R-squared (defining them all is beyond the scope of this article, but if you're interested, see Kvalseth, as cited in Alexander et al.). A simple and common one is:

R² = 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)²

Some alternatives to this particular formula include using the median instead of the summation (Rousseeuw), or absolute values of the residuals instead of their squares (Seber).

More formula tweaks deal specifically with the problem of outliers. Having outliers in your data can pose a problem: least-squares R-squared for variable selection in a linear regression model is sensitive to outliers, and according to Croux & Dehon, the addition of just one single outlier can have a drastic effect. One alternative, R2LTS (Saleh), uses least trimmed squares; the author reports that this method is not sensitive to outliers. Rousseeuw uses an M-estimator to achieve a similar effect.

4. Simply Report the Statistics

Sometimes you just can't avoid reporting R-squared, especially if you're publishing a paper. If you can't avoid it, the alternative is to use it wisely. Alexander et al. suggest that you:

Get the R-squared value from test data using the above equation, not from a regression of observed on predicted values, and
"...simply report the R2 and RMSE or a similar statistic like the standard error of prediction for the test set, which readers are more likely to be able to interpret."

References

Why I'm not a fan of R-squared
Can we do better than r-squared?
Multiple Regression Analysis: Use Adjusted R-Squared and Predicted R-Squared to Include the Correct Number of Variables
Model selection via robust version of R-squared
Mitsa, T. Use PRESS, not R squared to judge predictive power of regression.
Beware of R2: simple, unambiguous assessment of the prediction accuracy of QSAR and QSPR models.
Kvalseth, T.O. Cautionary note about R². The American Statistician, November 1985.
Rousseeuw, P.J. Least Median of Squares Regression. J. Am. Stat. Assoc. 1984; 79:871–880.
Seber, G.A.F. Linear Regression Analysis. John Wiley & Sons, NY: 1977, p. 465.
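To make the relationship between plain, adjusted, and predicted R-squared concrete, here is a minimal from-scratch sketch for simple linear regression in plain Python. The toy data is made up for illustration; the predicted R-squared follows the leave-one-out PRESS construction described above.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = b0 + b1*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    return my - b1 * mx, b1  # (intercept, slope)

def r_squared(xs, ys):
    """R² = 1 - SS_res / SS_tot."""
    b0, b1 = fit_line(xs, ys)
    my = sum(ys) / len(ys)
    ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(xs, ys, p=1):
    """Penalizes extra terms; p = number of predictors (1 for a line)."""
    n = len(xs)
    return 1 - (1 - r_squared(xs, ys)) * (n - 1) / (n - p - 1)

def predicted_r_squared(xs, ys):
    """Leave-one-out: PRESS is the sum of squared deletion residuals."""
    my = sum(ys) / len(ys)
    press = 0.0
    for i in range(len(xs)):
        b0, b1 = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (b0 + b1 * xs[i])) ** 2
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - press / ss_tot

# Hypothetical toy data, roughly y = 2x plus noise:
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.3]

print("R-squared:          ", round(r_squared(xs, ys), 4))
print("Adjusted R-squared: ", round(adjusted_r_squared(xs, ys), 4))
print("Predicted R-squared:", round(predicted_r_squared(xs, ys), 4))
```

Because each deletion residual is at least as large as the ordinary residual, predicted R-squared never exceeds plain R-squared; a wide gap between the two is the overfitting warning sign mentioned above.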
