This article was written by Jim Frost.

Regression is a very powerful statistical analysis. It allows you to isolate and understand the effects of individual variables, model curvature and interactions, and make predictions. Regression analysis offers high flexibility but presents a variety of potential pitfalls. Great power requires great responsibility!

In this post, I offer five tips that will not only help you avoid common problems but also make the modeling process easier. I'll close by showing you the difference between the modeling process that a top analyst uses and the procedure of a less rigorous analyst.

Tip 1: Conduct a Lot of Research Before Starting

Before you begin the regression analysis, you should review the literature to develop an understanding of the relevant variables, their relationships, and the expected coefficient signs and effect magnitudes. Developing your knowledge base helps you gather the correct data in the first place, and it allows you to specify the best regression equation without resorting to data mining.

Regrettably, large databases stuffed with handy data, combined with automated model-building procedures, have pushed analysts away from this knowledge-based approach. Data mining procedures can build a misleading model that has significant variables and a good R-squared using randomly generated data!

In my blog post, Using Data Mining to Select Regression Model Can Create Serious Problems, I show this in action. The output below is a model that stepwise regression built from entirely random data. In the final step, the R-squared is decently high, and all of the variables have very low p-values! Automated model-building procedures can have a place in the exploratory phase. However, you can't expect them to produce precisely the correct model.
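The stepwise output itself isn't reproduced here, but the underlying problem is easy to demonstrate. The sketch below (plain NumPy, not Frost's original stepwise code; the sample sizes are invented for illustration) fits an ordinary least-squares model with many candidate predictors to a purely random response. The R-squared looks impressive even though, by construction, no real relationship exists:

```python
import numpy as np

# Fit ordinary least squares to pure noise. With many candidate
# predictors and few observations, R-squared alone looks impressive
# even though no real relationship exists.
rng = np.random.default_rng(0)
n_obs, n_vars = 30, 25
X = rng.standard_normal((n_obs, n_vars))
y = rng.standard_normal(n_obs)                 # response is random noise

X1 = np.column_stack([np.ones(n_obs), X])      # add an intercept column
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
resid = y - X1 @ beta
r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
adj_r2 = 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_vars - 1)

print(f"R-squared: {r2:.3f}, adjusted R-squared: {adj_r2:.3f}")
```

Note how adjusted R-squared, which penalizes the parameter count, is much harder to fool on the same data.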
For more information, read my Guide to Stepwise Regression and Best Subsets Regression.

Tip 2: Use a Simple Model When Possible

It seems that complex problems should require complicated regression equations. However, studies show that simplification usually produces more precise models.* How simple should the models be? In many cases, three independent variables are sufficient for complex problems.

The tip is to start with a simple model and then make it more complicated only when it is truly needed. If you make a model more complex, confirm that the prediction intervals become more precise (narrower). When you have several models with comparable predictive abilities, choose the simplest, because it is likely to be the best model. Another benefit is that simpler models are easier to understand and explain to others!

As you make a model more elaborate, the R-squared increases, but it becomes more likely that you are customizing the model to fit the vagaries of your specific dataset rather than actual relationships in the population. This overfitting reduces generalizability and produces results that you can't trust. Learn how both adjusted R-squared and predicted R-squared can help you include the correct number of variables and avoid overfitting.

Tip 3: Correlation Does Not Imply Causation . . . Even in Regression

Correlation does not imply causation. Statistics classes have burned this familiar mantra into the brains of all statistics students! It seems simple enough. However, analysts can forget this important rule while performing regression analysis. As you build a model that has significant variables and a high R-squared, it's easy to forget that you might only be revealing correlation. Causation is an entirely different matter. Typically, to establish causation, you need to perform a designed experiment with randomization.
If you're using regression to analyze data that weren't collected in such an experiment, you can't be certain about causation.

Fortunately, correlation can be just fine in some cases. For instance, if you want to predict the outcome, you don't always need variables that have causal relationships with the dependent variable. If you measure a variable that is related to changes in the outcome but doesn't influence the outcome, you can still obtain good predictions. Sometimes it is easier to measure these proxy variables. However, if your goal is to affect the outcome by setting the values of the input variables, you must identify variables with truly causal relationships.

For example, if vitamin consumption is only correlated with improved health but does not cause good health, then altering vitamin use won't improve your health. There must be a causal relationship between two variables for changes in one to cause changes in the other.

To read the rest of the article, click here.
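To make the vitamin example concrete, here is a small simulation (illustrative only; the variable names and effect sizes are invented). A hidden common cause drives both vitamin use and health, so vitamin use predicts health well even though, by construction, it has no causal effect on it:

```python
import numpy as np

# Toy simulation: a hidden factor (say, health-consciousness) drives
# BOTH vitamin use and a health score. Vitamin use then predicts
# health well even though it has no causal effect on it.
rng = np.random.default_rng(42)
n = 5000
health_conscious = rng.standard_normal(n)            # hidden common cause
vitamins = health_conscious + 0.5 * rng.standard_normal(n)
health = health_conscious + 0.5 * rng.standard_normal(n)  # vitamins absent

corr = np.corrcoef(vitamins, health)[0, 1]
print(f"correlation(vitamins, health) = {corr:.2f}")  # strongly positive
```

A model built on the vitamins variable would predict health scores nicely, yet intervening on vitamin use here would change nothing, which is exactly the distinction the article draws.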

This article was written by Graph Commons.

A common task for a data scientist is to identify clusters in a given data set. The idea is simply to find groups of objects that have more connections or similarities to one another than they do to outsiders. In the study of networks, we use clustering to recognize communities within large groups of connections.

Typically, a force-directed layout algorithm organizes a network map and makes patterns visually comprehensible, but it cannot identify and mark the clusters. Furthermore, in large network maps, the high level of detail overwhelms our senses. To be able to precisely examine a network's patterns, we need quantitative views of the data it contains. While there are a variety of data clustering methods in machine learning, the Louvain Modularity algorithm works particularly well for large data networks. It detects tightly knit groups characterized by a relatively high density of ties. Beyond the visual realm, you can use a Louvain clustering algorithm to partition an online social network of many millions of nodes across different machines.

Once the network clusters are detected, the identified groups of nodes can be given distinct colors and names, so they are clearly differentiated and together provide a summary of the larger network. We can label a cluster based on the commonalities of its nodes or on the most central nodes found in the grouping.

In Graph Commons, you can run clustering on your data networks using the Analysis bar. You first click the "Run Clustering" button, then set the resolution, which controls how granular the clusters the algorithm identifies should be. Once the clusters are found, they are automatically labelled based on the most connected node in each cluster. However, we strongly recommend that you rename these communities yourself to highlight what they represent in your context.
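Graph Commons runs Louvain clustering for you, but if you want to experiment with the same algorithm directly, the NetworkX library ships an implementation (`louvain_communities`, available in networkx 2.8 and later; the toy graph below is our own). Two tightly knit cliques joined by a single edge should come back as two communities:

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

# Two 5-node cliques joined by one bridging edge.
G = nx.Graph()
G.add_edges_from((i, j) for i in range(5) for j in range(i + 1, 5))
G.add_edges_from((i, j) for i in range(5, 10) for j in range(i + 1, 10))
G.add_edge(4, 5)  # the lone tie between the two groups

# The resolution parameter plays the same role as the granularity
# setting in the Graph Commons Analysis bar.
communities = louvain_communities(G, resolution=1.0, seed=0)
print([sorted(c) for c in communities])
```

Each returned community is a set of node ids; here Louvain recovers the two cliques.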
Finally, you can view the list of all the nodes that belong to a certain cluster and download it as a CSV file.

Cluster labels on the network map

In Graph Commons, you'll notice the cluster labels are also placed on the map visually. You can move them around and change their size to make the network more readable. When you mouse over a cluster label, it is highlighted, so you can clearly see its boundaries and where it is located in the larger picture. Cluster labels on the map provide an overview of a complex network that is otherwise hard to grasp visually.

Bridges between clusters

Within the clusters of a complex network, we often see a few nodes making connections to other clusters, unlike their neighbouring nodes, whose connections are only local, within their immediate cluster. Nodes that bridge connections among multiple clusters have high betweenness centrality. Such bridging nodes between two or more clusters become distinctly visible with the help of the network layout algorithms.

If we are analyzing a social network, these bridging people are well positioned to be information brokers, since they have access to information flowing in other clusters. They are the ones who carry the gossip from one group of people to another. They are in a position to combine the variety of knowledge and ideas found in multiple groups. On the other hand, bridging nodes are more likely to be a single point of failure: if a bridge person disappears, the formerly connected communities would disconnect.

To read the whole article, with illustrations, click here.
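The betweenness-centrality claim above is easy to check on a toy graph (our own example, using NetworkX rather than Graph Commons). In a "barbell" graph, two cliques connected through one middle node, the middle node scores highest, and removing it splits the network:

```python
import networkx as nx

# Two 5-node cliques (nodes 0-4 and 6-10) joined through node 5.
G = nx.barbell_graph(5, 1)
bc = nx.betweenness_centrality(G)
bridge = max(bc, key=bc.get)
print(f"bridge node: {bridge}")  # node 5: every cross-clique shortest path uses it

# Removing the bridge disconnects the two communities,
# illustrating the single point of failure.
G.remove_node(bridge)
print(nx.is_connected(G))  # False
```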

This article is on the blog artificialintelligenceml. It features the following applications:

Google's AI-Powered Predictions
Ridesharing Apps Like Uber and Lyft
Commercial Flights Use an AI Autopilot
Spam Filters
Smart Email Categorization
Plagiarism Checkers
Robo-readers to Grade Essays
Mobile Check Deposits
Fraud Prevention
Credit Decisions
Image Recognition on Social Networks
Search Optimization for Online Catalogs
Recommendation Engines (Yelp, Amazon)
Voice-to-Text Translation on Mobile Phones

Read the full article, with illustrations for each application, here.

This article was written by gk_.

Understanding how chatbots work is important. A fundamental piece of machinery inside a chatbot is the text classifier. Let's look at the inner workings of an artificial neural network (ANN) for text classification. We'll use 2 layers of neurons (1 hidden layer) and a "bag of words" approach to organizing our training data.

Text classification comes in 3 flavors: pattern matching, algorithms, and neural nets. While the algorithmic approach using Multinomial Naive Bayes is surprisingly effective, it suffers from 3 fundamental flaws:

- The algorithm produces a score rather than a probability. We want a probability so we can ignore predictions below some threshold. This is akin to a 'squelch' dial on a VHF radio.
- The algorithm 'learns' from examples of what is in a class, but not what isn't. Learning the patterns of what does not belong to a class is often very important.
- Classes with disproportionately large training sets can create distorted classification scores, forcing the algorithm to adjust scores relative to class size. This is not ideal.

As with its 'Naive' counterpart, this classifier isn't attempting to understand the meaning of a sentence; it's trying to classify it. In fact, so-called "AI chat-bots" do not understand language, but that's another story. If you are new to artificial neural networks, here is how they work. To understand an algorithmic approach to classification, see here.

Let's examine our text classifier one section at a time. We will take the following steps:

1. refer to libraries we need
2. provide training data
3. organize our data
4. iterate: code + test the results + tune the model
5. abstract

The code is here; we're using an iPython notebook, which is a super productive way of working on data science projects. The code syntax is Python. We begin by importing our natural language toolkit.
We need a way to reliably tokenize sentences into words and a way to stem words.

To read the whole article, with demonstration, click here.
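As a taste of where the full article goes, here is a compressed, self-contained sketch of the pipeline it describes. Everything here is illustrative: a crude lower-casing/suffix-stripping step stands in for NLTK's tokenizer and stemmer, the four training sentences are invented, and the network simply has the shape the article names (2 layers of weights, 1 hidden layer, trained on bag-of-words vectors):

```python
import numpy as np

def tokenize_and_stem(sentence):
    """Toy stand-in for NLTK tokenizing + stemming."""
    words = sentence.lower().replace("?", "").split()
    return [w[:-1] if w.endswith("s") else w for w in words]

# Invented training data: (sentence, class label).
training = [("is it open today", "hours"),
            ("what are your hours", "hours"),
            ("can you make a sandwich", "sandwich"),
            ("what sandwiches do you have", "sandwich")]

vocab = sorted({w for text, _ in training for w in tokenize_and_stem(text)})
classes = sorted({label for _, label in training})

def bag_of_words(sentence):
    """One slot per vocabulary word: 1.0 if present, else 0.0."""
    stems = tokenize_and_stem(sentence)
    return np.array([1.0 if w in stems else 0.0 for w in vocab])

X = np.array([bag_of_words(text) for text, _ in training])
Y = np.array([[1.0 if label == c else 0.0 for c in classes]
              for _, label in training])

# 2 layers of weights (1 hidden layer), sigmoid activations,
# plain batch gradient descent.
rng = np.random.default_rng(1)
W0 = rng.standard_normal((len(vocab), 8))
W1 = rng.standard_normal((8, len(classes)))
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

for _ in range(2000):
    h = sigmoid(X @ W0)                       # hidden layer activations
    out = sigmoid(h @ W1)                     # output layer activations
    d_out = (Y - out) * out * (1 - out)       # backpropagated error terms
    d_h = (d_out @ W1.T) * h * (1 - h)
    W1 += h.T @ d_out * 0.5                   # 0.5 = learning rate
    W0 += X.T @ d_h * 0.5

def classify(sentence):
    h = sigmoid(bag_of_words(sentence) @ W0)
    return classes[int(np.argmax(sigmoid(h @ W1)))]

print(classify("are you open today"))
```

Unlike the score from Naive Bayes, the sigmoid outputs here can be read as confidences and thresholded, which is the 'squelch dial' behaviour the article asks for.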