
In Part 1, I compared a few model evaluation techniques that fall under the umbrella of general statistical tools and tests. Here in Part 2 I compare three of the more popular model evaluation techniques for classification and clustering: the confusion matrix, the gain and lift chart, and the ROC curve. The main difference between the three techniques is that each focuses on a different type of result:

- Confusion matrix: false positives, false negatives, true positives and true negatives.
- Gain and lift: focus is on true positives.
- ROC curve: focus is on true positives vs. false positives.

That said, you'll want to choose a method that gives you the answers you need for the particular field you're in. For example, while a confusion matrix can be a great tool for comparing models, it isn't much good for marketing decisions (where the gain and lift chart would be a better choice). Other less popular (but still valid) tools include the K-S chart and the Gini coefficient.

Confusion Matrix

A confusion matrix, in predictive analytics, shows the rates of false positives, false negatives, true positives and true negatives for a test or predictor. In machine learning, a confusion matrix can be used to show how well a classification model performs on a set of test data. Correctly assigned values appear in their respective diagonal boxes:

- Negative values are correctly classified as negative (box a).
- Positive values are correctly classified as positive (box d).

Wrongly assigned observations are labeled as either false positives (box b) or false negatives (box c). The false positive rate, or proportion of negative cases incorrectly identified as positive, is calculated with the equation fpr = b/(a + b). The false negative rate tells us what proportion of positive cases were incorrectly labeled as negative.
The equation is fnr = c/(c + d). The overall accuracy of the prediction or test is defined as (a + d)/(a + b + c + d).

Gain and Lift Charts

Confusion matrices can give you a good idea of how effective your model is, and can also help you choose between multiple competing models. But sometimes you want to know how a particular model does with more data; for example, does a model perform better with 60% of the data than with 50%? This is where gain and lift charts come in. The following gains chart, run on a validation set, shows that with 50% of the data, the model captures 90% of targets; adding more data yields a negligible increase in the percentage of targets included. A lift chart shows you how much better your model performs compared to random selection. The "lift" is the ratio of results with and without the model; better models have higher lifts.

While the confusion matrix gives proportions between all negatives and positives, gain and lift charts focus on the true positives. One of their most common uses is in marketing, to decide whether a prospective client is worth calling. Gain and lift charts work with a sample (a fraction of the population); in comparison, a confusion matrix uses the whole population to evaluate a model.

ROC Curve

A Receiver Operating Characteristic (ROC) curve is another way to compare models. It is a plot of the true positive rate against the false positive rate. It's similar to the gain and lift chart, but instead of just true positives, the focus is on a graphical representation of true positives vs. false positives. In layman's terms, the closer the curve is to the top and left borders, the more accurate the model. If you're familiar with calculus (specifically, areas under the curve), the ideal model has an area of 1; a random model (with a 50% chance) is shown as the black diagonal on the graph. Also shown in the example above are two models, in blue and red.
The blue line represents the more accurate model, as it is closer to the top and left borders.

References

- 11 Important Model Evaluation Techniques Everyone Should Know
- How to evaluate classification models for business analytics - Part 1
- How to determine the best model?
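To make the definitions above concrete, here is a minimal Python sketch of the confusion-matrix rates and a trapezoid-rule AUC for the ROC curve. The example labels, scores, and the 0.5 cutoff are made-up illustrations of my own, not values from the article:

```python
# Minimal sketch of confusion-matrix rates and ROC/AUC using the
# a/b/c/d cell labels from the text. Example data is made up.

def confusion_counts(y_true, y_pred):
    """Return (a, b, c, d): true negatives, false positives,
    false negatives, true positives."""
    a = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # TN
    b = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # FP
    c = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # FN
    d = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # TP
    return a, b, c, d

def rates(a, b, c, d):
    fpr = b / (a + b)                      # false positive rate
    fnr = c / (c + d)                      # false negative rate
    accuracy = (a + d) / (a + b + c + d)   # overall accuracy
    return fpr, fnr, accuracy

def roc_points(y_true, y_score):
    """(FPR, TPR) pairs over every score threshold, sorted by FPR."""
    pts = []
    for thr in sorted(set(y_score)) + [max(y_score) + 1]:
        pred = [1 if s >= thr else 0 for s in y_score]
        a, b, c, d = confusion_counts(y_true, pred)
        tpr = d / (c + d) if (c + d) else 0.0
        fpr = b / (a + b) if (a + b) else 0.0
        pts.append((fpr, tpr))
    return sorted(pts)

def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

if __name__ == "__main__":
    y_true = [0, 0, 0, 0, 1, 1, 1, 1]
    y_score = [0.1, 0.3, 0.6, 0.2, 0.8, 0.7, 0.9, 0.4]
    y_pred = [1 if s >= 0.5 else 0 for s in y_score]  # arbitrary 0.5 cutoff
    a, b, c, d = confusion_counts(y_true, y_pred)
    fpr, fnr, acc = rates(a, b, c, d)
    print(f"fpr={fpr:.2f} fnr={fnr:.2f} accuracy={acc:.2f}")
    print(f"AUC={auc(roc_points(y_true, y_score)):.2f}")
```

An ideal model would give an AUC of 1.0 and a random model about 0.5, matching the diagonal described above.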


Monday newsletter published by Data Science Central. Previous editions can be found here. The contribution flagged with a + is our selection for the picture of the week.

Featured Resources and Technical Contributions

- Comparing Model Evaluation Techniques - Part 2: Classification and Clustering
- Data Science using your high school maths knowledge - Gradient Descent
- Linear and logistic regression in Excel and R - with RegressIt
- Free book: The Data Engineering Cookbook by Andreas Kretz
- Try our New Random Number Generator
- Visualizing and Animating Optimization Algorithms with Matplotlib
- Streamlining Predictive Modeling Workflow with Sagemaker and Essentia
- Building Neural Network in Keras
- The Mathematics behind Artificial Intelligence and Deep Learning
- Bayesian Machine Learning
- Scraping Nasdaq news using Python
- Breast Cancer Classification & Prediction using Neural Networks
- Question: Flag vectors to encode features (Python)

Featured Articles

- Thinking about Moving Up to Automated Machine Learning (AML)
- Value Chains vs Networks... Make Way for Co-creation Business Models
- Let's Add Coding to Writing in Mathematics
- Modern Microservice Applications: Foxes vs. pigeons vs. groundhogs
- Return of the Thriller "3 Horizons of Digital Transformation"
- +If not mean-variance finance, then what?
- Is Python Completely Object Oriented?
- A Technical Look at How Criminals Use AI
- 5 Reasons You Need a Better Data Management Solution
- R, without all that pesky code!
- How IoT Is Shaping the Agriculture Sector
- Smart Farming, or the Future of Agriculture

Picture of the Week

Source: article flagged with a +


Excel is often poorly regarded as a platform for regression analysis. The regression add-in in its Analysis Toolpak has not changed since it was introduced in 1995, and it was a flawed design even back then. (See this link for a discussion.) That’s unfortunate, because an Excel file can be a very good place in which to build regression models, compare and refine them, create high-quality editable tables and charts, share and present the results, and teach regression to those constituencies of students and practitioners for whom Excel is the only analytic tool they may ever use on a regular basis. Over the last 10 years I've developed an alternative, a free add-in called RegressIt, which is designed to take maximal advantage of the Excel environment and support good practices of data analysis. Its home page is regressit.com, and a set of slides that gives a helicopter tour of its features is here. I've used it for teaching an advanced course on regression and time series analysis to grad students in business and engineering, but it's intended for use in teaching at all levels and in applications. It was first released to the public in 2014 and has undergone major enhancements recently. I urge you to take a look and give it a test drive. It performs both linear and logistic regression in Excel, producing highly interactive model worksheets with well-designed outputs. It also has some novel tools for navigating the model space, keeping an audit trail, and providing instruction as the user goes along. The logistic model worksheets are particularly interesting: they include a lot of tables and charts with spinners that can be used to play with their parameters. For example, you can dial the cutoff value up and down after fitting a model, while watching what happens in classification tables and tracking your position on the ROC curve. 
This feature does not require the program to be running, so a single model worksheet is a self-contained demonstration tool for properties of a logistic model. See the Titanic example on the web site.And… it has an interface with R that allows R to be used as a computational engine for producing results in both environments. See this short video for a demonstration. This tool provides more analysis options and allows large data sets to be handled. The full dataset does not need to fit in Excel. It suffices to have matching variable names there. This allows Excel to provide a menu-driven front end for performing regression analysis in R that does not require the user to write any code. The outputs in R include some custom tables and charts that resemble the ones that Excel produces for the same models, and the output that R sends back to Excel has most of the same interactive features as the native Excel output (color coding of coefficients by sign and significance, sorting of coefficient tables, deletion of insignificant variables directly from the coefficient table, inclusion in the model comparison table and other audit trail views, etc.). The analysis options in R include a number of different kinds of testing (a fixed test set, iterated random test sets, k-fold cross-validation, and simultaneously fitting separate models to disjoint subsets of data), and they are accompanied by very detailed comparative statistics. Stepwise variable selection can also be combined with them. A nice (and often sobering) exercise is to run 10 or 20 iterations of a randomly chosen training set (say, 2/3 of the data) with stepwise variable selection in order to see what a range of models you may get.I hope that many of you will find this synthesis of Excel and R to be useful, even if its implementation is not elegant by R standards. The VBA code in the add-in writes a script file (which is verbose) and places a line of code to run the script on the clipboard. 
The user just needs to hit Ctrl-V and Enter in the console in RStudio in order to run the script, which will produce output there and also send it back to Excel. (The user also needs only to hit Ctrl-V and Enter in RStudio when importing data from Excel and loading packages. The entire R session can be controlled with those two keystrokes and mouse clicks. Of course, the user can also do more playing around with the model by typing code.) So, most of the program logic resides in Excel, and two-way communication with R takes place via the clipboard and text files. It uses about a dozen existing R packages for various options, rather than providing a new package of its own. The system of generating a separate script file for each model leaves an audit trail: there is a separate script behind every R model worksheet in the Excel file. Usually the scripts will never be looked at, though, because the models can be instantly re-created at any time from their worksheets in the Excel file, which is where the user will probably spend the most time. If you are someone who has never used R before, you can install it and start fitting regression models in RStudio with this tool in about 10 minutes; instructions are on the web site.

There are a lot of other distinctive features in the program. It keeps a multi-threaded audit trail that includes a journal-style model comparison table by default and the ability to search through models on the basis of parent-child relationships (for those fitted with Excel). It includes tools for evaluating and verifying the originality of work submitted by students, and it contains around 10,000 words of internal documentation and teaching notes. A regression model worksheet may contain very detailed teaching notes in the form of cell comments, and these could be customized to suit an instructor if you don't like mine. More details of the teaching aids are here.
It also has a deep menu of variable transformations and a descriptive analysis procedure that provides rich graphical output for many variables at once, including scatterplot matrices in which each element is an editable Excel chart. A very novel feature of the program (in fact, the most striking feature of the menu interface) is a set of tools for navigating among models, controlling the display of the output, and drilling down to see the layers of information stored within cells, which is designed to make it easy for users to explore the model space in a thoughtful fashion, look at the right outputs, and make good choices.

I invite you to take a look at it, regardless of your current level of fondness for Excel as a data analysis tool, and I welcome any feedback or questions. This is free software that I am offering as a public service, and it is intended to serve as a complement to, not necessarily a substitute for, whatever you and your colleagues and students may already be using for regression.
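The repeated random-training-set exercise mentioned above (10 or 20 iterations of a random 2/3 training set with stepwise selection) can be sketched in a few lines. This is a hedged illustration of the general idea in Python with synthetic data and a simple greedy forward selection, not RegressIt's own implementation (RegressIt drives R for this):

```python
# Sketch of the exercise described above: repeatedly draw a random 2/3
# training set, run greedy forward (stepwise-like) selection by
# validation error, and observe how the chosen variable set varies.
# All data and names here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, p = 300, 8
X = rng.normal(size=(n, p))
# Only variables 0 and 1 truly matter; the rest are noise.
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=1.0, size=n)

def val_error(train, test, cols):
    """Mean squared validation error of an OLS fit on the given columns."""
    Xtr = np.column_stack([np.ones(len(train))] + [X[train, j] for j in cols])
    Xte = np.column_stack([np.ones(len(test))] + [X[test, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)
    resid = y[test] - Xte @ beta
    return float(resid @ resid / len(test))

def forward_select(train, test):
    """Add variables one at a time while validation error keeps improving."""
    chosen, best = [], val_error(train, test, [])
    while True:
        candidates = [j for j in range(p) if j not in chosen]
        errs = {j: val_error(train, test, chosen + [j]) for j in candidates}
        j_best = min(errs, key=errs.get)
        if errs[j_best] >= best:  # stop when no candidate helps
            return chosen
        chosen, best = chosen + [j_best], errs[j_best]

selected_sets = []
for _ in range(10):  # 10 random 2/3 training splits
    idx = rng.permutation(n)
    train, test = idx[: 2 * n // 3], idx[2 * n // 3:]
    selected_sets.append(sorted(forward_select(train, test)))

for s in selected_sets:
    print(s)
```

The sobering part of the exercise is in the output: the truly relevant variables appear every time, while noise variables drift in and out depending on the random split.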



Before I kick off this new blog, I'm happy to announce the release of my 3rd book, "The Art of Thinking Like A Data Scientist". This book is designed to be a workbook – a pragmatic tool that you can use to help your organization leverage data and analytics to power your business and operational models. The book is jammed with templates, worksheets, examples and hands-on exercises, all composed to help reinforce and deploy the fundamental concepts of Thinking Like A Data Scientist. I hope you enjoy it!

The business models that exploit new sources of economic value creation and capture are constantly shifting, driven by technology changes. For example, traditional business models based upon the Value Chain are being replaced by business models fueled by interconnected networks. And eventually those network-based business models will be attacked by new business models that figure out new ways to identify, capture and operationalize new sources of customer, product, operational and market value.

So, let's take a stroll down memory lane to understand how business models have transformed to capture value. And we'll end the journey by teleporting into a future world where again new technologies are enabling new business model opportunities for economic value creation.

Value Creation Phase 1: Value Chain-based Business Models

Michael Porter introduced the Value Chain concept in 1980 to communicate how economic value was created. A value chain is a set of activities through which products or services pass, in which each activity adds more value to the end product or service. Put another way, a value chain is an inflexible series of sequential interconnected nodes supporting a singular purpose, whose execution is brittle and susceptible to breakage in any one of the interconnected nodes. See "Big Data MBA: Course 101A – Unit III" for more on Value Chain Analysis (see Figure 1).

Figure 1: Michael E.
Porter, "Competitive Strategy: Techniques for Analyzing Industries and Competitors"

Value Chain business models rose to prominence during the information revolution as a way to identify and capture new sources of value creation. The Value Chain concept fit perfectly into a business environment where manufacturers held ultimate power to dictate terms and conditions to value chain players including customers, distributors and suppliers. For example, Consumer Package Goods (CPG) companies spent significant sums on focus groups and research studies in order to gain superior insights into the behaviors of different customer segments (Soccer Moms, Country Squires, Yuppies, Vegetarians). With those insights in hand, CPG manufacturers dictated pricing, promotional activity and in-store placement to distributors and retailers, while setting the pricing, quality and logistics requirements for their suppliers. While this Value Chain model is still prevalent today (see Apple's control of the smartphone industry and its outsized 73% share of industry profits), the world is moving away from these directive, command-and-control value chain-based business models to network-based business models where players are capturing new sources of customer, product and operational value.

Value Chain-based Business Models Summary

Value Beneficiary? Manufacturer

How is Value Created and Captured?

- Analytics-driven reductionism to eliminate unnecessary, non-value-add tasks
- Automation to reduce or eliminate labor costs
- Outsourcing non-critical tasks to lower-cost countries
- Data management and analytics to drive operational excellence

Value Creation Phase 2: Network-based Business Models

The economic potential of centralized networks is best understood by looking at Metcalfe's Law.
Metcalfe's Law states that the value of a network grows with the number of unique possible connections it supports: a network of n nodes has n(n - 1)/2 unique possible connections, so value grows roughly with the square of the number of nodes (see Figure 2).

Figure 2: The Economics of Metcalfe's Law

Metcalfe's Law provides a framework for rethinking how value is created with a network-centric business model. In a network business model, economic value is captured by organizations that master the matching (codify and optimize) of customer usage patterns with product performance. Telcos, financial services organizations, social media companies and marketplace exchanges exploit the economics of centrally-controlled networks through the creation of detailed, granular views of individual customers. Instead of each customer fitting nicely into a single demographic segment, these companies are exploiting the granular views of each individual customer's behaviors, preferences, propensities, inclinations, tendencies, interests, associations and affiliations via Analytic Profiles.

Network-based Business Models Value Creation

Value Beneficiary? Network Owner

How is Value Created and Captured?

- Rapid growth of the network (first-mover advantage; block out competitors)
- Scaling through automation
- Cost management of massive volumes of granular, structured and unstructured data
- Data engineering and DataOps expertise to convert raw data into high-value, curated data
- Data science mastery to codify customer behaviors and product performance

Value Creation Phase 3: Value Chain to Network Business Model Transformation

Today we are seeing Value Chain-based business models morphing into Network-based business models through Products-as-a-Service (Xaas) business models. Instead of selling products, these companies are selling outcomes as a service (aircraft engine manufacturers selling Thrust-as-a-Service, or an industrial compressor company selling Air-as-a-Service).
The keys to Xaas business model success include (see Figure 3):

- Superior consumer product usage insights (product usage tendencies, inclinations, affinities, relationships, associations, behaviors, patterns and trends). Xaas players must be able to quantify and predict where, how and under what conditions the product will be used, and the load on that product, across numerous product usage dimensions including work type, work effort, time of day, day of week, time of year, local events, holidays, work week, economic conditions, weather, precipitation, air quality / particulate matter, water quality, remaining useful life, salvage value, etc.
- Superior product operational insights (product performance or operational tendencies, inclinations, affinities, relationships, associations, behaviors, patterns and trends) to support product operational excellence use cases including reduction of unplanned operational downtime, predictive maintenance optimization, repair effectiveness optimization, inventory cost reductions, parts logistics optimization, elimination of O&E inventory, consumables inventory optimization, energy efficiencies, asset utilization, technician retention, remaining useful life, predicted salvage value, etc.
- A superior data and instrumentation strategy: knowing what data is most important for which use cases, and where to place sensors, RTUs and other instrumentation devices in order to capture that data, so as to balance the costs of false negatives (from lack of instrumentation) against the costs of false positives (from too much instrumentation).

Figure 3: Xaas Business Model: Economics Meets Analytics

Check out "Xaas Business Model: Economics Meets Analytics" for more details on Xaas.

Xaas-based Business Models Summary

Value Beneficiary?
Manufacturer

How is Value Created and Captured?

- Data management and instrumentation strategy and DataOps / Data Engineering expertise to capture customer usage and product performance data
- Superior analytics (Data Science) to codify user usage pattern insights in order to determine optimal pricing and service level agreements
- Superior analytics (Data Science) to codify product performance insights in order to drive product and operational excellence

What's Next? Co-creation Business Models?

As the complexity of business and operational environments continues to grow, the amount of data explodes and new digital technologies (IoT, 5G, AI/ML, Autonomous Vehicles, AR/VR, 3D Printing, Blockchain) are thrust upon us, some industrial companies are banding together with information technology providers to jointly create and commercialize new business opportunities. Heck, I think any business objective with "smart" as an adjective – smart cities, smart factories, smart airports, smart oil fields, smart theme parks, smart hospitals – is a candidate for co-creation efforts between industrial and technology partners.

This will be more than today's Systems Integrator-to-Customer project-based relationships that yield one-off solutions (and where the vast majority of project costs and risks are borne by the customer). A true Co-creation business model is a market-based collaboration in which industrial companies and technology providers "get married" by integrating digital capabilities with the customer's operational expertise to create new market opportunities.
It's about both parties mastering the economic value of digital assets like data, analytics and apps. To be successful at co-creation, the following characteristics need to exist:

- A common vision about the sources of value creation
- A common language and process in order to productize and operationalize subject matter expertise
- Organizational improvisation – yes, elephants dancing on the head of a pin – in order to react to the inevitable demons and monsters that appear along the journey
- An open and sharing culture that ultimately leads to a shared culture of trust

Wait, that sounds very familiar... Check out my blog "Scaling Innovation: Whiteboards versus Maps" for more details. Watch this space for more details as we begin to work with more and more companies to perfect this mutually beneficial, win-win-win Co-creation business model.

Value Chains, Networks and Co-Creation Business Models Summary

- Value Chain-based business models thrived when manufacturers could exert control over value creation by dictating terms and conditions to customers, distributors and suppliers.
- Metcalfe's Law introduced the disruptive power of interconnected networks and the potential for new sources of economic value creation.
- Today, Value Chain business models are being challenged by Network-based business models that have found new ways to create new sources of economic value.
- Network-based business models create economic value by mastering the matching (codify, optimize) of customer usage patterns with product performance insights.
- Companies with Value Chain-centric business models are trying to morph into Network-based business models by embracing Product-as-a-Service (Xaas) capabilities, which requires more advanced data and analytics strategies.
- New business models are on the horizon, one of which is the Co-creation business model, which drives OT + IT collaboration to advance beyond project-based dating to market-creating marriages.

One of my most memorable experiences during my college years was having an English professor who spoke entirely in complete sentences. There were no spoken sentence fragments, dangling thoughts or misuses of adjectives, verbs or nouns. When I first met him, I was stunned at his level of commitment to this principle. I often wondered how challenging this must be for him, and whether the journey had instances where a thought started and had to be gracefully abandoned into a mundane compromise of a sentence. It was probably also difficult not to have the verbal equivalent of a delete key. How was this possible, if not rehearsed?

I immediately started considering what level of commitment it would take to do the same in mathematics. Can one speak in complete mathematical sentences? What are the rules for thinking in math? Do the same rules apply for writing in mathematics? Furthermore, what are the rules for writing in mathematics? If learning math is like learning a foreign language because of its various symbols and algorithmic rules, are there opportunities that teachers can leverage to give students greater access to communicating their thinking and ideas in mathematics? One fact is certain: the way that mathematics is currently taught often lacks context and motivation, and alienates students. Mathematicians need to allow students various entry points to the content. Simply providing an equation, formula, graph, illustration or text without clear explanations furthers the myth that mathematics is inaccessible to the masses.

First Writing in Mathematics

For most students, the first writing experience in mathematics typically involves the writing of numbers. Tally marks provide a nice transition to the notion of keeping track of and summarizing the concept of 'number'. Then there is often a leap to the algorithmic thinking of addition, subtraction and multiplication, solely represented as setting up the problem in a traditional format.
After working with students of all ages, it has become abundantly clear that this approach endorses algorithmic thinking in lieu of conceptual understanding. In fact, having students write about the process of using the algorithms and describing how they work does more for their conceptual understanding than just solving problems. Examples of math sentences at an early age include: 1 + 2 = 3, 5 x 4 = 20 and 12 - 7 = 5. In its simplest form, a math sentence reflects a math fact, uses simple operators like {+, -, x} and includes at least two operands. More importantly, having students write to describe what they are doing holds much more value. In addition, changing the problem slightly by making one of the operands a question mark opens up their thinking to include multiple answers. For example: ? - 4 = 17, 50 - ? = 23 or ? x ? = 34 all promote deeper-thinking tasks to write about. Beyond this notion of mathematical sentences lies a deeper notion of mathematical thinking. Therefore, it's less about sentence structure and more about how we can develop the ability of students to think broadly about problem solving.

Mathematical Problem Types

Although word problems in mathematics can be introduced in the elementary grades, the middle grades tend to spend quite some time developing students' skills on these types of problems. These skills can best be described as the ability to:

- Identify important information in word problems
- Determine variables and constants
- Understand what the problem is asking
- Identify relationships between variables
- Construct an equation that would lead to a solution
- Solve the given algebraic equation

These are all very important skills to develop and hone. However, they also underscore the notion of a single approach to problem-solving which leads to a single solution. This is contrary to how mathematicians approach real-world problem-solving.
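As a small illustration of how an open-ended prompt such as ? x ? = 34 invites multiple answers, a few lines of code can enumerate every whole-number possibility. This sketch is my own addition, not from the original text:

```python
# Enumerate every whole-number pair (a, b) with a * b == n,
# showing that a prompt like "? x ? = 34" has several valid
# answers rather than a single one.
def factor_pairs(n):
    pairs = []
    for a in range(1, n + 1):
        if n % a == 0:
            pairs.append((a, n // a))
    return pairs

print(factor_pairs(34))  # [(1, 34), (2, 17), (17, 2), (34, 1)]
```

A student who writes this, runs it, and then explains why the pairs come in mirrored twos is doing exactly the kind of writing-about-mathematics the article advocates.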
For mathematicians, a considerable amount of time is spent writing and reflecting on multiple ways to get to a variety of solutions based on what's being optimized. Asking the right questions goes a long way toward finding a solution, and this approach yields multiple solutions based on the many decisions that must be made along the way. Given these facts, here is an illustration of how students spend their time writing and thinking about mathematics versus how mathematicians do the same.

Rigor in mathematical writing and thinking increases from left to right. Note that some students have K-12 experiences that focus on obtaining a single solution and never move beyond the word problem stage, whereas mathematicians approach problem solving by spending considerable time detailing and describing possible mathematical models, often leading to multiple solutions, effectively communicating their solutions, and publishing their work for others to review. (Adapted from SIAM, "Guidelines for Assessment & Instruction in Mathematical Modeling Education".)

Application problems are typically tied to a specific field or context and might contain real-world data for students. However, the problems are still closed in the sense that many of the decisions, variables and values have already been pre-determined. An application problem may take the form of a pre-determined equation for how a ball behaves when thrown or dropped, using real-world measurements to test the model. Mathematical modeling problems are different in that most of the decision-making (how to construct the model, which variables to use, which data to collect, and what is important) is left up to the student. The student quickly understands that there may be multiple solutions and that the quality of the solution will be important. Questions begin to emerge about the solution and approach, such as: how close is my solution to the performance of the phenomenon?
Can I use this model to make predictions about future or past behavior? Can this model solve other similar phenomenon-based problems? The writing and thinking involved in modeling problems far exceed the writing and thinking involved in applying an algorithm to solve a computational problem.

Coding as Writing in Mathematics

If writing is an expression of thinking, then it's time to allow every student of mathematics to learn how to write code. Coding is now considered an essential skill in developing algorithmic thinking. It is not just suitable for engineering, technology or math majors; it is an essential skill for all students learning problem-solving. When I went to high school, coding was taught in my math courses, and if we were lucky, we would have a separate computer class to learn various languages. Simply put, you can't fully grasp some mathematical concepts without experimentation and investigation, and coding allows for this. Not allowing students to code in mathematics is akin to teaching art without a canvas and brushes, or teaching writing without a pen. Coding should be viewed as a tool for expressing mathematical ideas. It is a form of thinking and writing that is essential for understanding mathematics.

Writing Tools

The gold standard for typesetting in mathematics is LaTeX. It is a markup language, similar to HTML, that is used by math researchers across the globe. When it comes to the tools available for writing mathematics, there is a huge challenge. Many of the applications that are available, including TeX and LaTeX, are simply inaccessible to teachers and students. Microsoft's Equation Editor within the Word application makes it cumbersome for students to search for and use the various operators and symbols needed to communicate their ideas. Most math writing applications have a significant learning curve for both teachers and students.
Part of the challenge lies in the fact that math has very complicated text requirements. A writing tool for mathematics needs to allow for creating graphs, tables, special characters, special operators and formatted text. Toward the goal of making writing in mathematics more accessible, many companies are following what was initially proposed by Wolfram's Mathematica many years ago: a tool that allows students to collect their thoughts, graphs, scripts, algorithms and formatted text into a single document, called a notebook, within the application. Several other companies are following suit. Examples include Jupyter Notebook for Python scripts, MATLAB Live Editor from MathWorks, Scientific Notebook by MacKichan Software and RStudio's R Markdown. Many of these programs also allow for embedding TeX or LaTeX scripting as part of their feature set.

Conclusion

I still can't do what my college English professor did many years ago. But with computers, I can think more broadly about problem solving in a variety of domains. It's time for a change! Let's make coding a requirement in every math course. We can begin by supporting university teacher preparation programs in developing a coding curriculum that is accessible for all teachers. Teachers need to develop the skill, will and capacity to learn the relevance and importance of coding. Having taught History of Mathematics courses at the university level, what is astonishing to me is the organic nature of the development of mathematics. The complex mathematical rules that govern the universe were developed many centuries ago from a simple need to describe the concept of a number. It was only through an iterative process of experimentation, modeling, trial and error, approximation and documentation that we as humanity were able to make the many leaps in science, technology, engineering and mathematics.
Computing technology now provides us with tools that can expedite these experiences and learnings. We should embrace these tools and take mathematics to a new level by empowering students to write about their experiences. Let's also not forget the power of that incredible, easily accessible technology known as pen and paper.