]]>

Monday newsletter published by Data Science Central. Previous editions can be found here. The contribution flagged with a + is our selection for the picture of the week. To subscribe, follow this link.

Featured Resources and Technical Contributions
Cross Validation in One Picture +
Create Transformed Polygons Using the Covariance Matrix
Data Science Job in 90 days - Book Summary
Free Book: Foundations of Data Science (from Microsoft Research Lab)
Deep Learning Explainability: Hints from Physics
How to Install and Run Hadoop on Windows for Beginners
29 Statistical Concepts Explained in Simple English - Part 13
A Complete Machine Learning Project Walk-Through in Python
Free Textbook: Probability Course, Harvard University (Based on R)
19 Courses (MOOC) on Math & Stats for Data Science & Machine Learning
An Introduction to Python Virtual Environment

Forum Questions
Question: Recommendation system evaluation
Question: Record linkage (unsupervised learning)
Question: Statistical significance match with multi-variables
Question: Scanned documents OCR data cleaning
Question: Traffic/commute data and processing
Question: Moments of Order Statistics
Question about the big O notation
Question: Data science audio book for layperson

Featured Articles and Forum Questions
Xaas Business Model: Economics Meets Analytics
Should You Be Recommending Deep Learning Solutions in Your Company?
Profiling Store Visitors
Frame a problem as a machine learning problem or otherwise
What is Data Lake and How to Improve Data Lake Quality
Artificial Intelligence Continues Where Analytics Ends?
Implementing Knowledge Graphs in Enterprises - Some Tips and Trends
Price Forecasting: Electricity, Flights, Hotels, Real Estate, and Stock Pricing
Technology Use as a Function of Device Type
Unsupervised learning and its role in the knowledge discovery process
Prediction of Customer Churn with Machine Learning

Picture of the Week
Source: article flagged with a +

To make sure you keep getting these emails, please add mail@newsletter.datasciencecentral.com to your address book or whitelist us.

The covariance matrix has many interesting properties, and it can be found in mixture models, component analysis, Kalman filters, and more. Developing an intuition for how the covariance matrix operates is useful in understanding its practical implications. This article will focus on a few important properties, associated proofs, and then an interesting practical application, namely extracting transformed polygons from a Gaussian mixture's covariance matrix.

I have often found that research papers do not specify the matrices' shapes when writing formulas. I have included this and other essential information to help data scientists code their own algorithms.

Sub-Covariance Matrices

The covariance matrix can be decomposed into multiple unique (2x2) covariance matrices. The number of unique sub-covariance matrices is equal to the number of elements in the lower half of the matrix, excluding the main diagonal. A (DxD) covariance matrix will have D*(D+1)/2 - D unique sub-covariance matrices. For example, a three-dimensional covariance matrix is shown in equation (0).

It can be seen that each element in the covariance matrix is the covariance between an (i,j) dimension pair. Equation (1) shows the decomposition of a (DxD) covariance matrix into multiple (2x2) covariance matrices. For the (3x3) case, there will be 3*4/2 - 3, or 3, unique sub-covariance matrices.

Note that generating random sub-covariance matrices might not result in a valid covariance matrix. The covariance matrix must be positive semi-definite, and the variance of each dimension in a sub-covariance matrix must be the same as the corresponding variance on the diagonal of the covariance matrix.

Positive Semi-Definite Property

One of the covariance matrix's properties is that it must be a positive semi-definite matrix. What positive semi-definite means and why the covariance matrix is always positive semi-definite merits a separate article. 
In short, a matrix, M, is positive semi-definite if the operation shown in equation (2), z.T*M*z, results in a value greater than or equal to zero for every vector z.

M is a real-valued (DxD) matrix and z is a (Dx1) vector. Note: the result of this operation is a (1x1) matrix.

A covariance matrix, M, can be constructed from the data with the operation M = E[(x-mu).T*(x-mu)], where x is a (1xD) observation and mu is the (1xD) mean. Inserting M into equation (2) leads to equation (3): the quantity z.T*(x-mu).T*(x-mu)*z is the square of the scalar (x-mu)*z, so its expectation cannot be negative. It can be seen that any matrix that can be written in the form M.T*M is positive semi-definite. The full proof can be found here.

Note that the covariance matrix does not always describe the covariation between a dataset's dimensions. For example, the covariance matrix can be used to describe the shape of a multivariate normal cluster in Gaussian mixture models.

Geometric Implications

Another way to think about the covariance matrix is geometrically. Essentially, the covariance matrix represents the direction and scale of the spread of the data. To understand this perspective, it is necessary to understand eigenvalues and eigenvectors.

Equation (4) shows the definition of an eigenvector and its associated eigenvalue. The next statement is important in understanding eigenvectors and eigenvalues: z is an eigenvector of M if the matrix multiplication M*z results in the same vector, z, scaled by some value, lambda. In other words, we can think of the matrix M as a transformation matrix that does not change the direction of z, or z is a basis vector of matrix M.

Lambda is the (1x1) eigenvalue scalar, z is the (Dx1) eigenvector, and M is the (DxD) covariance matrix. A positive semi-definite (DxD) covariance matrix will have D eigenvalues and D eigenvectors, which can be stacked as the columns of a (DxD) matrix. The eigenvector with the largest eigenvalue points in the direction of highest spread of the data, all eigenvectors are orthogonal to each other, and all eigenvectors are normalized, i.e. they have unit length, so each component lies between -1 and 1. 
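The properties discussed so far can be checked numerically. The following is a minimal sketch, assuming NumPy and a small synthetic (Nx3) dataset; the helper names `sub_covariances` and `is_positive_semidefinite` are hypothetical, and the sample covariance (divided by N-1) stands in for the expectation in the text.

```python
import numpy as np


def sub_covariances(cov):
    """Map each (i, j) pair with i < j to its (2x2) sub-covariance matrix."""
    d = cov.shape[0]
    return {
        (i, j): cov[np.ix_([i, j], [i, j])]
        for i in range(d)
        for j in range(i + 1, d)
    }


def is_positive_semidefinite(m, tol=1e-10):
    """True if m is symmetric and no eigenvalue is below -tol."""
    return bool(np.allclose(m, m.T) and np.all(np.linalg.eigvalsh(m) >= -tol))


rng = np.random.default_rng(0)
x = rng.normal(size=(500, 3))                 # (N x D) toy data, illustrative only
x[:, 1] += 0.8 * x[:, 0]                      # make dimensions 0 and 1 covary
centered = x - x.mean(axis=0)
cov = centered.T @ centered / (len(x) - 1)    # (D x D) sample covariance

# Number of unique sub-covariance matrices: D*(D+1)/2 - D
d = cov.shape[0]
subs = sub_covariances(cov)
print(len(subs), d * (d + 1) // 2 - d)

# Quadratic form z.T*M*z for one random (D x 1) vector z
z = rng.normal(size=(d, 1))
quad_form = (z.T @ cov @ z).item()
print(is_positive_semidefinite(cov), quad_form >= 0)

# Eigen-decomposition: eigh returns eigenvalues in ascending order,
# so the last column is the direction of highest spread.
eigvals, eigvecs = np.linalg.eigh(cov)
top = eigvecs[:, -1]
print(np.allclose(cov @ top, eigvals[-1] * top))
```

Checking one random z does not prove positive semi-definiteness; the eigenvalue test in `is_positive_semidefinite` is the stronger check here.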
Equation (5) shows the vectorized relationship between the covariance matrix, eigenvectors, and eigenvalues.

S is the (DxD) diagonal scaling matrix, whose diagonal values correspond to the eigenvalues and represent the variance along each eigenvector. R is the (DxD) rotation matrix that represents the direction of each eigenvector.

The eigenvector and eigenvalue matrices are represented, in equations (6) and (7), for a unique (i,j) sub-covariance matrix. The sub-covariance matrix's eigenvectors, shown across the columns of equation (6), have one parameter, theta, that controls the amount of rotation between each (i,j) dimension pair. The covariance matrix's eigenvalues lie on the diagonal of equation (7) and represent the variance of each dimension. It has D parameters that control the scale of each eigenvector.

The Covariance Matrix Transformation

A (2x2) covariance matrix can transform a (2x1) vector by applying the associated scale and rotation matrix. The scale matrix must be applied before the rotation matrix, as shown in equation (8).

The vectorized covariance matrix transformation for an (Nx2) matrix, X, is shown in equation (9). The matrix X must be centered at (0,0) so that the points are scaled and rotated about their own center. If X is not centered, the points are still rotated around the origin, which displaces the whole shape instead of rotating it in place.

An example of the covariance transformation on an (Nx2) matrix is shown in Figure 1. More information on how to generate this plot can be found here.

Please see this link to see how these properties can be used to draw Gaussian mixture contours and create non-Gaussian, polygon, mixture models.
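The scale-then-rotate transformation can be sketched as follows, assuming NumPy. The covariance values and the unit square are hypothetical, and the scale matrix here holds the square roots of the eigenvalues (standard deviations) so that T*T.T reproduces the covariance matrix; that choice is an assumption of this sketch, since the article's equations (8) and (9) are not reproduced above.

```python
import numpy as np

# Hypothetical (2x2) covariance matrix, chosen only for illustration
cov = np.array([[3.0, 1.2],
                [1.2, 1.0]])

eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvectors are the columns of eigvecs
R = eigvecs.copy()
if np.linalg.det(R) < 0:
    R[:, 0] *= -1                        # flip one eigenvector so R is a proper rotation
S = np.diag(np.sqrt(eigvals))            # assumption: scale by standard deviations
T = R @ S                                # T @ v scales v first, then rotates it

# Rotation angle encoded by R = [[cos, -sin], [sin, cos]]
theta = np.arctan2(R[1, 0], R[0, 0])

# (N x 2) polygon: a unit square, centered at (0, 0) before transforming
square = np.array([[-0.5, -0.5], [0.5, -0.5], [0.5, 0.5], [-0.5, 0.5]])
square = square - square.mean(axis=0)
transformed = square @ T.T               # vectorized transform, one point per row

print(np.allclose(T @ T.T, cov))         # the transform reproduces the covariance
print(transformed)
```

Because each row of `square` is a point, the row-wise form `square @ T.T` is equivalent to applying `T` to every (2x1) column vector separately.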

]]>

]]>

I am building a matching algorithm using ML. The project is to match internal customer data with external customer data. The features are name, address, city, state and zip. We create pairs between the data sets, calculate the cosine similarity for each feature, and then pass the cosine values for all feature pairs to a Gaussian mixture model. We started with 2 clusters, expecting one match cluster and one no-match cluster. However, the model does not build a single match cluster; matches end up in both clusters. Before passing the data to the model, I apply a standard scaler and a min-max scaler, but I still don't get clear no-match and match clusters. If we increase the number of clusters, the same thing happens. A match could be a high cosine similarity in name, address, state, city and zip, or in name, address and zip, or in any other combination. We are dealing with a huge volume of data, so we are using Spark ML. How can we achieve optimal clustering?
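For readers following along, the feature-building step described in the question might look roughly like the sketch below. It is plain Python rather than Spark, the question does not say how the strings are turned into vectors, so the character-bigram counts are an assumption, and the records and the helper names `char_ngrams`, `cosine_similarity` and `pair_features` are hypothetical.

```python
import math
from collections import Counter

FIELDS = ["name", "address", "city", "state", "zip"]


def char_ngrams(text, n=2):
    """Count character n-grams of a lower-cased string (assumed vectorization)."""
    text = text.lower().strip()
    if len(text) < n:
        return Counter([text]) if text else Counter()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between the n-gram count vectors of two strings."""
    va, vb = char_ngrams(a), char_ngrams(b)
    dot = sum(va[k] * vb[k] for k in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def pair_features(internal, external):
    """One similarity per field; this vector is what the mixture model would see."""
    return [cosine_similarity(internal[f], external[f]) for f in FIELDS]


# Hypothetical record pair
internal = {"name": "John Smith", "address": "12 Main Street", "city": "Boston", "state": "MA", "zip": "02110"}
external = {"name": "Jon Smith", "address": "12 Main St", "city": "Boston", "state": "MA", "zip": "02110"}
print(pair_features(internal, external))
```

Each candidate pair becomes a 5-dimensional vector of similarities in [0, 1], one per field.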

]]>

Digital capabilities leverage customer, product and operational insights to digitally transform business models. And nowhere is this more evident than the rush by industrial companies to digitally transform consumption models by transitioning from selling products to selling [capabilities]-as-a-service (thus, Xaas). For example:

The key issue for the airlines is to maximize their core revenue-generating mechanisms: flight scheduling and the hours that the airplane is actually flying. So instead of looking at the features of the jet engine, GE turned its attention to helping airlines generate more revenue; GE moved from selling engines to offering Thrust (engines)-as-a-service[1].

Kaeser Kompressoren, which manufactures large air compressors, leverages sensors on its equipment to capture product usage, performance and condition data from the machines. Kaeser leveraged the product and operational insights gained from these data sources to start selling air by the cubic meter through compressors that it owns and maintains: compressed Air-as-a-service[2].

But let’s be honest, anyone can create an Xaas business model. The key is not creating an Xaas business model; the key is creating a profitable Xaas business model. That means that organizations moving to an Xaas business model must master operational excellence (remote monitoring, sensors, predictive maintenance, first time fix, inventory optimization, technician scheduling, asset utilization), pricing perfection and meeting agreed-upon customer Service Level Agreement (SLA) requirements to ensure Xaas business model success. 
Loss of Manufacturers’ “Pricing Inefficiency” Advantage

Xaas business models eliminate manufacturer or producer pricing advantages where people and consuming organizations were willing to over-pay for capacity; that is, consumers or consuming organizations often bought more than they actually needed for convenience, safety stock and/or to support maximum demand requirements (Christmas shopping). Take buying a car as an example. On average, a car is used less than 5% of the time[3]. However, consumers buy this expensive asset (average car price = $34,000 in 2019[4]) that sits unused over 95% of the time. This elimination of the producer pricing advantage (also known as Producer Surplus) is part of the economics behind the growing success of ride-sharing services. If consumers or consuming organizations only need to pay for what they use (outside of a customary monthly minimum), then the manufacturers/producers stand to lose considerable revenue from these consumers and consuming organizations that historically have over-bought capacity.

Xaas service models can be a huge win for consumers and consuming organizations because they 1) avoid large, upfront capital expenditures (CapEx) while 2) only paying for what they use. And as with most important business factors, a lot of the impact of Xaas can be explained by basic economic theories.

Economics and the Laws of Supply and Demand

The law of supply and demand explains the interaction between the producers of a resource and the consumers of that resource. The theory defines the effect that the relationship between the availability of a particular product and the desire (or demand) for that product has on its price. 
For the vast majority of goods and services, an increase in price will lead to a decrease in the quantity demanded (see Figure 1).

Figure 1: Laws of Supply and Demand

The interaction of the Supply and Demand Curves has two relevant off-shoots from an Xaas perspective: Consumer Surplus and Producer Surplus (see Figure 2).

Figure 2: Source “Producer Surplus”

Key aspects of Figure 2 are:

Consumer Surplus, highlighted in red, is the monetary gain obtained by consumers because they are able to purchase a product for a price that is less than the highest price that they would be willing to pay. Benefit: Consumer.

Producer Surplus, highlighted in blue, is the amount that producers benefit by selling at a market price that is higher than the lowest price for which they would be willing to sell; this is roughly equal to profit. Benefit: Producer.

With the Producer Surplus, the Producer benefits from imperfections in the granularity of the capabilities being bought, which forces consumers to over-buy these capabilities just in case they might need them in certain situations. This critical producer advantage is lost in an Xaas consumption model. As an example, let’s look at the impact that ride-sharing services are having on the automobile manufacturing industry.

Ride-sharing’s Impact on Automobile Manufacturing Industry

To better understand the potential impact of Xaas business models on manufacturers, let’s look at the impact that ride-sharing services (Uber, Lyft) have had on automobile manufacturers. 
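As a rough numeric illustration of the Consumer Surplus and Producer Surplus definitions, here is a minimal sketch that assumes straight-line demand and supply curves. The linear form, the function name `linear_market_surplus` and all the numbers are hypothetical; they are not taken from the figures.

```python
def linear_market_surplus(a, b, c, d):
    """Equilibrium and surpluses for linear curves (an illustrative assumption).

    Demand: P = a - b*Q   (a = highest price any consumer would pay)
    Supply: P = c + d*Q   (c = lowest price any producer would accept)
    """
    if a <= c or b <= 0 or d <= 0:
        raise ValueError("need a > c and positive slopes for a market to exist")
    q_eq = (a - c) / (b + d)                    # quantity where the curves cross
    p_eq = a - b * q_eq                         # market price
    consumer_surplus = 0.5 * (a - p_eq) * q_eq  # triangle above the price, under demand
    producer_surplus = 0.5 * (p_eq - c) * q_eq  # triangle below the price, above supply
    return q_eq, p_eq, consumer_surplus, producer_surplus


# Hypothetical numbers: demand P = 100 - Q, supply P = 10 + 2Q
q, p, cs, ps = linear_market_surplus(a=100, b=1, c=10, d=2)
print(q, p, cs, ps)  # 30.0 70.0 450.0 900.0
```

In this toy market the producer side captures the larger triangle, which is the kind of advantage the article argues is eroded when customers pay only for what they use.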
Figure 3: Source: “Disrupting The Car”

The automobile manufacturing industry is already starting to feel the impact of ride-sharing services on automobile demand. CNBC’s Mad Money article “Ride-sharing is killing car sales - and it's only going to get worse” (March 8, 2018) provides this perspective:

“How come the automakers are struggling when the rest of the economy is in such great shape? The main issue is clearly the rise of ride-sharing. Services like Uber and Lyft have brought a secular change to the world of transportation, offering far cheaper travel alternatives to owning a car, especially for city-dwellers.”

The economics of the automobile industry model are already starting to shift as fewer cars are sold, and those cars that are sold will be built for durability (200,000+ miles) and easier maintenance (and the maintenance aspect will be further impacted by the economics of electric vehicles). However, the maintenance and parts businesses will likely keep growing, as these ride-sharing services still need to keep their cars operational in order to reduce unplanned operational downtime; if they ain’t drivin’ the cars, they ain’t makin’ no money!

Xaas Business Model Keys to Success

What can be a big win for consumers and consuming organizations can be a big loss for manufacturers. To offset the loss of the Producer Surplus advantage, producers need to seek new economic advantages through the acquisition of superior customer, product and operational insights gathered from IoT, machine learning and artificial intelligence capabilities.

Superior insights into consumer product usage patterns coupled with superior insights into product performance patterns enable Xaas industrial manufacturers to determine the optimal operational, pricing and customer service (SLA) models to ensure a viable and profitable Xaas business model. 
The keys to Xaas business model success include the following:

Superior consumer product usage insights (product usage tendencies, inclinations, affinities, relationships, associations, behaviors, patterns and trends). Xaas players must be able to quantify and predict where, how and under what conditions the product will be used and the load on that product across numerous product usage dimensions including work type, work effort, time of day, day of week, time of year, local events, holidays, work week, economic conditions, weather, precipitation, air quality / particulate matter, water quality, remaining useful life, salvage value, etc.

Superior product operational insights (product performance or operational tendencies, inclinations, affinities, relationships, associations, behaviors, patterns and trends) to support product operational excellence use cases including reduction of unplanned operational downtime, predictive maintenance optimization, repair effectiveness optimization, inventory cost reductions, parts logistics optimization, elimination of O&E inventory, consumables inventory optimization, energy efficiencies, asset utilization, technician retention, remaining useful life, predicted salvage value, etc.

Superior data and instrumentation strategy; knowing what data is most important for what use cases and where to place sensors, RTU’s, and other instrumentation devices in order to capture that data so as to balance the costs of False Negatives (from lack of instrumentation) versus False Positives (from too much instrumentation).

Xaas business model profitability can be achieved when you marry all three of these data, instrumentation and analytics strategies, and this is a level of data, instrumentation and analytics well beyond what most industrial organizations are contemplating today. 
It is only when these three dimensions are optimized that one can optimize Xaas business model success through a “smart” operational environment that knows how to self-monitor, self-diagnose and self-heal (see Figure 4).

Figure 4: 3 Stages of Creating Smart

See the blog “3 Stages of Creating Smart” for more details on how leading industrial organizations are leveraging data and analytics to digitally transform their operational and business models.

As anyone who is working to achieve their Big Data MBA knows, the organizations that win in this era of digital transformation are those organizations that successfully leverage data and analytics to digitally power their business models (see Figure 5).

Figure 5: The Big Data Business Model Maturity Index

[1] “Turning Products Into Services”
[2] “How Can You Sell Air? As a Service, Of Course”
[3] “Why do Uber and automation really matter? Because we barely drive the cars we own.”
[4] “The Average New Car Price Is Unbelievably High”

As a senior data science professional and analytics manager, I get countless requests for job search advice and resume feedback, along with heart-breaking stories from brilliant students who are unable to snag a job in this exciting field. There are tons of books on how to learn the skills to become a data scientist or data analyst, but none to prepare folks for the frustrating job search.

I've repeated this advice to dozens of people, most of whom found their dream data science job with companies like LinkedIn, Walmart, Comcast and many more. These strategies are now available on Kindle in the form of an ebook, "Data Science Jobs - land a lucrative job in 90 days". Amazon book link here.

Who Should Read this Book?

Students with computer science or math majors, looking to find a job in the data science field.
International students on an F-1/OPT visa looking for employment after a graduate degree in analytics.
Employed professionals looking to pivot their career, or seeking better pay/manager/location.
Students from coding bootcamps or online nanodegrees, who are embarking on the job search journey.

The book lists techniques that allow you to put your resume directly in the hands of hiring managers and decision makers, instead of relegating it to the Black Hole of online application systems. The book is deliberately kept short so that you can read through it quickly and apply these principles to succeed in your job search.

The book chapters can be broadly classified into the themes below:

Personal Branding - Create an online profile that helps you bubble up when hiring managers look for candidates. Make the jobs come to you! Tips to tweak your resume to achieve the same.
LinkedIn - the grandfather site for job search. 
The chapter shows you some creative ways to leverage LinkedIn, not simply accept connections or make merry with the "Apply" button.
Strategic Networking - don't passively hope to make connections, seek them out.
Niche sites - including my favorite online community, DataScienceCentral.
Upwork - despite popular opinion (about the site's ineffectiveness), this site is a quick way to earn money and position yourself for your dream role.
And many more...

LinkedIn

Chapters on LinkedIn and personal branding teach you:
How to fill out your profile, so that recruiters and hiring managers come chasing you, instead of the other way around.
Using endorsements to improve SEO for your profile. Websites use SEO to be placed on the first page of search rankings. These tips will help you stand front and center when managers look for candidates.
Use the "content" tab to find jobs and hiring managers!

Strategic Networking

Network strategically, with a purpose. Unless you know what position you want, you will never be able to get well-wishers to find one for you.
Where to look for local "hiring" events, instead of attending random meetups.
Don't scoff at recruiters; they can be the allies who halve your "job-hunting" time.

Niche Sites

Hiring managers have finally figured out that data science communities are the best venues to seek talent. So scour the job pages on specialized communities like our very own DataScienceCentral.com (DSC), Kaggle and KDnuggets. If you need interview prep help, then DSC has some amazing content to help out in that arena also. If you are looking for work in a big city, then try Twitter as well.

Upwork

Most folks who write disparagingly about this site are the ones who never made a penny from it. Personally, I've found success with the site, earning within a week of joining. My experience helped me pad my portfolio with unique "live" projects and helped me learn other soft skills that have been invaluable in later roles at NASDAQ and TD Bank. 
Upwork does take time, and being selective about bidding is key. The beauty is that no matter what your skill level, you can start quickly with no caps on your earning potential! The book chapter on Upwork reveals the strategies to help you replicate my success.

In conclusion, this book is a condensed guide with practical strategies to make the job search process less stressful and help readers get hired quickly. So get the ebook on Amazon, and get started on a lucrative career! I read every review, so do leave your feedback in the comments below.

AI is a new buzzword, but there has been a lot of talk around analytics and its prospects for over a decade. Here we present the perspective that AI is a continuum of analytics. Let us see how.

Different approaches in data science and analytics originated in the field of statistics. AI, on the other hand, emerged out of computer science as the practice and science of studying "intelligent agents". One way forward is to treat AI as extending conventional analytics with a new dimension, namely cognitive analytics, which is an extension of existing forms of analytics. Conventional analytics stops at prescriptive analytics, based on statistical techniques. AI enables us to have cognitive capabilities in analytics, where human intelligence is mimicked.

Conventional analytics focuses on the information (BI, or business intelligence) level and extends it to the extraction of patterns and predictions, as in conventional predictive analytics. But in extending this to systems displaying wisdom, we enter the realm of AI. AI provides autonomy in the form of wisdom to take automatic decisions based on inferences from raw data. So AI can be the enabler for creating systems capable of wisdom, and it extends conventional data science.

AI and analytics can therefore be seen as a continuum rather than as separate fields. This also creates a strong need for AI scientists and data scientists to work across disciplines within the business to educate customers and deliver AI and analytics solutions to them.

Closing thoughts:

AI is a continuous extension of analytics, endowing applications with wisdom and autonomy, whereas analytics stops at the level where a human uses predictive and other kinds of analytics to make a decision. However, the core algorithms serving these levels are similar, such as machine-learning-based classification, regression or clustering. 
The difference is that AI is more powerful on unstructured data, through its vision and NLP algorithms.

I hope this serves as an open call for a unified approach to analytics and AI.