New Similarity Methods for Unsupervised Machine Learning

1. Introduction

Data science is changing the rules of the game for decision making. Artificial intelligence is living its golden years: abundant data, cheap computing capacity, and devoted talent point toward an unstoppable, intelligence-assisted life for humans. While it is common to hear about AI advice on health or financial investments, AI advice on business strategy is much less common. Maybe it is just a matter of time before AI learns how to handle data to support strategic decisions, but it could also be that it lacks a theoretical framework to build on. Following the competitive dynamics approach proposed in the article Strategizing with Competitive Asymmetry, a quantitative model was built to bridge this gap between business strategy and data science. In this article, I outline an experiment that uses this framework to compare competitors' data arranged in vectors. The outcome was an alternative similarity measure, Projection Similarity, which is as accurate as Cosine Similarity but asymmetric.

2. Using Cosine Similarity

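The competitive dynamics model used in Strategizing with Competitive Asymmetry has two dimensions, Market Commonality and Resource Similarity. The possible combinations of high and low values on each dimension form a two-by-two matrix:

[Image 1: two-by-two matrix of Market Commonality and Resource Similarity]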

Source: Competitor Analysis and Interfirm Rivalry: Toward a Theoretical Integration, Ming-Jer Chen, Academy of Management Review, 1996, Vol. 21, No. 1, 100-134.

Under this approach, each company was characterized by one vector per dimension, containing several determinant traits of its markets or its resources. Cosine Similarity was initially used to compare the vectors pairwise, but two problems arose.

First, Cosine Similarity is symmetric. The similarity of vector A with respect to vector B is the same as that of vector B with respect to vector A. Cosine Similarity therefore fails to represent competitive asymmetry.
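The symmetry is easy to check numerically. Below is a minimal sketch in plain Python; the function name and the two example vectors are illustrative assumptions, not data from the experiment.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (both assumed non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical trait vectors for two companies (illustrative values only).
a = [3.0, 1.0, 0.0]
b = [1.0, 1.0, 1.0]

sim_ab = cosine_similarity(a, b)  # A with respect to B
sim_ba = cosine_similarity(b, a)  # B with respect to A
print(sim_ab, sim_ba)             # the two values coincide
```

Because the dot product and the product of the norms do not depend on the order of the arguments, swapping A and B cannot change the result.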

Second, the similarities were very high. In a two-by-two matrix like Image 1 above, the intuitive threshold for classifying a data point as high or low is 50%. Above 50%, the two data points compared are more likely to be similar than not, so they are classified as "high"; below 50%, they are classified as "low". With Cosine Similarity, even radically different companies had similarities above 50%. This problem is solvable if there is a training data set from which to find the optimal threshold instead of assuming 50%. That is the case for Market Commonality, where the industry and the countries in which a company operates are known. For unsupervised classification, however, there is no such data set, so whether the optimal threshold falls at the intuitive 50% has a significant impact on the accuracy of the classification. That is the case for Resource Similarity, where the skills of a company are neither easily nor publicly known.

3. Using Projection Similarity

An alternative method to compare the vectors was used in order to have asymmetric similarity. The projection similarity of vector A in relation to vector B was calculated as follows:

1. Calculate the orthogonal projection of vector A onto vector B

2. Divide the norm of the orthogonal projection by the norm of B, which gives the norm of the projection relative to the norm of B

3. Subtract 1 from the result of step 2 and take the absolute value (to take advantage of the symmetric distributions)
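The three steps above can be sketched as follows. This is a minimal plain-Python illustration, assuming B is a non-zero vector; the function name and example vectors are hypothetical and not taken from the original experiment.

```python
import math

def projection_difference(a, b):
    """Difference statistic of vector a in relation to vector b (b non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_b_sq = sum(y * y for y in b)
    # Step 1: orthogonal projection of a onto b.
    projection = [dot / norm_b_sq * y for y in b]
    # Step 2: norm of the projection relative to the norm of b.
    ratio = math.sqrt(sum(p * p for p in projection)) / math.sqrt(norm_b_sq)
    # Step 3: subtract 1 and take the absolute value.
    return abs(ratio - 1.0)

# Hypothetical trait vectors (illustrative values only).
a = [3.0, 1.0, 0.0]
b = [1.0, 1.0, 1.0]

z_ab = projection_difference(a, b)  # A in relation to B, about 0.333
z_ba = projection_difference(b, a)  # B in relation to A, about 0.6
```

Unlike the cosine, the statistic changes when the roles of A and B are swapped, because the projection is normalized by the norm of the vector it is projected onto.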

This difference statistic can be used as the Z value of a standard normal distribution. The Standard Projection Similarity is twice the area under the cumulative distribution function from -∞ to -Z, that is, 2 · Φ(-Z), where Φ is the standard normal cumulative distribution function.
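As a rough sketch, the Standard Projection Similarity can be computed with the standard library's `statistics.NormalDist`. The function name is an assumption made for illustration.

```python
from statistics import NormalDist

def standard_projection_similarity(z):
    """Twice the standard normal CDF evaluated at -z, for z >= 0."""
    return 2.0 * NormalDist().cdf(-z)

s_zero = standard_projection_similarity(0.0)   # difference of 0 gives similarity 1
s_far = standard_projection_similarity(1.96)   # roughly 0.05
```

A difference statistic of 0 (the projection of A has exactly the norm of B) maps to a similarity of 1, and the similarity decays toward 0 as the difference grows.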

The difference statistic could also be used as the exponent of a logistic function to get the Logistic Projection Similarity.
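The exact logistic formula is not reproduced in the text, so the sketch below rests on an assumption: it uses 2 / (1 + e^Z), which mirrors the standard version by returning 1 when the difference statistic is 0 and decaying toward 0 as it grows. The author's actual function may be scaled differently.

```python
import math

def logistic_projection_similarity(z):
    """ASSUMED form: 2 / (1 + exp(z)), for z >= 0. Not confirmed by the source."""
    return 2.0 / (1.0 + math.exp(z))

l_zero = logistic_projection_similarity(0.0)          # 1.0
l_half = logistic_projection_similarity(math.log(3))  # 2 / (1 + 3) = 0.5
```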

In the next section, we will examine the validity and accuracy of this alternative method.

4. Calculations

At this phase, five different measures were calculated: Cosine Similarity, Standard Projection Similarity, Logistic Projection Similarity, Cosine Similarity multiplied by Standard Projection Similarity, and Cosine Similarity multiplied by Logistic Projection Similarity. Each dimension had its own data set, one for Market Commonality and one for Resource Similarity. The criterion for deciding whether two companies were similar was set by industry: if two companies are in the same industry, their similarity should be high in both dimensions; otherwise, it should be low. In both data sets, 32% of the outcomes were positive and 68% were negative. The chosen criterion carried the implicit assumption that companies in the same industry can differ, but not by much, in either dimension. The performance of each method as a function of the threshold value was as follows:

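[Chart: performance of each method versus threshold, Market Commonality]

[Chart: performance of each method versus threshold, Resource Similarity]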
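Performance curves of this kind can be produced by sweeping the threshold and scoring the resulting classification against the same-industry labels. The sketch below uses accuracy as the performance metric and a tiny made-up set of scores and labels purely for illustration; none of these numbers come from the experiment.

```python
def accuracy_at_threshold(scores, labels, threshold):
    """Share of pairs classified correctly when 'high' means score >= threshold."""
    predictions = [s >= threshold for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def sweep_thresholds(scores, labels, thresholds):
    """Return (threshold, accuracy) pairs for every candidate threshold."""
    return [(t, accuracy_at_threshold(scores, labels, t)) for t in thresholds]

# Toy data (hypothetical): similarity scores and same-industry labels.
toy_scores = [0.95, 0.80, 0.72, 0.66, 0.58, 0.40]
toy_labels = [True, True, False, False, False, False]

thresholds = [i / 20 for i in range(21)]  # 0%, 5%, ..., 100%
results = sweep_thresholds(toy_scores, toy_labels, thresholds)
best_threshold, best_accuracy = max(results, key=lambda r: r[1])
```

In this toy case the best threshold lands at 75%, while the intuitive 50% threshold classifies only half of the pairs correctly, which is the kind of gap the second problem with Cosine Similarity refers to.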

5. Conclusions

The two challenges of using Cosine Similarity were its symmetry and an optimal threshold value far from the intuitive 50%.

For the former, the experiment showed that Cosine Similarity multiplied by Logistic Projection Similarity delivered the best asymmetric similarity, with accuracy on par with Cosine Similarity.

For the latter, the optimal threshold of Cosine · Logistic was 5 percentage points below that of Cosine for Market Commonality (60% versus 65%) and 10 percentage points below for Resource Similarity (75% versus 85%). But those values were still far from 50%, so even though there was an improvement, the challenge for unsupervised classification remained.

Following closely in accuracy, the Cosine · Standard method had optimal thresholds of 55% for Market Commonality and 70% for Resource Similarity, which are slightly closer to 50%.

Read the original article on LinkedIn.





Comment by Ramon Serrallonga on January 16, 2019 at 7:49am

Sorry. Where I wrote "embedding" I wanted to mean ensemble.

Comment by Ramon Serrallonga on January 14, 2019 at 11:00am

Hi Klaus,

1. Data sets will always be empirical, won't they? How do you intend to test anything otherwise? Or are you suggesting that a theoretical framework to test is missing? Could you develop this a bit more, please?

2. I disagree. In data science I have seen a lot of "creative" techniques adopted for their effectiveness, like "embedding" different models from the same data set. This is how we do science: experimenting, going from the concrete case to the general one, not the other way around.

Comment by Klaus Wassermann on January 11, 2019 at 1:55am


The proposed argument is flawed in two different ways:

1. It is not possible, from a scientific standpoint, to compare similarity measures on empirical data sets.
2. The quality of a similarity measure cannot be evaluated by reference to another operationalization of similarity. The only valid way is a ceteris paribus experiment that is tied to the risk profile of the outcome/forecast.
