In the previous example we used clustering to see if an apparent pattern exists within Brexit tweets. We found out that we have three distinct patterns, the leave, the referendum, and Brexit. This in itself helps us think that we may even create a classifier that can identify if the tweet writer is pro or agains an issue automatically, with no human intervention.

Let's get back to the issues related to clustering. To use the clustering algorithm we had to map 2 tweets at the time to a binary vector. In the initial clustering approach we did not remove low frequency words, in spite of practitioners recommendation. In this blog, we will see if a different mapping of the tweets, different dissimilarity or similarity distances, and if removing low frequency words change the outcome of clustering function. While this work is just a simple exercise, it does help us define issues in our more pressing problems.

From the text mining literature, it appears that practitioners tend to utilize Cosine Distance to compare 2 documents. They have used it with great success. From our previous blog, we also used Cosine Distance and we also found it extremely good and helping us, and our clustering method, get an insight in the UK Exit Referendum. In here, we decided to change our initial conditions and see if we get different outcomes,i.e. different number of clusters or different clusters size. We decided to try 4 others distances: Jaccard, Matching, Rogers Tanimoto and Euclidean. Except of Euclidean distance, all the other three distances work with boolean vectors.

Distance | # Clusters | Clusters Size |

Jaccard | 3 | 381, 667, 613 |

Matching | 3 | 980, 275, 406 |

Rogers | 3 | 980, 275, 406 |

Euclidean | 3 | 980, 275, 406 |

Cosine | 1 | 1661 |

Jaccard Dissimilarity: (N10 + N01)/ (N11+N10+N01) where Nij is the correspond pairs of elements in tweet1 and tweet2, and N is count of numbers of pairs.

Matching Dissimilarity: (N01+N10)/number of words in tweet1 Here we assume that the tweets have the same number of words.

Rogers Tanimoto Dissimilarity: 2 (N10 + N01)/(N11+2(N10+N01)+N00)

Euclidean Distance: Norm[Tweet1-Tweet2] or the square root of the sum of the square differences between values in Tweet1 and Tweet2.

You can find a full list of distances in IBM Knowledge Center

We normalized our tweets, and put each in a matrix of 1661 (number of tweets) rows, and 3239 (number of distinct words in tweets) columns. Jaccard dissimilarty, Matching Distance, Rogers Tanimoto dissimilarity, Cosine Distance, and Euclidean Distance. We printed the results in the following table:

Distance | # Clusters | Clusters Size |

Jaccard | 3 | 381, 667, 613 |

Matching | 3 | 980, 275, 406 |

Rogers | 3 | 980, 275, 406 |

Euclidean | 3 | 980, 275, 406 |

Cosine | 1 | 1661 |

From this table, you can see that except for Cosine Distance, all the other distances resulted in the same number of clusters. Furthermore, Rogers, Euclidean, and Matching all had exactly the same number of clusters and the clusters had the same size. In Comparison of Similarity Coefficients Used for Cluster Analysis Based on RAPB Markers in Wild Olive M. Sesli and E. D. Yegenoglu confirms that Rogers and Matching results tend to be highly correlated, which explains the similar results between Matching and Rogers. They also found very little correlation between Jaccard and Rogers, which also explains the different clustering results.

#### Short Insight Into the Clusters Content:

In the clusters created thank to Jaccard, cluster 2 doesn't have remain or referendum as a top frequency word. Cluster 3 has mostly the mention of **referendum** and **Brexit** with **leave** and **remain** frequencies similar in values. In the clusters created by Rogers, Matching, or Euclidean distances the word **referendum** is the most frequent in cluster 3, but not in the top 10 in cluster 2. Furthermore, in clusters 3 and 2, **Brexi**t is not in the top 10 most frequent word, while **leave **is among the top 10 in all three clusters; cluster 2 in Rogers et.al has the top frequency word **leave**. Hence, while Jaccard distance resulted in the production of 3 clusters their content differs from the clusters created by Matching, Rogers, and Euclidean.

Figure 1. 10 Most Frequent Words for 3 Clusters using Jaccard Figure 2. 10 Most Frequent Words for Matching/Rogers/Euclidean

When we remove low frequency words, words that appear only once in all the tweets, and normalize the tweets, we find that both Euclidean and Cosine return only one cluster. They fail to distinguish between tweets. The use of Jaccard returns only 2 clusters, while Match and Rogers return 3 clusters of size 619, 667, and 360.

Distance | # Clusters | Clusters Size |

Jaccard | 2 | 962, 684 |

Matching | 3 | 619,667, 360 |

Rogers | 3 | 619,667, 360 |

Euclidean | 1 | 1646 |

Cosine | 1 | 1646 |

In here, we did only 2 tweets normalization (see previous blog). As you notice, Jaccard and Matching had the same results; and Rogers, Euclide, and Cosine had the same results. It is interesting that Rogers performance has been consistent irrespective of when normalization has been performed. However, the performance with the other distances has been dependent on the normalization of the tweets.

Distance | # Clusters | Clusters Size |

Jaccard | 2 | 1241, 405 |

Matching | 2 | 1241, 405 |

Rogers | 3 | 619, 667, 360 |

Euclide | 3 | 619, 667, 360 |

Cosine | 3 | 619, 667, 360 |

By the way, we also tried Ochiai Distance which is the binary version of Cosine and, as expected, the results matched Cosine Distance.

#### Short Insight into the Clusters Content:

In the Jaccard and Matching where 2 clusters were found, the first cluster had **Brexit** and **leave** as the most frequent words, while in the second one the most frequent word was referendum. Still, **leave** was in the top 10 most frequent words in the referendum cluster. On the other hand, for the 3 clusters formation, just like the findings in the previous blog, the first cluster had leave and EU as the 2 most frequent words, but it doesn't have the word **eu**; the second cluster had **Brexit** but**leave** was among the top 5 most frequent words; finally, the third cluster had Referendum and remain as the top 2 most frequent words. What is interesting in Figure 4, and Figure 2 is that they match in the top 3 values in 2 clusters: cluster 3 from both approach, and cluster 2 from Figure 4 and cluster 1 from Figure 2. One also notices that the top 3 most frequent words in both clusters in Figure 3 match the top most frequent words in cluster 2 and cluster 3.

Figure 3. 10 Most Frequent Words for 2 Clusters Figure 4. 10 Most Frequent Words for 3 Clusters

The choice of distance, and whether to locally or globally normalize tweets has a direct effect on the outcome of a clustering outcome. While other practitioners have shown a high correlation between Matching distance and Rogers and Tanimoto dissimilarity distance, our little experiment shows different outcome when we perform localized normalization. We were surprised to see that Cosine distance performed the worse during globalized normalization.

Even the content of the clusters dependent on not only normalization (global or local) and if we removed low frequency words. Still, by removing the low frequency word, the clusters gave better insight. Even in the 2 clusters formation, The clusters brought to our attention the fact that the issue was debated into 2 fronts:**leave** and **remain**. Furthermore, they were all talking about referendum or Brexit. The three clusters formation went further by creating a cluster where **leave **is the most talked about. In short, choose your dissimilarity distance carefully, remove low frequency words, and perform local normalization.

Note: I'm still looking for a better word to normalization in this context.

Even the content of the clusters dependent on not only normalization (global or local) and if we removed low frequency words. Still, by removing the low frequency word, the clusters gave better insight. Even in the 2 clusters formation, The clusters brought to our attention the fact that the issue was debated into 2 fronts:

Note: I'm still looking for a better word to normalization in this context.

© 2020 TechTarget, Inc. Powered by

Badges | Report an Issue | Privacy Policy | Terms of Service

**Most Popular Content on DSC**

To not miss this type of content in the future, subscribe to our newsletter.

- Book: Applied Stochastic Processes
- Long-range Correlations in Time Series: Modeling, Testing, Case Study
- How to Automatically Determine the Number of Clusters in your Data
- New Machine Learning Cheat Sheet | Old one
- Confidence Intervals Without Pain - With Resampling
- Advanced Machine Learning with Basic Excel
- New Perspectives on Statistical Distributions and Deep Learning
- Fascinating New Results in the Theory of Randomness
- Fast Combinatorial Feature Selection

**Other popular resources**

- Comprehensive Repository of Data Science and ML Resources
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- 100 Data Science Interview Questions and Answers
- Cheat Sheets | Curated Articles | Search | Jobs | Courses
- Post a Blog | Forum Questions | Books | Salaries | News

**Archives:** 2008-2014 |
2015-2016 |
2017-2019 |
Book 1 |
Book 2 |
More

**Most popular articles**

- Free Book and Resources for DSC Members
- New Perspectives on Statistical Distributions and Deep Learning
- Time series, Growth Modeling and Data Science Wizardy
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- Comprehensive Repository of Data Science and ML Resources
- Advanced Machine Learning with Basic Excel
- Difference between ML, Data Science, AI, Deep Learning, and Statistics
- Selected Business Analytics, Data Science and ML articles
- How to Automatically Determine the Number of Clusters in your Data
- Fascinating New Results in the Theory of Randomness
- Hire a Data Scientist | Search DSC | Find a Job
- Post a Blog | Forum Questions

## You need to be a member of Data Science Central to add comments!

Join Data Science Central