I am building a matching algorithm using ML. The project is to match internal customer data with external customer data. The features are name, address, city, state, and zip.

We create pairs between the two data sets, calculate the cosine similarity for each feature, and then pass the cosine values for all feature pairs to a Gaussian mixture model. We started with 2 clusters, expecting one match cluster and one no-match cluster. But the model does not build a single match cluster; matches end up in both clusters.
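To make the input to the mixture model concrete, here is a minimal sketch of one way to turn a record pair into per-feature cosine values. It assumes character trigrams as the text representation, which the post does not specify; the helper names (trigrams, cosine, comparison_vector) and the sample records are made up for illustration, and this is plain Python rather than Spark.

```python
from collections import Counter
from math import sqrt

# Features listed in the post; the dictionary keys are an assumption.
FIELDS = ("name", "address", "city", "state", "zip")

def trigrams(text):
    """Count character trigrams of a lowercased, space-padded string."""
    text = text.strip().lower()
    if not text:
        return Counter()  # empty field: no trigrams, so similarity will be 0
    padded = f"  {text} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def comparison_vector(internal, external):
    """One cosine value per feature for a candidate pair of records."""
    return [cosine(trigrams(internal[f]), trigrams(external[f])) for f in FIELDS]

# Illustrative records, not real data.
internal = {"name": "John Smith", "address": "12 Main St",
            "city": "Austin", "state": "TX", "zip": "78701"}
external = {"name": "Jon Smith", "address": "12 Main Street",
            "city": "Austin", "state": "TX", "zip": "78701"}

vector = comparison_vector(internal, external)
print(dict(zip(FIELDS, vector)))
```

Each candidate pair becomes one five-element row like this, and those rows are what the clustering step sees.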

Before passing the values to the model, I apply a standard scaler and a min-max scaler, but I still don't get clearly separated match and no-match clusters. If we increase the number of clusters, the same thing happens.
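For reference, the two scalings can be sketched on a single feature column as follows. This is the textbook form in plain Python, not Spark ML's implementation (whose defaults may differ), and the similarity values are invented for illustration.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale a column linearly so its minimum is 0 and its maximum is 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to spread out
    return [(v - lo) / (hi - lo) for v in values]

def standard_scale(values):
    """Shift a column to zero mean and divide by its population std deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Illustrative cosine values for one feature across five candidate pairs.
name_similarity = [0.95, 0.10, 0.88, 0.05, 0.15]

scaled = min_max_scale(name_similarity)
standardized = standard_scale(name_similarity)
print(scaled)
print(standardized)
```

Both are per-feature linear rescalings, so the ordering of pairs within a feature is unchanged; only the range and spread of each column differ.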

A match could be high cosine similarity in name, address, state, city, and zip, or in name, address, and zip, or in any other combination. We are dealing with a huge volume of data, so we are using Spark ML.
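The "any combination" idea of a match can be written down explicitly, for example to check which cluster the expected matches landed in. The sketch below is only an illustration: the 0.8 threshold is a hypothetical cut-off for "high", the function name is made up, and only the two combinations named above are listed.

```python
# The two field combinations named in the post; others could be appended.
MATCH_COMBINATIONS = [
    ("name", "address", "city", "state", "zip"),
    ("name", "address", "zip"),
]

HIGH = 0.8  # hypothetical cut-off for "high" cosine similarity

def matched_combination(similarities, combinations=MATCH_COMBINATIONS, threshold=HIGH):
    """Return the first combination whose fields are all at or above the
    threshold, or None if no listed combination qualifies."""
    for combo in combinations:
        if all(similarities[field] >= threshold for field in combo):
            return combo
    return None

# Illustrative pair: name, address and zip agree, city does not.
pair = {"name": 0.93, "address": 0.88, "city": 0.20, "state": 1.0, "zip": 1.0}
print(matched_combination(pair))

# Illustrative pair with a weak name similarity: no listed combination applies.
other = {"name": 0.30, "address": 0.90, "city": 1.0, "state": 1.0, "zip": 1.0}
print(matched_combination(other))
```

Because the combinations are checked in order, a pair that satisfies all five fields is reported under the first entry even though it also satisfies the second.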

How can we achieve optimal clustering?


© 2019   Data Science Central ®   Powered by
