What clustering method is required for text documents?

Let's say a set of documents S contains a large number of 'pure' texts.

To all documents in S, I apply a spelling normalisation method, which yields a normalised set S'.
Then I use the chosen method M (which method?) to cluster S, obtaining a clustering result C.
Then I use the same method M to cluster S', obtaining a clustering result C'.
Finally, I need to test whether there are statistically significant differences between C and C'.

Any help in identifying what technique or method (M) I should use for clustering the text documents?
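
Whichever method M is chosen, the last step (comparing C and C') needs a measure of agreement between two clusterings of the same documents. One common choice, offered here as an assumption and not something this thread settles on, is the adjusted Rand index (ARI). The sketch below computes it in plain Python; the label lists are made-up cluster assignments, where position i holds the cluster of document i in S and in S'.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Agreement between two clusterings of the same items.

    1.0 means identical groupings (cluster names may differ),
    values near 0 mean agreement no better than chance.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same documents")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two documents: nothing to disagree on

    # Pairs of documents placed together in both clusterings.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each clustering separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every document in one cluster
    return (together_both - expected) / (maximum - expected)


# Hypothetical cluster labels for six documents before (C) and after (C')
# spelling normalisation.
C = [0, 0, 1, 1, 2, 2]
C_prime = [1, 1, 0, 0, 2, 2]
print(adjusted_rand_index(C, C_prime))
```

An ARI close to 1 would suggest that normalisation barely changed the grouping. Judging statistical significance would need something more, such as repeated clustering runs or a permutation test, which this sketch does not attempt.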

Tags: Text, clustering, document

Comment

Comment by MUSHTAQ AHMAD on May 27, 2015 at 9:41am

Vincent, thanks a lot

Comment by MUSHTAQ AHMAD on May 27, 2015 at 8:47am

Thank you for the hint. I am thinking of using tf-idf for this, but how do I do it automatically?
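
One possible reading of "automatically" here: score every term in each document with tf-idf and keep the highest-scoring terms as that document's tags. The following is a minimal sketch under stated assumptions: tokens are split on whitespace, the three-document corpus is invented for illustration, and `tfidf_tags` is a hypothetical helper name, not part of any library.

```python
import math
from collections import Counter


def tfidf_tags(docs, k=3):
    """Return the k highest tf-idf terms of each document as its tags."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))

    tags = []
    for tokens in tokenised:
        term_counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        }
        # Highest score first; ties are broken alphabetically so the
        # result is deterministic.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        tags.append(ranked[:k])
    return tags


# Made-up corpus for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the stock market fell sharply",
]
for doc, doc_tags in zip(docs, tfidf_tags(docs, k=2)):
    print(doc, "->", doc_tags)
```

A term that occurs in every document (here "the") gets a score of zero and so never outranks a rarer term. Documents could then be grouped by shared tags, or the tf-idf scores themselves could serve as feature vectors for a clustering method.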

Comment by Vincent Granville on May 27, 2015 at 8:41am

What about indexation or automated tagging?

© 2019 Data Science Central ®