What clustering method is required for text documents?

Let's say a set of documents S contains a large number of 'pure' texts.

On all documents in S, I apply a spelling normalisation method, which yields a normalised set S'.
Then I use the chosen method M (which method?) to cluster S, obtaining a clustering result C.
Then I use the same method M to cluster S', obtaining a clustering result C'.
Finally, I need to check whether there are statistically significant differences between C and C'.

Any help in identifying what technique or method (M) I should use for clustering the text documents?
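
To make the comparison step concrete: whichever method M is chosen, C and C' assign a cluster label to the same documents, so their agreement can be measured with the adjusted Rand index (ARI). The sketch below is only an illustration under assumptions. The function name `adjusted_rand_index` and the toy label lists are hypothetical, and the ARI measures chance-corrected agreement; it is not a significance test by itself (that would need, for example, repeated runs or a permutation scheme on top). scikit-learn offers an equivalent `adjusted_rand_score` in `sklearn.metrics`.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two clusterings of the same items.

    labels_a[i] and labels_b[i] are the cluster ids given to document i
    in C and C' respectively (hypothetical inputs for this sketch).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same documents")
    n = len(labels_a)
    if n < 2:
        # No document pairs to compare; treat as trivially identical.
        return 1.0

    # Count document pairs that share a cluster in both results, in C, and in C'.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / comb(n, 2)   # agreement expected by chance
    max_index = (sum_a + sum_b) / 2         # agreement of a perfect match
    if max_index == expected:
        # Degenerate case, e.g. every document in one cluster in both results.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Toy labels standing in for C (from S) and C' (from S').
C = [0, 0, 1, 1, 2, 2]
C_prime = [1, 1, 0, 0, 2, 2]   # same grouping, different label names
print(adjusted_rand_index(C, C_prime))
```

A value near 1 means normalisation barely changed the grouping, and a value near 0 means the two results agree no more than random labelings would. Cluster ids do not need to match between C and C', since only co-membership of document pairs is compared.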


Tags: Text, clustering, document



Comment by MUSHTAQ AHMAD on May 27, 2015 at 9:41am

Vincent, thanks a lot

Comment by MUSHTAQ AHMAD on May 27, 2015 at 8:47am

Thank you for the hint. I am thinking of using tf-idf for this, but how can it be done automatically?
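
One possible reading of "automatic" here is to let tf-idf scores pick the tags: for each document, keep the few terms with the highest tf-idf weight. The sketch below assumes that reading. The helper names `tokenize` and `tfidf_tags`, the lowercase letters-only tokenizer, the `top_k` parameter, and the sample sentences are all hypothetical, and it uses the plain `tf * log(N / df)` weighting rather than any particular library's variant.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Simplistic assumption: lowercase alphabetic tokens only.
    return re.findall(r"[a-z]+", text.lower())


def tfidf_tags(documents, top_k=3):
    """Return the top_k highest-scoring tf-idf terms of each document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    tags = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        tags.append(ranked[:top_k])
    return tags


docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
print(tfidf_tags(docs, top_k=1))
```

Terms that occur in every document get a weight of zero, so common words drop out without a stop-word list. The same vectors could also serve as input to a clustering method, so tagging and clustering can share one representation.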

Comment by Vincent Granville on May 27, 2015 at 8:41am

What about indexation or automated tagging?
