Social media provide a low-cost alternative source for public health surveillance and health-related classification plays an important role to identify useful information. We summarized the recent classification methods using social media in public health. These methods rely on bag-of-words (BOW) model and have difficulty grasping the semantic meaning of texts. Unlike these methods, we present a word embedding based clustering method. Word embedding is one of the strongest trends in Natural Language Processing (NLP) at this moment. It learns the optimal vectors from surrounding words and the vectors can represent the semantic information of words. A tweet can be represented as a few vectors and divided into clusters of similar words. According to similarity measures of all the clusters, the tweet can then be classified as related or unrelated to a topic (e.g., influenza). Our simulations show a good performance and the best accuracy achieved was 87.1%. Moreover, the proposed method is unsupervised. It does not require labor to label training data and can be readily extended to other classification problems or other diseases.
Keywords: Big data, Word Embeddings, Word2Vec, Machine learning, Natural Language Processing, Clustering Process, unsupervised classification
The method consists of 3 steps:
Step 1: text preprocessing - Social media are informal, less structured, and contain misspellings and non-textual information.
Step 2: clustering process - This step divides a tweet into clusters of words.
Our clustering process is inspired by the following picture - the words are clustering together with topics in a vector space.
A tweet can be divided into clusters of words
Step 3: similarity measure
Cosine similarity is used to identify whether a cluster is related to a topic
Two vectors are highly similar if their cosine similarity value is approaching 1.
PDF Download Links
Views: 46001
Tags: Big, Clustering, Embeddings, Language, Machine, Natural, Process, Processing, Word, Word2Vec, More…classification, data, learning, unsupervised
Comment
The recent classification approaches in public health are summarized into three main categories: keywords-based approaches, learningbased approaches and lexicon-based approaches
© 2021 TechTarget, Inc.
Powered by
Badges | Report an Issue | Privacy Policy | Terms of Service
Most Popular Content on DSC
To not miss this type of content in the future, subscribe to our newsletter.
Other popular resources
Archives: 2008-2014 | 2015-2016 | 2017-2019 | Book 1 | Book 2 | More
Most popular articles
You need to be a member of Data Science Central to add comments!
Join Data Science Central