Any author would like to know whether an article will be successful. Here is our attempt to address that question.
Data and tools
We crawled 5000 URLs and for each URL we downloaded the title, body of the article and parameters: number of likes (not including Facebook likes), number of comments, number of views, article creation date and date of the last comment.
First, we got rid of empty (or deleted), very short (less than 100 characters long) and “not found” articles, thus getting 2000 articles with associated parameters. Then we removed articles with missing parameters and ended up with only 1207 articles.
Second, we tokenized every article. We removed all punctuation marks, stop words, and all words shorter than 3 characters (except abbreviations). Because Data Science Central is a community dedicated to specialized topics, and most of its terminology is uncommon in general English, we used the 2000 most common English words as the stop-word list. This preserved the terminology needed for topic identification and significantly reduced the size of the vector space for topic modeling. Finally, we lemmatized all tokens.
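The filtering step can be sketched roughly as follows. This is a minimal illustration, not the code used in the study: the tiny COMMON_WORDS set is a hypothetical stand-in for the 2000-word list, the rule "all-caps tokens of two or more letters are abbreviations" is an assumption, and lemmatization is left out.

```python
import re

# Hypothetical stand-in for the list of the 2000 most common English words.
COMMON_WORDS = {"the", "and", "for", "with", "that", "this", "team", "using"}

def tokenize(text, stop_words=COMMON_WORDS, min_len=3):
    """Return tokens without punctuation, stop words, or short words.

    Assumption: an all-caps token of two or more letters (e.g. "IT") is an
    abbreviation and is kept unchanged, bypassing the length and stop-word checks.
    """
    tokens = []
    for raw in re.findall(r"[A-Za-z]+", text):  # letters only, so punctuation is dropped
        if raw.isupper() and len(raw) >= 2:
            tokens.append(raw)
            continue
        word = raw.lower()
        if len(word) < min_len or word in stop_words:
            continue
        tokens.append(word)
    return tokens

print(tokenize("The IT team is using Hadoop for clustering."))
# ['IT', 'hadoop', 'clustering']
```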
Exploratory Data Analysis
We used plots to detect outliers. Here you can see some of them:
Figure 1. Plot of number of views versus bounce rate
Figure 2. Plot of number of views versus number of unique visitors
Based on the plots, we detected and removed the outliers (there were 7 of them).
To find the most probable number of topics across all articles, we ran cluster analysis for different numbers of topics. We used a hybrid topic modeling/clustering algorithm consisting of the well-known LDA topic modeling algorithm (with collapsed Gibbs sampling for inference), a Gaussian Mixture Model clustering algorithm, and a hard clustering approach. With this algorithm, we iteratively clustered the articles into 2 to 40 clusters, estimating the Silhouette coefficient (an internal clustering validation measure) on every iteration (to learn more about this approach please read this article). We took 6 as the best number of clusters and topics, because the Silhouette coefficient showed an obvious peak there (see Fig. 3).
Figure 3. Plot of number of topics versus Silhouette coefficient values
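To make the selection criterion concrete, here is a small pure-Python sketch of the Silhouette coefficient and of picking the cluster count with the highest score. It assumes Euclidean distance and takes ready-made cluster labels as input; the LDA and Gaussian Mixture steps are not reproduced, and the toy points and labelings are invented for illustration.

```python
from math import dist
from statistics import mean

def silhouette(points, labels):
    """Mean Silhouette coefficient of a labeling (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(mean(dist(point, q) for q in other)
                for other_label, other in clusters.items() if other_label != label)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(scores)

def best_cluster_count(points, labelings):
    """labelings maps a cluster count k to the labels produced for that k."""
    scores = {k: silhouette(points, labels) for k, labels in labelings.items()}
    return max(scores, key=scores.get), scores

# Toy data: three well-separated pairs of points.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
labelings = {2: [0, 0, 1, 1, 1, 1], 3: [0, 0, 1, 1, 2, 2]}
best_k, scores = best_cluster_count(points, labelings)
print(best_k)  # 3
```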
After that, we obtained keywords for the 6 topics to check whether this number of topics was indeed the best choice. Once we had confirmed that 6 topics was the optimal choice, we estimated TF-IDF scores for every word in every topic for later use in modeling.
The first question to be answered during the modeling phase was “how to measure success of an article?”
So, we decided to evaluate success as a combination of parameters (page views, likes, number of comments and lifetime) for an article belonging to a certain topic, compared to the mean/median values of those parameters for the other articles belonging to the same topic.
We estimated the minimum, maximum, mean and median values of page views, likes, number of comments and lifetime for all the articles in every cluster. To estimate the lifetime of an article, we assumed that it is the period during which readers react to the article with comments, so we defined lifetime as the period between the day the author posted the article and the day the last comment was published.
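As a minimal sketch of these two calculations, the snippet below measures lifetime in whole days and summarizes one parameter for one cluster. The function names and the sample values are hypothetical.

```python
from datetime import date
from statistics import mean, median

def lifetime_days(created, last_comment):
    """Days between the posting date and the date of the last comment."""
    return (last_comment - created).days

def summarize(values):
    """Minimum, maximum, mean and median of one parameter within one cluster."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
    }

# Hypothetical page-view counts for the articles of one cluster.
views = [100, 200, 900]
print(lifetime_days(date(2014, 1, 1), date(2014, 1, 31)))  # 30
print(summarize(views))
```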
We consider the article normal if the actual values of at least 3 of the 4 parameters fall between the median and mean values for the topic. If 2 or 3 parameters are above the mean value, we consider the article partially successful. If all the actual parameters of the article are higher than the mean values, we consider the article successful.
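The rules above could be written down roughly like this. Two things are assumptions of the sketch rather than part of the original method: the order in which the rules are checked, and the "unclassified" label for articles that match none of them. The parameter names and sample numbers are also hypothetical.

```python
PARAMS = ("views", "likes", "comments", "lifetime")

def label_article(article, topic_mean, topic_median):
    """Label one article against the mean and median values of its topic."""
    above_mean = sum(article[p] > topic_mean[p] for p in PARAMS)
    between = sum(
        min(topic_median[p], topic_mean[p]) <= article[p] <= max(topic_median[p], topic_mean[p])
        for p in PARAMS
    )
    if above_mean == 4:
        return "successful"
    if above_mean in (2, 3):
        return "partially successful"
    if between >= 3:
        return "normal"
    return "unclassified"  # assumption: the text does not name this case

# Hypothetical topic statistics.
topic_mean = {"views": 1000, "likes": 5, "comments": 3, "lifetime": 30}
topic_median = {"views": 800, "likes": 3, "comments": 1, "lifetime": 20}

article = {"views": 2000, "likes": 10, "comments": 0, "lifetime": 5}
print(label_article(article, topic_mean, topic_median))  # partially successful
```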
We ran LDA topic modeling (with collapsed Gibbs sampling) with 6 topics and extracted the top 500 keywords for every topic along with their TF-IDF values. Then we created a 1200x3000 training matrix (1200 articles by 3000 keywords, that is, 500 keywords for each of the 6 topics; the matrix was not Boolean, as we filled it with the corresponding TF-IDF weights) and trained a supervised classifier.
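For illustration, a non-Boolean article-by-keyword matrix can be built as in the sketch below. The exact TF-IDF variant used in the study is not stated, so the formula here (term frequency normalized by document length, times the natural log of inverse document frequency) is an assumption, as are the toy documents and keywords.

```python
from collections import Counter
from math import log

def tfidf_matrix(docs, keywords):
    """Rows are tokenized articles, columns are keywords, cells are TF-IDF weights."""
    n_docs = len(docs)
    # Number of documents in which each word appears at least once.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        row = []
        for word in keywords:
            if counts[word] == 0:
                row.append(0.0)  # keyword absent from this article
            else:
                tf = counts[word] / len(doc)
                idf = log(n_docs / doc_freq[word])
                row.append(tf * idf)
        matrix.append(row)
    return matrix

docs = [["hadoop", "cluster", "hadoop"], ["regression", "cluster"]]
keywords = ["hadoop", "regression", "cluster"]
matrix = tfidf_matrix(docs, keywords)
```

A word that occurs in every article, such as "cluster" in this toy example, gets a weight of zero under this variant, which is one reason other variants add smoothing to the IDF term.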
We carried out multiple experiments in order to choose the most suitable classifier and classification strategy. The SVM and Random Forest classifiers showed the best results, so we ended up with an SVM with a linear kernel (C=0.5) and a one-vs-all training strategy.
The result of 10-fold cross-validation was as follows:
In this section, we will answer the DSA questions.
How would you handle 500,000 pages rather than 5,000:
Question 1. The crawling, using distributed architecture, maybe 2-3 computers, or at least multiple threads of this process running on the same machine. You might find that running 20 web crawling processes in parallel on the same machine will boost performance by a factor 5. Explain why the improvement is less than a factor 20, but much bigger than a factor 1. How to design a web crawling algorithm that can be resumed with one click (from where you left just before the crash) if you lose power or your Internet connection suddenly goes off?
Answer: If we want to run a multi-threaded crawler with 20 threads on a PC with, say, two or three cores, we have to synchronize the crawler's threads. The threads need common storage for the crawled data and/or a common store of URLs, so that the same URL is not parsed multiple times and crawled data is not duplicated. If the storage is shared, only one thread may write to (or read from) it at a time, so the other threads must wait. In addition, multi-threading is limited by CPU performance and Internet connection bandwidth. That is why the performance boost will, unfortunately, not be a factor of 20. It is still much bigger than a factor of 1, because a crawler spends most of its time waiting for network responses, and those waits overlap across threads.
But, to be frank, we could boost the crawler's performance by almost a factor of 20 by having 20 cores (one core per thread) on our PC and no synchronization at all between the crawler's threads.
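One common way to reason about these numbers, which is not part of the original answer, is Amdahl's law: if a fraction p of the work can run in parallel on n workers, the speedup is 1 / ((1 - p) + p / n). The sketch below treats the crawler as if it followed this simple model, which is a rough assumption, and works backwards from the observed factor of 5.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Speedup predicted by Amdahl's law."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)

def implied_parallel_fraction(observed_speedup, workers):
    """Parallel fraction that would explain an observed speedup (Amdahl's law inverted)."""
    return (1.0 - 1.0 / observed_speedup) / (1.0 - 1.0 / workers)

p = implied_parallel_fraction(5, 20)
print(round(p, 3))  # 0.842
```

Under this model, a factor of 5 with 20 processes means that roughly 84% of the work overlaps, while the remaining 16% (shared storage, bandwidth, CPU) is effectively serialized.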
The simplest and most obvious way to create an easily “resumable” web crawler is the checkpointing strategy. Checkpointing is “writing a representation of the crawler’s state to stable storage that, in the event of a failure, is sufficient to allow the crawler to recover its state by reading the checkpoint and to resume crawling from the exact state it was in at the time of the checkpoint.” (from this article)
So, in the event of a failure, any work performed after the most recent checkpoint is lost, but none of the work up to the most recent checkpoint.
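A single-threaded sketch of the checkpointing idea is shown below. The crawler state is reduced to a frontier of pending URLs and a set of visited URLs; the JSON file layout, the function names, and the injected fetch callable (which returns the links found on a page) are all assumptions made for illustration, and storing the page contents themselves is left out.

```python
import json
import os
from collections import deque

def save_checkpoint(path, frontier, visited):
    """Write the crawler state to a temporary file, then rename it atomically."""
    state = {"frontier": list(frontier), "visited": sorted(visited)}
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the data reaches stable storage
    os.replace(tmp_path, path)  # a crash never leaves a half-written checkpoint

def load_checkpoint(path, seeds):
    """Restore the last checkpoint, or start from the seed URLs."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            state = json.load(f)
        return deque(state["frontier"]), set(state["visited"])
    return deque(seeds), set()

def crawl(seeds, fetch, path, every=100, max_pages=None):
    """Crawl breadth-first, checkpointing after every `every` pages."""
    frontier, visited = load_checkpoint(path, seeds)
    processed = 0
    while frontier and (max_pages is None or processed < max_pages):
        url = frontier.popleft()
        if url in visited:
            continue
        links = fetch(url)  # hypothetical: downloads the page and returns its links
        visited.add(url)
        frontier.extend(link for link in links if link not in visited)
        processed += 1
        if processed % every == 0:
            save_checkpoint(path, frontier, visited)
    save_checkpoint(path, frontier, visited)
    return visited
```

Running crawl again with the same checkpoint path resumes from the saved frontier, which is the "one click" restart asked about in the question. A smaller value of every loses less work after a crash but writes to disk more often.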
Question 2. The clustering process. How to scale from 5,000 pages to 500,000?
Answer: To handle 500 000 documents using the current algorithm we would try to lower the dimensionality of the feature matrix for topic modeling by:
1. Using a stop-word list of the 5000 most frequent English words
2. Selecting words (only nouns, verbs and adjectives) with a high TF-IDF score, say, 0.7 and higher
3. Getting rid of very small articles, say, less than 500 characters long.
These actions will significantly decrease the size of the input feature matrix for topic modeling. For example, the authors of this article managed to decrease the number of unique tokens from 272,926 to 65,776 for 2,153,769 documents. Assuming that our measures decrease the dimensionality of the matrix to 450 000 x 65 000, that should be enough to conduct topic modeling on one PC with decent performance.
Topic modeling with LDA will decrease the dimensionality of the feature matrix to m x n, where m is the number of articles and n is the number of topics. The values of the matrix are the probabilities that a document belongs to a particular topic. Assuming that we may have about 500 topics for 500 000 articles, the input matrix for clustering will be 450 000 x 500. This can also be processed on a PC (even keeping in mind the high computational complexity of clustering).
There are, however, other strategies for handling a large set of articles (we would use them to handle 1-2 million articles or more):
1. Distributed topic modeling and clustering algorithms, like Approximate Distributed LDA (perhaps with a variational EM algorithm for inference rather than Gibbs sampling, even the collapsed variant) (see, for example, this article)
2. RRI (Reflective Random Indexing) to decrease the dimensionality for clustering.
3. Other algorithms for large-scale clustering of text data (not connected with topic modeling).
Question 3. Keyword recognition. How to make sure that 'San Francisco' or 'IT Jobs' are recognized as one keyword, rather than being broken down in two tokens, or 'IT' being mistaken for the stop keyword "it" and just ignored?
Answer. In general, this class of problems is called Entity recognition.
There are some simple yet decent algorithms, mostly based on word co-occurrence detection:
1. Simple n-gram approach: just consider the most frequent n-gram collocations (simple but not very efficient)
2. Entity recognition using (frequency-based) Symmetric Conditional Probability, which gives decent entity recognition quality (we’re likely to use it for the general entity recognition task in our work)
3. Other statistical methods, like computing mutual information for a pair of words and then choosing the pairs with the highest MI score.
4. You name it
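The two statistical scores from the list can be sketched for word pairs as follows. The definitions used here are the common textbook ones, SCP(x, y) = p(x, y)^2 / (p(x) p(y)) and pointwise MI = log2(p(x, y) / (p(x) p(y))), with all probabilities estimated as counts divided by the number of tokens; that estimation choice and the toy sentence are assumptions. Tokens keep their original case, so "IT" stays distinct from "it".

```python
from collections import Counter
from math import log2

def bigram_scores(tokens):
    """Score every adjacent word pair with SCP and pointwise mutual information."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n  # approximation: same normalizer as for single words
        p_x = unigrams[x] / n
        p_y = unigrams[y] / n
        scores[(x, y)] = {
            "scp": p_xy ** 2 / (p_x * p_y),
            "pmi": log2(p_xy / (p_x * p_y)),
        }
    return scores

text = "San Francisco jobs in San Francisco and IT jobs in San Francisco"
scores = bigram_scores(text.split())
ranked = sorted(scores, key=lambda pair: scores[pair]["scp"], reverse=True)
```

Pairs whose words almost always appear together, such as ("San", "Francisco") here, get an SCP close to 1 and can then be merged into a single keyword. On a corpus this small the scores are only illustrative.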
There also exist some methods to deal with Named entities:
So, there exist many methods to recognize entities successfully enough. There are also many implementations (mostly for English) of statistics/corpus-based NERs, like Stanford NER.
In addition, of course, we need some heuristics to improve the quality of our entity detector, such as a case-sensitive approach to recognizing collocations, so as not to confuse the IT in “IT Jobs” with “it”.
Question 4. Machine learning. How would you update the clusters every 6-month, to reflect potential new trends in the data, without re-running the whole algorithm? Is 6 month a good target? What would be the ideal updating schedule? Daily? Monthly? Yearly?
Answer. Having read several articles about current and future trends in Data Science, we assume that every year we may have about 5-6 new trends. Let us further assume that 2-3 of them will contain new terms and, possibly, topics. Thus, we had better update our model 2-3 times per year (every 4 to 6 months) to keep up to date with new technologies and words.
For the current experimental prototype, it is impossible to update clusters without rerunning the whole algorithm, because we need to reconstruct vector space using new TF-IDF weights for existing words and new words as well in order to detect the presence of new topics. Therefore, we need to conduct both topic modeling and clustering.
However, we may try to use RRI instead of LDA, which may allow us just to add new words to the existing model.
Possible improvements for the model and the method
We have created a prototype to predict the number of topics, the most probable number of page views, likes, comments, and estimate possible lifetime based on the topic of the article with the precision of 73-75%. The model also can decide whether the article is successful or not based on current parameters’ values.
We found out that text length and author’s name significantly affect general article successfulness; the title may affect the number of article views. We also think that not only number of comments is significant for the article successfulness, but the distribution of comments in time as well. These findings will be used in our further research.
Unfortunately, the current experimental model cannot predict the success of an article based only on its text and title. Furthermore, the model is currently unable to estimate and predict topic trends.
We plan to continue working on the problem and conducting experiments in order to create a good, scalable solution for estimating the successfulness of articles and predicting topic trends. We will publish the results of further research on the blog.
We also plan to publish here some articles dedicated to NLP algorithms for data analysis and other interesting things.
Here you can find the topics ranked by predicted mean/median successfulness in descending order (1 - most successful, 6 - least successful)
1. Data Science and web-analytics (median 2550 views, 3 likes, 1 comment)
2. Statistics/mathematics (median 1300 views, 2 likes, 0 comments)
3. Programming/Machine learning (median 1100 views, 1.5 likes, 0 comments)
4. Big Data analytics/Databases (median 900 views, 0.5 likes, 0 comments)
5. Other/Research (median 650 views, 0 likes, 0 comments)
6. Business/Predictive Analytics and Education (median 490 views, 0 likes, 0 comments)
What types of articles tend to be most successful?
1. Tutorials and how-to-do-something-yourself articles tend to be successful
2. Any posts written by famous Data Scientists (especially by Vincent Granville) tend to be successful
3. Lists of interesting web-resources/books/etc tend to be partially successful (views and likes above average)
4. Long articles tend to be relatively more successful than short ones.
Top-10 key words for the topics:
Data Science and web-analytics:
Statistics and mathematics:
Programming and Machine learning
Big Data analytics/Databases
Business/Predictive Analytics and Education