Introduction and Purpose
Here I put together a number of new data science techniques to solve a real-life problem: identifying good articles to write and publish (or to harvest and re-post) on a website, and re-tweeting them with the optimum frequency, given a specific audience. The focus is on scoring articles based on selected features (keywords in the subject line, author, channel, and many more), feature selection, data generation and harvesting, automatically categorizing and tagging articles using indexation algorithms, and predicting the lifetime value and total pageviews of an article. We also discuss bucketization, and use both hidden decision trees and jackknife regression for scoring articles.
The methodology applies to many contexts, not just digital publishing: it fits any situation in which large amounts of unstructured text data must be processed (categorized and scored using natural language processing methods). In the context of digital publishing, the entire system described here can be viewed as IoT (Internet of Things) for the media industry: automated selection and distribution of content via Hootsuite, to optimize pre-specified goals. This involves automated machine-to-machine communication via APIs (Google Analytics, Twitter, Hootsuite) to deliver optimum content at the right velocity. Journalists and editors are being replaced by software.
For the data scientist reader, this article brings together many of the techniques that I recently developed, and represents an interesting case study. It is my final contribution to my upcoming book Data Science 2.0, though I will write one more book later, Data Science with Excel. It is also a follow-up to my article Data Scientist Shares his Growth Hacking Secrets; reading that article will help you understand the context.
This article is divided into multiple parts. Some are completed already; some will be written in the coming weeks. From a technical viewpoint, we used Python, R, Excel, web crawling, and APIs, as well as modern machine learning.
Figure 1: Impact of new algorithm on traffic statistics (read the conclusion section for details)
Part 1: Data Gathering
The data available consists of pageview counts for thousands of active, live URLs on our websites. The data is accessible from Google Analytics. By crawling our websites using a Python script, we were able to add a creation date as well as a title to each URL. Note that a URL represents a web page: an article, book, event, announcement, or, generally speaking, any piece of content on our websites. One web page can have multiple URLs (a mobile and a desktop version, or a version with a specific query string attached to it, to identify the traffic source). Deduping URLs, to get one URL per page, is an easy process. From these basic, raw metrics, many useful compound metrics are derived: see Parts 3, 4, and 5. The basic methodology for data harvesting (with source code) is found here. A recent version of the base data set is found here (CSV file). A more detailed version of the data set, with not just titles but also full content (except images) for each article, is available in the members-only section.
We are considering using an alternate source of data, to increase accuracy. Instead of using Google Analytics, we could use the weekly digests to extract, for each article, the URL, title, and creation date (approximately equal to the time stamp of the associated weekly digest). By crawling each URL, we could also retrieve the pageview count, since this metric is published on each web page, on the website itself (it is a public metric).
Crawling is performed once a month, incrementally: each month, we look at URLs found in Google Analytics but not yet present in our database of previously crawled URLs, and add them to the database with the relevant metrics for each new URL. A full, comprehensive crawl can be performed slowly in the background, or every six months, to detect URLs that no longer exist and to improve the accuracy of both pageview counts and article lifetime values.
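The monthly incremental step described above can be sketched as follows. Names such as incremental_update and the placeholder metrics dictionary are illustrative, not part of the actual implementation:

```python
# Monthly incremental update: add URLs reported by Google Analytics
# that are not yet present in the local database of crawled URLs.
# Function and variable names here are illustrative placeholders.

def incremental_update(ga_urls, crawled_db):
    """Return the list of new URLs, and register them in the database.

    ga_urls    -- iterable of URLs reported by Google Analytics this month
    crawled_db -- dict mapping already-crawled URL -> metrics
    """
    new_urls = [u for u in ga_urls if u not in crawled_db]
    for url in new_urls:
        # Placeholder metrics; a real crawler would fetch the title,
        # creation date, and pageview count for each new URL here.
        crawled_db[url] = {"title": None, "created": None, "pageviews": 0}
    return new_urls

db = {"https://example.com/a": {"title": "A", "created": "2015-01-01", "pageviews": 500}}
monthly_ga_urls = ["https://example.com/a", "https://example.com/b"]
print(incremental_update(monthly_ga_urls, db))  # only the new URL is returned
```

A full recrawl would simply iterate over every URL already in the database, flagging dead pages and refreshing pageview counts.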
Part 2: Indexation Algorithm
This analysis still needs to be completed. We will write a separate article in the coming weeks to complete Part 2, and we will provide all the data and results.
The purpose is to categorize articles using a very simple and scalable technique known as indexation or automated tagging: attaching a pre-specified category or seed keyword (Python, R, Big Data, Hadoop, Excel, NoSQL, Visualization, Clustering, Machine Learning, etc.) to each article. The category is a compound metric: it is derived from the raw metrics, essentially from the keywords found in the title of each article, or from tokens or keywords found in the URL. Thus the first step consists of producing a listing of all keywords found in all URLs and titles, filtering out keywords that don't make sense or have low frequency (e.g. "mining data"), keeping the popular ones (e.g. "data mining"), and ranking them by popularity. This exercise is also important for identifying the seed keywords.
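The tagging step itself can be sketched in a few lines. The seed keyword list below is a small illustrative subset, and the keyword-to-category mapping is an assumption made for the example:

```python
# Minimal sketch of the indexation / automated tagging step: attach a
# pre-specified seed category to an article based on keywords found in
# its title (and, optionally, in its URL). The seed list is illustrative.

SEED_KEYWORDS = {
    "machine learning": "Machine Learning",
    "data mining": "Big Data",
    "python": "Python",
    "hadoop": "Hadoop",
}

def assign_category(title, url=""):
    """Return the first seed category whose keyword appears in the text."""
    text = (title + " " + url).lower()
    for keyword, category in SEED_KEYWORDS.items():
        if keyword in text:
            return category
    return "Uncategorized"

print(assign_category("10 Machine Learning Algorithms Explained"))
```

A production version would resolve conflicts when several seed keywords match, for instance by keyword popularity.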
An alternative to the indexation algorithm is to use the Google index: perform an internal Google search (illustrated, when you click on the link, for the keyword Machine Learning) for pre-specified keywords. These pre-specified keywords are manually produced seed keywords derived from the big keyword list, as described in the previous paragraph. Then check (automatically, with a Python script) which articles are returned for each keyword and in which order, and assign the keyword (category) in question to the article in question, somehow taking order into account.
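One way to "take order into account", sketched below under the assumption that we weight each search result by the inverse of its rank; the exact weighting scheme used in practice could differ:

```python
# Hypothetical sketch: assign a category to each article from internal
# search results, weighting each appearance by 1/rank so that
# higher-ranked results count more. The weighting is an assumption.

def categorize_from_search(search_results):
    """search_results: dict mapping seed keyword -> ordered list of URLs."""
    weights = {}  # url -> {keyword: weight}
    for keyword, urls in search_results.items():
        for rank, url in enumerate(urls, start=1):
            weights.setdefault(url, {})[keyword] = 1.0 / rank
    # For each article, pick the keyword with the highest weight.
    return {url: max(kw, key=kw.get) for url, kw in weights.items()}

results = {
    "Machine Learning": ["u1", "u2"],
    "Python": ["u2", "u1"],
}
print(categorize_from_search(results))
```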
This category metric is expected to have significant value for identifying the best articles and for predicting the lifetime value of an article. Note that the category is a better metric than the tags entered by authors for each article: user-generated tags can be erroneous or missing, as they are unstructured data. The category generated by our algorithm is a well-structured piece of information, by design.
Note: To provide an analogy with chemistry, a compound metric is similar to a molecule or chemical compound, while a raw metric is similar to an atom. A molecule is made of several, usually different, atoms bonded in various ways. It can be simple like water, or far more complicated like plastics. Likewise, a compound metric is a function or combination of raw metrics found in the raw dataset. It can be a simple ratio of two raw metrics (pageviews per day) or something more complicated (the category assigned to an article).
Part 3: Survival Models
The purpose here is to produce normalized values for pageview counts: an old article obviously has more pageviews than one published yesterday, regardless of quality. Just like in Part 2, the normalized pageview count is a compound metric, this time derived from the observed pageview count and the creation date for each article. Survival models can be used in this context, but we found a simple solution, described in detail in section 4 of our article on page decay. In that article, the normalized pageview count is called a score; the metric is indeed used to score articles. This problem is considered solved, and the solution (at least a decent approximation) is
Score(article) = pageviews / (time elapsed since creation - some offset).
In short, we've found that popular articles don't experience observable decay, because traffic growth over time compensates for the natural decay: the decay is hidden by general traffic growth. For all practical purposes, it is as if there were no decay. Should a decay become noticeable in the future (or, to the contrary, should growth more than compensate for the decay), our model can be adjusted accordingly.
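The score formula above translates directly into code. The offset value and the guard against non-positive denominators below are illustrative assumptions, not values from the actual implementation:

```python
# Sketch of the normalized pageview score from Part 3:
# Score(article) = pageviews / (time elapsed since creation - offset).
# The offset value (5 days) and the floor of 1 day are assumptions.

def article_score(pageviews, days_since_creation, offset=5.0):
    """Pageviews per day, adjusted by an offset; the floor avoids a
    zero or negative denominator for very recent articles."""
    return pageviews / max(days_since_creation - offset, 1.0)

# An older article with more raw pageviews can still score lower than
# a younger article that accumulates views faster.
print(article_score(9000, 900))  # roughly 10 views/day
print(article_score(3000, 105))  # 30.0
```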
Part 4: Metric Selection
We try to predict two things: the popularity or score of an article (lifetime value, decay, pageviews), as described in Part 3, and which articles to re-tweet over time, and with what frequency. The latter is discussed in detail in this article. Modeling is further discussed in Parts 5 and 6.
Here we focus on the ingredients (the metrics) rather than the recipe, to predict the popularity of an article. Metric selection is a key step in any data science project; doing it wrong leads to bad results. In this particular case, many of the useful metrics are compound metrics. Domain knowledge is useful for creating a great list of metrics, from which the best ones are eventually selected.
One critical step is to detect and filter out time-sensitive articles: an event taking place in 2012 should not be re-tweeted in 2016. This is easily achieved based on the URL (which contains keywords such as event, competition, press release, or weekly digest). In what follows, we assume that these time-sensitive articles have been eliminated from our database.
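This URL-based filter is a one-liner in practice. The exact token spellings (hyphenated, as they typically appear in URL slugs) are an assumption for the example:

```python
# Filter out time-sensitive articles based on keywords in the URL,
# as described above. The token list mirrors the examples in the text;
# the hyphenated spellings are assumed URL-slug forms.

TIME_SENSITIVE_TOKENS = ("event", "competition", "press-release", "weekly-digest")

def is_time_sensitive(url):
    u = url.lower()
    return any(token in u for token in TIME_SENSITIVE_TOKENS)

urls = [
    "https://example.com/profiles/blogs/top-data-science-event-2012",
    "https://example.com/profiles/blogs/10-regression-techniques",
]
evergreen = [u for u in urls if not is_time_sensitive(u)]
print(evergreen)  # only the regression article survives the filter
```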
The first step consists of creating a data dictionary: a list of all keywords, with a frequency count and source (title of article, URL path, URL domain, body of article) for each keyword, after eliminating stop words ("the", "what", "when", etc.). Look at all 1-token, 2-token, and 3-token keywords, and eliminate those with low frequency. This article sheds some light on how to handle keywords, an NLP (natural language processing) technique. A list of top keywords will be published shortly.
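The dictionary-building step can be sketched as follows. The stop word list and the frequency threshold are small illustrative choices, and for brevity the sketch counts keywords from titles only, ignoring the source field:

```python
# Build the keyword data dictionary: count 1-, 2-, and 3-token keywords
# across titles after removing stop words, then keep only keywords above
# a frequency threshold, ranked by popularity. Stop word list is partial.

from collections import Counter
import re

STOP_WORDS = {"the", "what", "when", "a", "of", "and", "for", "in", "to", "with"}

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_dictionary(titles, min_count=2):
    counts = Counter()
    for title in titles:
        tokens = [t for t in re.findall(r"[a-z0-9]+", title.lower())
                  if t not in STOP_WORDS]
        for n in (1, 2, 3):
            counts.update(ngrams(tokens, n))
    # Drop low-frequency keywords and rank the rest by popularity.
    return [(kw, c) for kw, c in counts.most_common() if c >= min_count]

titles = ["Data Mining with Python", "Python for Data Mining", "Data Mining Techniques"]
print(build_dictionary(titles))
```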
Then, we need to identify candidate metrics:
Potential metrics associated with an article
Not all of these metrics were gathered in the actual implementation of this project; they are listed for illustration purposes.
Part 5: Jackknife
This analysis still needs to be completed. We will write a separate article in the coming weeks to complete Part 5, and we will provide all the data and results.
We use an approximate but robust regression technique known as Jackknife regression as the auxiliary scoring system, to be blended with pseudo decision trees (see Part 6), to predict the popularity score for each article. Given that the scores have a Zipf distribution, we will instead predict the logarithm of the score: we will apply a log transformation to the raw score defined in Part 3 before performing the regression.
Note: for each article, the click rate (the metric that we indirectly try to predict here) is determined mostly by the subject line, and by metrics derived from the subject line (see Part 4); shares and likes are determined by the actual content (the body of the article).
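The full Jackknife regression technique will be covered in the forthcoming article. As a sketch of the log-transformation step only, the code below fits a one-variable slope and averages it over leave-one-out subsamples, a generic jackknife-style procedure, not the author's exact method; the data and feature are fabricated for illustration:

```python
# Illustration of the log transformation applied to Zipf-distributed raw
# scores before regression, combined with a generic leave-one-out
# (jackknife) average of one-variable regression slopes. This is a
# sketch only, not the full Jackknife regression technique.

import math

def fit_slope(xs, ys):
    """Ordinary least-squares slope for a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

def jackknife_slope(xs, ys):
    """Average the fitted slope over all leave-one-out subsamples."""
    n = len(xs)
    slopes = [fit_slope(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
              for i in range(n)]
    return sum(slopes) / n

# Raw scores follow a heavy-tailed (Zipf-like) distribution, so we
# regress on their logarithm instead.
raw_scores = [1000.0, 500.0, 333.0, 250.0, 200.0]
feature = [5.0, 4.0, 3.0, 2.0, 1.0]  # hypothetical article metric
log_scores = [math.log(s) for s in raw_scores]
print(jackknife_slope(feature, log_scores))
```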
Part 6: Hidden Decision Trees - Putting it Together
This analysis still needs to be completed. We will write a separate article in the coming weeks to complete Part 6, and we will provide all the data and results.
The final part consists of blending the Jackknife regression with a simple, scalable, robust decision tree classifier to produce popularity scores for all articles, based on the metrics identified in the previous steps. Once this part is completed, we will also discuss bucketization: how to smooth regression parameters across multiple regressions performed on multiple buckets, as well as data bucket aggregation.
Finally, fine-tuning the algorithm, to provide more accurate predictive scores, is performed via cross-validation. Performance can be measured using robust, outlier-insensitive metrics such as predictive power or L1 rank correlation between predicted and observed values. Confidence intervals for scores are easily computed, even though no actual statistical model has been used.
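One plausible implementation of an L1 rank correlation, mentioned above as a robust performance metric, is sketched below. The normalization (mapping the sum of absolute rank differences to the interval [-1, 1]) is an assumption; the exact definition used in the project may differ:

```python
# A plausible L1 rank correlation: compare the rank vectors of predicted
# and observed values via absolute rank differences, normalized so that
# identical rankings give 1 and fully reversed rankings give -1.
# The normalization scheme is an assumption. Assumes distinct values.

def ranks(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def l1_rank_correlation(predicted, observed):
    n = len(predicted)
    rp, ro = ranks(predicted), ranks(observed)
    d = sum(abs(a - b) for a, b in zip(rp, ro))
    # Maximum possible sum of absolute rank differences (reversed order):
    # n^2 / 2 for even n, (n^2 - 1) / 2 for odd n, i.e. n * n // 2.
    max_d = n * n // 2
    return 1.0 - 2.0 * d / max_d

print(l1_rank_correlation([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0
print(l1_rank_correlation([1, 2, 3, 4], [40, 30, 20, 10]))  # -1.0
```

Unlike the usual (L2-based) Spearman correlation, the L1 version penalizes each rank swap linearly, which makes it less sensitive to a few badly misranked outliers.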
As you can see in Figure 1, starting around early March 2015, when we deployed our system in full force, growth suddenly and significantly accelerated. It is as if there were one regression line for data prior to March 2015, and another one with a steeper slope thereafter. In statistical terminology, the date when the change occurred is called a change point. Note that the accelerated growth is also due to external factors, such as publishing more high-quality articles, but these external factors are somewhat connected with this data science project, and are a by-product of the insights derived from it. Indeed, our competitors did not experience the same accelerated growth. The numbers in Figure 1 are for one of our channels, and do not represent our entire traffic.
To summarize, this data science project is a success story: it helps us, on a monthly basis, identify the new articles worth publishing and the old ones worth re-tweeting, automatically via @Analyticbridge (100,000 followers), and with the right frequency. Accurately measuring the yield (that is, ROI) is not easy, but the results are clearly spectacular and can be attributed to this project, as evidenced by traffic statistics: the number of additional, high-quality users coming from this traffic source, measured by Google Analytics and elsewhere in various ways (pageviews from Twitter or Hootsuite, a sustainable spike in monthly new members, pageviews per article, etc.).
Still, we continue to also write less popular, more specialized articles, of interest only to a specific audience, or dealing with a particular technique or industry: content variance and originality are key components of success. We could perform some A/B tests to find the optimum balance between core, popular content and specific, niche articles.