Data Scientist Shares his Growth Hacking Secrets

In this article, we discuss the strategies we use to generate exponential traffic growth while preserving traffic quality and user loyalty. Our growth hacking engine is a combination of:

  • Raw data science: acquiring the right data sets and leveraging them
  • Playing with various tools and APIs: designing an automated machine-to-machine communication service between Hootsuite and Twitter / LinkedIn, based on insights automatically distilled from two data sources: (1) data obtained via the Google Analytics API (traffic statistics about 50,000 live DSC articles), and (2) data collected via a web crawler written in Python
  • A blend of high-level (strategic) data science and low-level (tactical or operational) data science. In the end, relatively little coding is involved in the process; domain expertise and smart innovation play a critical role.
  • Optimizing the parameters of the statistical process used to select articles, create tweets, and schedule them, using experimental design and A/B testing
  • Artificial intelligence: detecting and removing time-sensitive articles, automatically creating relevant hashtags for selected tweets, and building a taxonomy of all our articles using a simple indexation-based classification scheme
  • Smart analytics-driven advertising on Twitter, using a good list of data science thought leaders worth following as our core data set for advertising purposes. The creation of this list is an interesting data science project in itself.
  • Smart, analytical, ROI-driven advertising on Google, as well as LinkedIn hacks, to acquire new members

The results are best illustrated in the graph below, representing @AnalyticBridge, one of our profiles and the largest data science profile on Twitter, as well as in this article.

1. Growth Hacking: Part I

Here we describe a strategy that consists of tweeting your top articles over a long period of time to generate incremental traffic. After testing it for one week, we experienced a 10% growth in traffic. This strategy works well for attracting new users, and we believe that it can triple your traffic when fully optimized, though it might reduce user engagement. To get new and loyal subscribers, another strategy is needed: read section 2. This works in fast-growth environments, though you can fine-tune the parameters if applying it to no-growth websites.


Our DSC network has more than 50,000 live articles at any time, and growing by more than 2,000 new articles per year. Our intern Livan analyzed our Google Analytics statistics, and found more than 2,000 articles each with more than 150 page views - and some with more than 100,000 page views. As we have a Twitter account with 60,000 followers (growing by 5,000 new followers per month at the current growth rate), and a LinkedIn group with 160,000 members (growing by 6,000 new members per month), we asked ourselves the following question:

  • What if we tweet 25 articles each day, from our list of top 2,000 articles, updated monthly?

The answer, from our first tests, is an immediate 10% traffic boost. We could tweet 100 articles per day from that same list, not just 25. We could tweet from multiple accounts, not just @AnalyticBridge, and we could also post on LinkedIn or Google+. With Hootsuite, this process can be fully automated. What would be the impact? Of course there is an optimum: too much tweeting will create dilution. But given the large number of new followers each day, and the fact that the top 2,000 articles could be replaced by entirely new articles after one year (because we produce new articles every day, and we are in the process of automating some postings, such as new books or new salary surveys), it is clear that 25 tweets a day is well below the optimum. And indeed, we have 50,000+ live articles, so we could tap into the whole list, not just the top 2,000.

Optimizing this tweeting process is discussed later in this article. Note that the way tweets work, it is OK if a user sees the same tweet 2 or 3 times over a one-year period, as long as, on average, each of our tweets reaches him only once or twice. And given that tweets are short-lived, even with 100 tweets per day (out of a list of 2,000 tweets updated monthly), randomly selected according to a mechanism that slightly favors new, very old, or popular time-insensitive tweets, we should be fine, as long as we proceed carefully and incrementally, with constant adaptation to new web traffic conditions whenever they occur.

Very old, time-insensitive articles with few (say 150) page views are worth tweeting again today because our traffic grew by 500 percent over the last several years, thanks to the techniques described here; most of our new visitors have never seen those old articles. This concept is best explained in our article about the lifecycle of blog posts, which discusses traffic decay and how to increase the lifetime and yield of old blog posts. For instance, by listing top articles in a footer of each new article, as we do at the bottom of this very article - a footer that can be updated at once across thousands of articles, when needed, using a server-side (SHTML) include or an iframe that loads the adaptive footer from a single web location. More on this soon.


The process consists of five steps:

  • Step 1: Producing/updating each month a list of top DSC articles based on our Google Analytics data, including, for each article, the total number of page views. Currently, we focus on articles with 150+ page views. To extract much bigger lists, we would need to use the Google Analytics API for data extraction.
  • Step 2: Scraping DSC (using a home-made web scraper written in Python) to identify, in the list created in Step 1, the articles that are still live (not deleted), and for each live article, its creation date, channel (AnalyticBridge.com, DataScienceCentral.com, BigDataNews.com, DataViZualization.com, Hadoop360.com), and title
  • Step 3: Data cleaning: removing time-sensitive articles, adding hashtags to titles
  • Step 4: Statistical modeling: creating a score for each article, based on page views, creation date, and a random number (see details below)
  • Step 5: Refreshing the scores daily with newly simulated random numbers. Each day, we select the top 25 articles based on score and add them to Hootsuite, scheduling the 25 tweets over a 4-hour window corresponding to our peak in US traffic. Hootsuite automatically generates the shortened URLs to be added to the tweets.
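The daily selection step can be sketched in a few lines of Python. This is a minimal illustration, not our production code: the dictionary fields and the toy score function are assumptions for the example.

```python
def select_daily_tweets(articles, score_fn, n=25):
    """Step 5 sketch: score every article with a freshly randomized
    score function, then keep the n highest-scoring articles for
    today's tweet schedule."""
    scored = [(score_fn(article), article) for article in articles]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [article for _, article in scored[:n]]

# Example with a toy score function (page views only):
articles = [{"title": f"Post {i}", "page_views": 150 + i} for i in range(100)]
top = select_daily_tweets(articles, lambda a: a["page_views"])
```

In production, `score_fn` would be the randomized scoring formula described below, re-drawn every day so that the daily top 25 changes even when the underlying list does not.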

The score can be used to slightly favor (over-tweet) articles that are more recent, or popular. But it is random enough that any article has some chance to eventually be tweeted one day. The score reflects the fact that not all articles are created equal.

The final implementation will consist of a fully automated machine-to-machine communication service (between Google Analytics, Hootsuite, and Twitter), powered by robust black-box analytics, automated machine learning (hash tag creation, detection of time-sensitive articles) and automated, adaptive statistical scoring. 

The number of tweets can be adjusted slightly each day (increased or decreased, or the scoring parameters changed) in response to performance. Performance is measured in terms of daily clicks arising from this activity (the stats are readily available from Hootsuite analytics) and the resulting average session duration for traffic coming from Twitter (available from Google Analytics).

Details about the scoring algorithm

This algorithm is used to score articles based on page views (denoted as P), creation date (denoted as T for time), and a random number denoted as R (uniform deviate on [0, 1]). Note that older articles tend to have more page views, so P and T are not independent. The score S is computed as follows:

S = (b + R) * P^a / (T - Offset)^c

The parameters a, b, and c are chosen so that the top 25 articles selected each day (for tweeting) have, on average, a median P (historical page view count) about twice as high as the median P computed across all 2,000 articles. This way, we slightly favor popular articles, but not too much. Details are in the spreadsheet described below. Offset is chosen so that T = Offset for our oldest article. You must use the median for P, not the average, because P follows a Zipf distribution. Note that page view decay occurs, especially for less popular articles, though in our case decay is masked by growth for popular articles.
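As a sketch, the scoring formula translates directly into Python. The default parameter values below are placeholders, not our calibrated values (those live in the spreadsheet):

```python
import random

def score(page_views, created, a=0.5, b=0.2, c=0.5, offset=0.0):
    """Randomized score S = (b + R) * P^a / (T - Offset)^c, where R is
    a uniform deviate on [0, 1]. A larger b makes the score less random;
    a and c control how strongly popularity and age affect the score.
    Requires created > offset (i.e., articles newer than the offset date)."""
    r = random.random()  # re-drawn every day in the daily selection
    return (b + r) * page_views ** a / (created - offset) ** c
```

Calibrating a, b, and c amounts to adjusting them until the median P of the daily top 25 is roughly twice the median P of all 2,000 articles, as described above.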

Data Sets, Excel spreadsheet 

You can download our Excel spreadsheet with 2,000 articles, featuring the following fields, for each article:

  • Title
  • URL
  • Creation Date
  • Page View Count
  • Channel
  • Randomized Score (column I)

The parameters a, b, c are in cells J2, J3, and K2 respectively. A low value for J3 will produce more random scores. Cross correlations are displayed in cells L1:O4, and the median score for top 25 articles, and for all 2,000 articles, are displayed in cells M8 and M7 respectively.

Note that the cross-correlations are not very useful: even when correlation(P, S) is as low as 0.04, the median P for the top 25 articles (those with the highest S) is twice the median P computed on all articles. Traditional correlation is a poor indicator in this context, as it is sensitive to the numerous outliers in the P values, caused by the fact that P has a Zipf rather than Gaussian distribution.
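A quick, purely illustrative simulation shows why median-based summaries are the robust choice for Zipf-distributed page views (the synthetic data below is an assumption, not our actual traffic):

```python
# Zipf-like page views: the article at rank r gets roughly C / r views.
views = [round(300_000 / rank) for rank in range(1, 2001)]

mean = sum(views) / len(views)
median = sorted(views)[len(views) // 2]
# A handful of huge values at the top ranks drags the mean far above
# the median; mean-based (and correlation-based) summaries are
# dominated by those outliers, while the median stays representative.
```

This is the same reason the parameter calibration above targets the median P of the top 25, rather than a correlation or an average.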

You can also download a full data set (for members only) that contains the full text (not just the title), for each article. It is used for clustering articles (see section 3).

Python Source Code

Our intern Livan wrote some Python code to process Google Analytics reports and scrape DSC articles to extract the relevant fields (creation date, channel, and title). Download the Python code (rename the text file with a .py extension after downloading).

Next Steps

We can make this system more powerful by

  • Automatically removing time-sensitive articles, by detecting tokens in the URL such as event, conference, or weekly-digest
  • Deploying the system not just on Twitter, but also on our large LinkedIn group (160,000 members) or on multiple Twitter accounts
  • Deduping duplicate URLs (sharing the same path but different query strings)
  • Using the top 50,000 articles rather than the top 2,000
  • Automating some of the content production (new book announcements and salary surveys are easy to produce automatically), to boost our number of tweetable articles
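The time-sensitivity filter and the URL deduping are straightforward to sketch with the Python standard library. The token list comes from the article; the function names are ours:

```python
from urllib.parse import urlparse

TIME_SENSITIVE_TOKENS = ("event", "conference", "weekly-digest")

def is_time_sensitive(url):
    """Flag articles whose URL path contains a time-sensitive token."""
    path = urlparse(url).path.lower()
    return any(token in path for token in TIME_SENSITIVE_TOKENS)

def dedupe_urls(urls):
    """Keep one URL per path, dropping duplicates that differ only
    in their query string or fragment."""
    seen, unique = set(), []
    for url in urls:
        key = urlparse(url)._replace(query="", fragment="").geturl()
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

Flagged articles would be dropped in Step 3 (data cleaning), before scoring.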

2. Growth Hacking: Part II

This section quickly describes the other fundamental component required to make the system described in section 1 work: the creation and growth of at least one massive Twitter account, with highly relevant, high-value followers, and the use of automated tweeting systems. There is a feedback loop: having a lot of valuable content to tweet helps generate a large volume of good traffic to your website and boosts your Twitter growth, which in turn further fuels traffic growth for your website.

Here, a significant part of our growth (150 new Twitter followers per day) is generated via Twitter advertising: we spend a little more on Twitter than on Google AdWords. With Twitter, it is possible to target US-based profiles (and their followers) that are similar to pre-selected profiles, and you can upload a list of pre-selected profiles when starting your advertising campaigns. Our list has hundreds if not thousands of pre-selected data science profiles. Such lists are easy to find, and regularly published on various websites. But ours also includes top profiles - indeed the very largest, most relevant ones - that are missing in the traditional published lists, as well as people who re-tweet or like our tweets.

The growth and volume of our two main Twitter profiles, @analyticbridge and @datasciencectrl, is displayed in the figure below. The figure is a few months old: our number of followers has since more than doubled, and we are well above @hmason in terms of followers.

The strategy described in section 1 delivers more than 1,000 extra clicks per day to our network, at the current low levels (25 tweets per day).

We also use LinkedIn and Google AdWords, but for a different goal: generating new members, US-based in the case of AdWords. But we have encountered a number of issues with AdWords (low conversion), so we have reduced our budget, optimized our AdWords strategies (adding negative keywords and conversion tracking; more on this coming soon), and shifted money to Twitter and to acquiring high-quality content. Read our article on 360-degree data science to understand how we blend domain expertise, business hacks, machine learning, engineering, and modern statistical science to efficiently solve business problems in general - and in particular, to discover how we optimize our bidding strategies for Google keywords (how much to pay for a keyword).

3. Growth Hacking: Part III

Another part of our growth hacking strategy consists of creating new channels.

One of the challenges is to populate these channels with new content. While we use syndicated feeds for this purpose, we also want to add our own content. One way to do so is to perform a clustering of all our articles, and assign them a category: visualization, data plumbing, big data, Hadoop and so on. Once the articles are categorized, we can publish (re-post) some popular articles from DSC on the appropriate sub-channels. Our intern Livan is actually working on this, adding a category field to the list of 2,000 top DSC articles.

Here we describe a very simple and highly scalable NLP (natural language processing) technique, called indexation, to perform this clustering task. It works as follows.

Algorithm: categorizing / clustering articles

  • Step 1: Create a data dictionary of all one-token and two-token keywords found in all articles (both in the title and in the body). This assumes that you have crawled all your articles to extract the full text.
  • Step 2: Filter and clean the results. Ignore keywords with fewer than 5 occurrences. For each keyword, check all of its n-gram permutations (e.g., data science vs. science data) and eliminate the permutations with low frequency.
  • Step 3: Look at the top 300 entries, called seed keywords, and manually assign them to top categories. For instance, the top category data plumbing will have the following seed keywords: data engineer, data architect, data warehouse, Hadoop, Spark, data lakes, IoT, and many more. Don't forget to have a top category called Unknown.
  • Step 4: Based on the keywords found in the title and body of an article, assign the article to the top category that has the biggest overlap with it, in terms of seed keywords. Keywords found in the title can be assigned a higher weight than those found in the body. Likewise, a different weight can be attached to each seed keyword in each top category.
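The assignment step can be sketched as follows. The seed keywords, the title weight, and the function signature are illustrative assumptions rather than our exact implementation:

```python
def categorize(title, body, seeds, title_weight=2):
    """Step 4 sketch: assign the article to the top category whose seed
    keywords have the biggest weighted overlap with the article.
    `seeds` maps each top category to a set of seed keywords; keywords
    found in the title count `title_weight` times. Falls back to the
    'Unknown' category when nothing matches."""
    best, best_score = "Unknown", 0
    title, body = title.lower(), body.lower()
    for category, keywords in seeds.items():
        overlap = sum(
            title_weight * (kw in title) + (kw in body)
            for kw in keywords
        )
        if overlap > best_score:
            best, best_score = category, overlap
    return best

seeds = {
    "data plumbing": {"data engineer", "data warehouse", "hadoop", "spark"},
    "visualization": {"chart", "dashboard", "d3"},
}
category = categorize("Hadoop for data engineers",
                      "A tour of data warehouse design...", seeds)
```

Per-keyword weights (the refinement mentioned in Step 4) would replace the implicit weight of 1 per matched seed keyword.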

I call this technique indexation because it is very similar to the creation of a search engine; another word that could be used is tagging algorithm. We also have used and described this technique in the context of clustering thousands of data science websites (source code provided).

Instead of using this algorithm, you can simply install a customized Google search on your website, then search for data plumbing to find the articles that are a good fit for the data plumbing category or channel. We've actually implemented this on DSC.

Potential improvement

Also add 3-token keywords to your dictionary. A 3-token keyword has 3! = 6 possible n-gram permutations. Usually, only one or two of the 6 will show up in the articles for any keyword (data science central will show up, but central science data won't).
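Generating those permutations is a one-liner with the standard library (the function name is ours):

```python
from itertools import permutations

def ngram_variants(keyword):
    """All orderings of a keyword's tokens: a 3-token keyword yields
    3! = 6 variants, of which usually only one or two actually occur
    in real text."""
    tokens = keyword.split()
    return {" ".join(p) for p in permutations(tokens)}

variants = ngram_variants("data science central")
```

In Step 2, you would count the frequency of each variant in the corpus and keep only the frequent ones.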

4. Conclusions

This DSC growth engine illustrates that data science is not just about programming. Indeed, here, programming is a small part of the project compared with designing algorithms that make APIs communicate efficiently with each other, based on data automatically gathered, with insights automatically extracted and automatically leveraged. It also shows the limitations of traditional statistical science, with correlations (see the sub-section about the scoring engine) that are useless here and must be replaced by more robust indicators.

It certainly shows that there are different types of data scientists, and that indeed, data science is greater than the sum of its parts. It also shows how business and domain expertise are critical. For instance, if you don't know about Twitter's advertising capabilities, or the Hootsuite product, you will never even think of doing this kind of stuff, no matter how much you know about coding and algorithms - thus missing out on a big opportunity. If you work in a bigger organisation, finding and convincing the right person to start a project like this one is of course a challenge, no matter how business-savvy you are. But my experience is that big organisations tend to hire specialists rather than people like me.

Finally, we invite you to test our list of 2,000 articles and see which tweets (that is, which articles) resonate best with your followers. It would be interesting to see whether articles with high page view counts perform better on your Twitter account (just like they do on ours). It might also be a way for you to attract more followers, by posting content that they and many others like to read.





Comment by JP Freeley on January 7, 2017 at 6:11pm

Vincent --- thanks for this incredible resource. I'm having a hard time understanding why the "Offset" in the Excel sheet is 38000 .. it doesn't seem to have a basis in the date of the oldest article as you suggest it should. Additionally, how do you select to ensure that Median PV of Top 25 is 2x Median of 2000? I guess I'm not sure how to actually, optimally, implement the excel sheet each day. Thanks again! Glad I found you.

Comment by Shashi Dhungel on August 23, 2015 at 8:42am

Great Article Vincent. I think it is time to provide another update on the growth rate.

Comment by Robert Klein on May 19, 2015 at 6:49am

Great article. Your algorithm inspires me to toss an API your way. It correlates themes in unstructured streams. Love to hear how it handles tasks similar to what you mentioned. -Robert

Comment by Vincent Granville on May 3, 2015 at 7:39am

In fact, this growth hacking system is just IoT for digital publishing, automating content and traffic production via robots (controlled by APIs).

Comment by Vincent Granville on March 24, 2015 at 8:16am

Some more thoughts... why I decided to publish my growth hacking secrets, which in my opinion are nothing more than data science secrets.

It looks simple and it works, yet nobody seems to understand how to replicate it, or even how it works. Even worse, I get comments such as "advertising does not work, thus your advertising-based approach must not work". Very interesting...

Maybe that's the reality with data science: what works well and is simple (in the eyes of the algorithm architect) is not understood by anyone, not even by so-called data scientists, because it blends engineering, business, marketing and science/art all at once. Yet some other stuff seems simple and is understood by most, but does not work or has major flaws.

It's a case where 1 > 1 + 1 + 1 (one genius being bigger than one architect plus one business guy plus one data scientist combined). Or maybe, to put it differently, a data scientist is not a data scientist without business acumen, hacking skills, and serious craftsmanship. More on this in my next blog called "The Handicapped Data Scientist".

Comment by John Irvine on February 17, 2015 at 10:56pm

Thanks for the insight. Definitely will try to use some of this for our OmniContext app as we grow in marketing content.

Comment by Jason Cohen on February 17, 2015 at 12:13pm

Great read. Thank you. I will keep this in mind as I write more content.

Comment by Milton Labanda on February 16, 2015 at 1:58am

Amazing share !!!

Comment by Vincent Granville on February 15, 2015 at 8:45pm

Our intern Livan - who closely monitors these stats via the DSC app that he designed - mentioned that our @Analyticbridge account is now growing by 270 new followers per day, rather than 150 as mentioned in the article. So the growth rate is accelerating.
