Contributed by Oamar Gianan. He enrolled in the NYC Data Science Academy 12-week full time Data Science Bootcamp program taking place between September 23, 2016 and December 23, 2016. The original article can be found here.
Have you been followed on Twitter or Instagram by someone you don't know? I get this a lot. And so to avoid being thought of as rude, I follow back. Eventually, I got tired of following back when I realized that some of these accounts don't really do anything but collect followers. Now, why would anyone go through all the trouble of following people in the hopes of being followed back? Why would anyone waste so much time on the internet for this?
I eventually realized the answer when I saw that most of these accounts were not personal. A lot of the accounts I encountered were about food, some were about beach vacations, and on some occasions they were accounts with risqué content.
Advertising has infiltrated the social network. It used to be just ads on banners, but now companies hire personalities on social media to spread the word about their product or event. Companies spend big bucks on celebrities in an effort to publicize their brand and attract a celebrity's fan base. A sponsored tweet could net as much as $13,000, as was the case for Khloé Kardashian in 2013.
Celebrities have multitudes of followers and get paid big bucks by sponsors. So people may have thought that creating accounts and amassing followers would eventually get them sponsorship deals with advertisers. In this exercise, we see that sponsors might be looking for things other than the number of followers.
In a social network, a link could represent a relationship as in Facebook or the passing of a tweet as in Twitter. These links determine the flow of information and are therefore a good indicator of a user's influence. I will be presenting two methods of finding potential influencers in a network. One would be by extracting a user's influence measures and the other is by using network graphs.
A large database was found on Followthehashtag.com. It contained a stream of tweets related to NASDAQ 100 stocks, extracted from Twitter over 79 days, from March 28 to June 15, 2016. This was selected because of its good mix of accounts representing organizations and personalities. The database also contained information about how many times a tweet was passed along and who the original tweet came from. This act, more popularly known as retweeting, can be identified in the stream as tweets having 'RT @user' or 'via @user' at the beginning of the tweet. The stream also contained information about mentions. On Twitter, a mention is a public conversation between users. A user calls the attention of another user by mentioning them in a tweet. Mentioning is identified by tweets beginning with '@user'.
The influence measures extracted from the stream were the following: indegree, retweet, and mentions. These measures were selected because of how they affect the flow of information in the network. Indegree measures the user's popularity. This was easily extracted from the database by the number of followers a user has. The number of followers shows us the size of the user's audience base. Retweet influence represents a user's ability to create content which other users find worthy of sharing. When a tweet is shared by another user, a bigger network of users is exposed to the tweet. From the stream, this was extracted by counting the number of retweeted messages for each user. The third measure, mention influence, was extracted by counting the number of mentions containing the user's name. This influence measure indicates the ability of the user to engage others in a conversation. This represents the top-of-mind value of the user's name.
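The retweet and mention counts described above can be sketched in a few lines. This is only an illustration, not the code used for the article: the regular expressions, the `influence_counts` helper, and the sample tweets are all assumptions, and the sketch credits a mention only when the tweet begins with '@user'.

```python
import re
from collections import Counter

# Assumed patterns: a retweet starts with "RT @user" or "via @user",
# a mention starts with "@user". Handles are taken as 1-15 word characters.
RETWEET_RE = re.compile(r"^(?:RT|via)\s+@(\w{1,15})", re.IGNORECASE)
MENTION_RE = re.compile(r"^@(\w{1,15})")

def influence_counts(tweets):
    """Return (retweets, mentions) Counters keyed by the credited user."""
    retweets, mentions = Counter(), Counter()
    for _author, text in tweets:
        text = text.strip()
        match = RETWEET_RE.match(text)
        if match:
            retweets[match.group(1)] += 1  # credit the original poster
            continue
        match = MENTION_RE.match(text)
        if match:
            mentions[match.group(1)] += 1  # credit the user being addressed
    return retweets, mentions

# Made-up tweets, purely for illustration.
sample = [
    ("alice", "RT @WSJ: Apple shares slide $AAPL"),
    ("bob", "via @WSJ Apple shares slide $AAPL"),
    ("carol", "@alice what do you make of $AAPL today?"),
    ("dave", "Buying more $MSFT"),
]
retweets, mentions = influence_counts(sample)
print(retweets.most_common(), mentions.most_common())
```

Indegree needs no parsing, since the follower count is already a field in the database.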
A total of 96,613 users tweeted about NASDAQ 100 stocks during the timeframe. Between them, over 680 thousand tweets were broadcast. A word cloud of the NASDAQ symbols most often mentioned shows that Apple, represented by AAPL, was the most tweeted stock among the group.
Figure 1. Stock symbol word cloud.
Users were most active on April 27, when they broadcast over 20,800 tweets. This coincides with the day AAPL stock slumped following speculation that iPhone sales may decline by as much as 60 million units compared to the same quarter a year ago. The slump in Apple shares dragged the tech-heavy NASDAQ into the red by the day's end.
Figure 2. Frequency plot of tweets.
On this day, most of the activity took place during market trading hours, which run from 13:30 to 20:30 UTC.
Figure 3. Frequency plot of 27-April-2016.
Each user's ranking in the three influence categories was assigned using fractional ranking. For example, in assigning the indegree ranking, a rank of 1 was given to the user with the largest number of followers. Users with the same number of followers receive the same rank, which is the mean of the ranks they would have received under ordinal ranking. Table 1 shows the top 30 users across the three influence measures. Notice that there is minimal overlap between the three rankings. The first user to show up across all three measures of influence was "WSJ".
Table 1. Top influentials based on indegree, retweets, and mentions
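As a minimal sketch of the fractional ranking described above, the function below gives tied users the mean of the ordinal ranks they span. The `fractional_rank` name and the follower counts are illustrative assumptions, not taken from the article's data.

```python
def fractional_rank(scores):
    """Map each user to a fractional rank, where rank 1 is the highest score.

    Tied users share the mean of the ordinal ranks they would occupy.
    """
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        # Extend j to cover every user tied with the user at position i.
        j = i
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        mean_rank = ((i + 1) + (j + 1)) / 2  # mean of ordinal ranks i+1 .. j+1
        for user, _ in ordered[i:j + 1]:
            ranks[user] = mean_rank
        i = j + 1
    return ranks

# Made-up follower counts: "b" and "c" tie for ordinal ranks 2 and 3.
followers = {"a": 900, "b": 500, "c": 500, "d": 100}
ranks = fractional_rank(followers)
print(ranks)  # {'a': 1.0, 'b': 2.5, 'c': 2.5, 'd': 4.0}
```

The same function can be applied unchanged to retweet counts and mention counts to obtain the other two rankings.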
To see how much users overlap across the three categories, a Venn diagram of the top 100 users was derived. Figure 4 shows that among the 239 users in the top list, only 10 users can be seen across all three measures of influence.
Figure 4. Venn diagram of top influentials across measures.
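The counts behind a Venn diagram like this one come down to set operations on the three top-N lists. The sketch below assumes each measure is available as a dictionary of user ranks; the `top_n` and `overlap_summary` helpers and the toy ranks are hypothetical, and ties at the cut-off are broken arbitrarily.

```python
def top_n(ranks, n):
    """Return the set of n users with the best (smallest) ranks."""
    return set(sorted(ranks, key=ranks.get)[:n])

def overlap_summary(indegree, retweet, mention, n=100):
    """Count distinct users across the three top-n lists and their overlaps."""
    a, b, c = top_n(indegree, n), top_n(retweet, n), top_n(mention, n)
    return {
        "distinct_users": len(a | b | c),
        "all_three": len(a & b & c),
        "indegree_and_retweet": len(a & b),
        "indegree_and_mention": len(a & c),
        "retweet_and_mention": len(b & c),
    }

# Toy ranks for five made-up users; only "u1" is in every top-2 list.
indegree = {"u1": 1, "u2": 2, "u3": 3, "u4": 4, "u5": 5}
retweet = {"u1": 1, "u3": 2, "u2": 3, "u5": 4, "u4": 5}
mention = {"u1": 1, "u4": 2, "u5": 3, "u2": 4, "u3": 5}
summary = overlap_summary(indegree, retweet, mention, n=2)
print(summary)
```

With `n=100` and the real rankings, `distinct_users` and `all_three` would correspond to the 239 and 10 reported for Figure 4.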
Figure 5 below shows a correlation matrix which represents how a user's rank varies across the three different measures of influence. The correlation matrix represents the strength of the association between a pair of rankings. This matrix was derived by comparing the relative influence ranks of all 96,613 users in the database.
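The article does not say which correlation coefficient was used. One common choice for comparing rankings is Spearman's rank correlation, which is the Pearson correlation computed on the ranks themselves. The sketch below takes that as an assumption; the helper names and the toy ranks are illustrative, and it expects every measure to rank the same users with at least two distinct rank values.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def rank_correlation_matrix(rank_tables):
    """rank_tables maps measure name -> {user: rank}, same users in each."""
    measures = list(rank_tables)
    users = sorted(rank_tables[measures[0]])
    columns = {m: [rank_tables[m][u] for u in users] for m in measures}
    return {(p, q): pearson(columns[p], columns[q])
            for p in measures for q in measures}

# Toy ranks: retweet agrees with indegree, mention is exactly reversed.
tables = {
    "indegree": {"a": 1, "b": 2, "c": 3, "d": 4},
    "retweet": {"a": 1, "b": 2, "c": 3, "d": 4},
    "mention": {"a": 4, "b": 3, "c": 2, "d": 1},
}
matrix = rank_correlation_matrix(tables)
for pair, value in matrix.items():
    print(pair, round(value, 3))
```

A value near 1 means two measures order users almost identically, while a value near 0 means that a user's rank on one measure says little about their rank on the other.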
The resulting network graph of this smaller Twitter stream comes up with 431 nodes and 181 edges.
## IGRAPH DNW- 431 181 --
## + attr: name (v/c), Followers (v/n), weight (e/n)
## + edges (vertex names):
##  _bagholder_ ->ppprophet 20trilliondeb ->ppprophet
##  7LadyQ ->eWhispers 7LadyQ ->OpenOutcrier
##  7LadyQ ->WrigleyTom AdaptToReality ->AdaptToReality
##  adelivania ->Benzinga AdvisorboxMedia->MorningstarInc
##  Alain_2012a ->Boursier_com alekskrug8 ->SleekMoneycom
##  AlertTrade ->AlertTrade allgringo ->ChinaInvest
##  AlphSt_Live ->Opinterest AltruistWealth ->eWhispers
##  aTGelstmM ->PersonsPlanet ATPFtrading ->gouluk1
## + ... omitted several edges
Users in this network interact comparatively more than in our initial network object, with the density clocking in at 0.0009550531. The diameter is also shorter, at just 9 hops across 10 nodes.
+ 10/431 vertices, named:
 TachyonGlobalLL StakepoolCom LMTentarelli tamaraspen2 ppprophet
 diggingplatinum WrigleyTom nixonstocks ACInvestorBlog ProbabilityOne
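To make the two statistics concrete, here is a minimal sketch of density and diameter for a directed graph. It assumes density is the number of edges divided by N × (N − 1), ignoring self-loops, and that the diameter is counted in unweighted hops along edge directions. The function names and the toy edge list are illustrative. As an aside, the reported density is numerically equal to 177 / (431 × 430), which would fit a calculation made after self-loops were dropped. That reading is an inference, not something the article states.

```python
from collections import deque

def density(n_nodes, n_edges):
    """Directed density without self-loops: E / (N * (N - 1))."""
    return n_edges / (n_nodes * (n_nodes - 1))

def diameter(edges):
    """Longest shortest path, in hops, over all reachable ordered pairs."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set())
    longest = 0
    for start in adjacency:
        # Breadth-first search gives hop distances from this start node.
        distance = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
        longest = max(longest, max(distance.values()))
    return longest

# Toy graph: the longest shortest path is a -> c -> d (2 hops, 3 nodes).
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")]
toy_diameter = diameter(toy_edges)
toy_density = density(4, len(toy_edges))
print(toy_diameter, round(toy_density, 4))
```

A path of k hops always visits k + 1 nodes, which is why a diameter of 9 hops lists 10 vertices.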
The resulting hub and authority scores are more consistent with the ranking tables because the actual number of retweets and mentions was low. This time, the number of unique edges was not significantly lower than the total weight of the edges.
Figures 7 and 8 show the network graphs with the nodes sized according to their hub and authority scores. The higher the score, the bigger the node.
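For readers unfamiliar with hub and authority scores, the sketch below runs the usual HITS power iteration on a small weighted, directed graph. It is not the igraph routine used for the figures. The `hits` function, the fixed iteration count, the scaling so that the top score equals 1, and the toy edges are all assumptions made for illustration. Edges point from the user who retweets or mentions to the user being credited, as in the edge list shown earlier.

```python
def hits(weighted_edges, iterations=100):
    """Return (hub, authority) score dicts, each scaled so the maximum is 1.

    weighted_edges maps (source, target) -> weight.
    """
    nodes = sorted({node for edge in weighted_edges for node in edge})
    hub = dict.fromkeys(nodes, 1.0)
    authority = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        # Authority: weighted sum of the hub scores pointing at the node.
        new_authority = dict.fromkeys(nodes, 0.0)
        for (src, dst), weight in weighted_edges.items():
            new_authority[dst] += weight * hub[src]
        # Hub: weighted sum of the authority scores the node points to.
        new_hub = dict.fromkeys(nodes, 0.0)
        for (src, dst), weight in weighted_edges.items():
            new_hub[src] += weight * new_authority[dst]
        top_authority = max(new_authority.values()) or 1.0
        top_hub = max(new_hub.values()) or 1.0
        authority = {n: v / top_authority for n, v in new_authority.items()}
        hub = {n: v / top_hub for n, v in new_hub.items()}
    return hub, authority

# Made-up interactions: u1 retweets "news" three times and "blog" once.
toy = {("u1", "news"): 3, ("u2", "news"): 1, ("u1", "blog"): 1}
hub, authority = hits(toy)
print(max(hub, key=hub.get), max(authority, key=authority.get))
```

In this toy graph the heavy u1 → news edge dominates both scores, which illustrates how a few repeated interactions can pull hub and authority results away from a simple count-based ranking.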
Figure 6. CA stream network graph showing the diameter path.
Figure 7. Closeup of network graph with node sizes adjusted based on hub score.
Figure 8. Closeup of network graph with node sizes adjusted based on authority score.
The fractional ranking method is found to be a more realistic measure of a Twitter user's influence. The frequency of interactions between users must be considered when measuring influence, even when those interactions occur within the same regular audience. Repeated interaction simply means that the user consistently produces high-quality content with pass-along value.
For smaller networks, the network graph method may yield additional information that can't be derived from fractional ranking. The key would be to check whether the ratio of the number of edges to the total edge weight is close to 1. The discrepancy between the ranking method and the network graph is expected to be greater when this ratio approaches zero.
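The check suggested above can be sketched as follows, assuming each edge weight is the number of interactions (retweets or mentions) observed between one pair of users. The `edge_weight_ratio` helper and the sample interactions are hypothetical.

```python
from collections import Counter

def edge_weight_ratio(weighted_edges):
    """Unique edges divided by total edge weight.

    A value of 1.0 means no pair of users interacted more than once.
    """
    total_weight = sum(weighted_edges.values())
    if total_weight == 0:
        return float("nan")
    return len(weighted_edges) / total_weight

# Made-up interactions: one pair interacts three times, two pairs once.
interactions = [
    ("u1", "news"), ("u1", "news"), ("u1", "news"),
    ("u2", "news"),
    ("u1", "blog"),
]
weights = Counter(interactions)  # (source, target) -> interaction count
ratio = edge_weight_ratio(weights)
print(ratio)  # 3 unique edges / 5 interactions = 0.6
```

A ratio well below 1 signals that a few pairs account for much of the traffic, which is the situation in which the ranking tables and the graph scores are expected to diverge.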
Celli, F., Di Lascio, F., Magnani, M., Pacelli, B., Rossi, L. 2009. Social Network Data and Practices: the case of Friendfeed.
Cha, M., Haddadi, H., Benevenuto, F., and Gummadi, K. 2010. Measuring User Influence in Twitter: The Million Follower Fallacy.
Ognyanova, K. 2016. Network Analysis and Visualization with R and igraph.