
Introduction

As catchy as the title of this post is (I often pick provocative titles to get the audience's attention), the more accurate title would be "What moved the median?". This is a longer read than my earlier posts because I drill down deep into both the problem and the solution.

This post is about how I, along with some very able colleagues and our director, built a system that automated (with fairly high consistency) finding the reason for an undesirable shift in the median page onload time of a web page (http://www.w3.org/TR/navigation-timing-2/). The project met with great reviews and good success.

Problem Statement

The CEO of one of the companies I worked at was very particular about the consistency of the performance and experience of web page onload time. Various studies had shown that the onload time of a page had a direct relation to visits, unique visits and display ad revenue for the portal the page belonged to. The CEO wanted to know why the median moved for the worse and which dimensions (geo, browser etc.) showed worsened onload times, so as to hold the concerned teams accountable or capture learnings that would prevent future issues. My team had to deliver a weekly report on the area of concern, the cause and the corrective action.

Problem Details

As an example, let's say a company's portal had a median onload time of 2.3 seconds for a given week: 50% of the beacons observed had onload times faster than 2.3 seconds and 50% had onload times slower than 2.3 seconds. Assume that the week before, the median was 2.14 seconds. The team now had to report the what, where and why of the change to 2.3 seconds in the current week (an increase of roughly 7.5% over 2.14 seconds).

Each beacon is characterised by a tuple of 8 attributes which are

  1. Experimental Bucket  (used by Product Management for A/B Testing)
  2. Browser
  3. OS
  4. Device Type
  5. Netspeed
  6. City/Country
  7. PoD
  8. Colo

Using the attributes above, we need to localise the problem: which of these attributes (either by itself or in combination with other attributes) experienced delays?

It wouldn't be hard to see that a person accessing the portal on an older machine running Windows XP with IE over a cable connection would most likely have an onload time far in excess of 2.3 seconds. However, such users are too few today to be worth noticing (a low beacon count would show this). There will, however, be "Movers & Shakers": heavyweights among the beacon attribute aggregates that likely played a bigger role in the shift of the median.

Approach to the Solution

Let's begin by asking ourselves some simple questions:

  • When does the median shift for the worse?
  • How can we use frequent item sets to compute values of attributes whose mean onload time worsened the most?
  • Can we automate extraction of the reasons for the move of the median & are we confident of our solution?

When does the median shift for the worse?

This can be illustrated with a very simple example. Let's take a list of numbers, much like the onload times of various page requests: {1,2,2,2,3,3,3,3,4,4,4,4,4,5,5,5,5,6,6,7,8,9,10,11,12}. Plotted, this distribution is right-skewed, and its shape resembles that of page onload times. (Knowledge of the consistency of the shape and other distribution characteristics is vital and can be domain specific.)

The median here is 4. Put in the context of desirable onload times for a page, you'd like all users to have a good online experience, so you'd want to improve the onload times of the pages slower than the median. The world is not perfect: not all users across the world experience the same onload time for your page, which is why the distribution above resembles a real-life one.

Now let's say that, based on the new week's data, you observe that the median of the data above moved to 5. What could make the median move for the worse? There are three ways, each verified in the short sketch after this list:

  1. Good beacons from the previous period disappeared. Imagine a few good beacons disappearing: from the list given above, the data points {2,3,4,5} vanished.
  2. Good beacons from the previous period turned bad. Imagine a few good beacons (that were below the median last period) turning bad this period: one 2 became 5, one 3 became 4, and two 4s became 5.
  3. More beacons suddenly appeared on the wrong side of last week's median. Here, four new beacons with onload times {9,9,9,12} appeared, which shifted the median.
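
To make this concrete, here is a quick sketch (in Python, purely for illustration; the production system was built on different infrastructure) that replays the three cases on the example list and confirms that each one moves the median from 4 to 5:

```python
from statistics import median

# Last period's onload times (seconds); the median is 4.
baseline = [1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4,
            5, 5, 5, 5, 6, 6, 7, 8, 9, 10, 11, 12]
assert median(baseline) == 4

# Case 1: good beacons {2, 3, 4, 5} disappear.
case1 = baseline.copy()
for x in (2, 3, 4, 5):
    case1.remove(x)
assert median(case1) == 5

# Case 2: good beacons turn bad: one 2 -> 5, one 3 -> 4, two 4s -> 5.
case2 = baseline.copy()
for old, new in [(2, 5), (3, 4), (4, 5), (4, 5)]:
    case2.remove(old)
    case2.append(new)
assert median(case2) == 5

# Case 3: four new slow beacons {9, 9, 9, 12} appear.
case3 = baseline + [9, 9, 9, 12]
assert median(case3) == 5
```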

Notice that in the three cases above, we assumed we knew exactly which beacons went missing, changed or appeared. In reality, all we have is roughly 500,000 beacons (plus or minus a few) for each period, typically kept aggregated in a database or on Hadoop HDFS. How do we find out what disappeared, changed or appeared? This is where Frequent Item Sets come in (http://infolab.stanford.edu/~ullman/mmds/ch6.pdf).

Frequent Item Sets

This is a very interesting data mining technique for separating the wheat from the chaff (the original Apriori algorithm, https://en.wikipedia.org/wiki/Apriori_algorithm, was used for Market Basket Analysis). Those who know relational databases will be familiar with aggregations using "group by"; Frequent Item Sets is the granddaddy of "group by". It takes a threshold (also referred to as support) and finds all 1-, 2-, 3-, ... up to n-attribute-value combinations whose aggregate value (sum or count) is greater than the threshold.
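
As a toy illustration of what the technique produces (not the production code), here is a minimal sketch using the open-source mlxtend library's Apriori implementation on a handful of made-up beacons, with the 2.5% support threshold described below:

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori

# Each row is one beacon, described by a few of its attributes
# (values are made up for illustration).
beacons = pd.DataFrame({
    "browser":  ["Chrome", "Chrome", "IE",    "Chrome", "Firefox"],
    "os":       ["Win10",  "Win10",  "WinXP", "OSX",    "Win10"],
    "netspeed": ["fiber",  "cable",  "cable", "fiber",  "dsl"],
})

# Apriori expects a one-hot (boolean) matrix: one column per
# attribute value, one row per beacon.
onehot = pd.get_dummies(beacons).astype(bool)

# Find every 1-, 2-, ..., n-attribute-value combination whose beacon
# count is at least 2.5% of the total (the "support" threshold). On a
# toy sample everything passes; on 500,000 beacons this prunes hard.
itemsets = apriori(onehot, min_support=0.025, use_colnames=True)
print(itemsets.sort_values("support", ascending=False))
```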

With Frequent Item Sets applied to the beacon samples for a given period (which carry the beacon count and the sum of onload time):

  1. we extract frequent item sets of attributes that account for at least 2.5% of the total beacon count, and
  2. we extract frequent item sets of attributes that account for at least 2.5% of the total onload time.

We next do a full outer join of these two datasets and fill in the values that were cropped off on one side by the 2.5% threshold. With this we compute the mean onload time (after outlier removal) for these significant item sets.
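
A rough sketch of this join-and-average step, with hypothetical item-set aggregates standing in for the real mining output (all names and numbers below are made up):

```python
import pandas as pd

# Item sets that cleared the 2.5%-of-beacon-count threshold.
by_count = pd.DataFrame({
    "itemset":      ["Chrome,Win10", "IE,WinXP", "Chrome,fiber"],
    "beacon_count": [120_000, 15_000, 90_000],
})
# Item sets that cleared the 2.5%-of-total-onload-time threshold.
by_time = pd.DataFrame({
    "itemset":    ["Chrome,Win10", "IE,cable", "Chrome,fiber"],
    "onload_sum": [260_000.0, 95_000.0, 160_000.0],
})

# Full outer join: an item set may clear one threshold but not the
# other. In the real system, aggregates missing on one side were
# back-filled from the data; in this toy they simply remain NaN.
joined = by_count.merge(by_time, on="itemset", how="outer")

# Mean onload time per item set (outlier removal omitted here).
joined["mean_onload"] = joined["onload_sum"] / joined["beacon_count"]
print(joined)
```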

This gives us the proverbial "wheat": the frequent item sets that constitute the "Movers and Shakers" in the beacon sample collected for the period.

Now, if we do a full outer join of the frequent item sets for two consecutive periods, we can take the difference and the percentage change in mean onload time and sort by either to get the bad fellows that very likely made the median shift.
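
The week-on-week comparison can be sketched the same way; the two input frames below are hypothetical stand-ins for the per-period item-set means computed above:

```python
import pandas as pd

# Mean onload time (seconds) per item set for two consecutive
# periods (hypothetical numbers).
prev = pd.DataFrame({"itemset": ["Chrome,Win10", "IE,cable"],
                     "mean_onload": [2.1, 3.0]})
curr = pd.DataFrame({"itemset": ["Chrome,Win10", "IE,cable"],
                     "mean_onload": [2.6, 3.1]})

# Full outer join of the two periods, then the difference and the
# percentage change in mean onload time per item set.
wow = prev.merge(curr, on="itemset", how="outer",
                 suffixes=("_prev", "_curr"))
wow["delta"] = wow["mean_onload_curr"] - wow["mean_onload_prev"]
wow["pct_change"] = 100 * wow["delta"] / wow["mean_onload_prev"]

# Sorting by either column surfaces the likely culprits at the top.
print(wow.sort_values("pct_change", ascending=False))
```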

As an example, the table below summarises what Frequent Item Sets is able to extract. You can see that the "Movers and Shakers" in the "Frequent Item Sets" column are the area of concern: traffic dropped and performance dropped too. Contrast these item sets with irrelevant ones, where only 10 users on Windows XP on IE faced really slow onload times. Frequent Item Sets with the full outer joins help set the focus on what matters.

Correctness & Reliability

As far as correctness goes, practical experiments showed the accuracy with which we could identify and localise the shift in the median. Even a logical walk-through can help one see for oneself why this technique works.

When the median shifts, the mean is seen to shift in the same direction. I don't have a citation for this, but I wouldn't be surprised if it can be proven for distributions of this shape. With Frequent Item Sets it was therefore sufficient to compute just the mean; we did not need the median for each item set.

Many people raise the concern: what if the distribution of the data across the many dimensions does not permit extraction of frequent item sets at the 2.5% threshold? I counter them with questions like:

  • Are you sure the distribution is uniform across all variables?
  • Would there even be data mining, or patterns to find, if each variable (dimension) were uniformly distributed across the values of all other variables?

One never knows what works and brings value unless one tries.

Automation

This entire process can easily be automated using Java and libraries that derive frequent item sets from the beacon data set. It can be computationally hard, but sampling and Hadoop-like infrastructure make life tremendously easy.

Observations

As explained above, advanced analytics (rudimentary data mining, really) hit the problematic nail on the head and automated a complex task. This not only freed people up for better things but also gave much higher accuracy and conclusiveness while avoiding human error.

The above analytics solution is not something a platform engineer would have envisioned. I was successful in bringing data science to a platform team's problem. If the data is out there, good analysis and engineering can create wonders and cut costs in organisations.

Conclusion

There are bigger takeaways from this entire explanation. Many knowledge snippets I have acquired in my career as a "Data Science & Platforms" person are summarised below as questions that Analysts, Data Scientists and Data Platform Engineers must ask of themselves and of Product Owners.

  1. When building systems for Analysts, it's important to know not only what they want but also why they want it. The earlier approach to the solution described above was an ETL-based interactive analytics dashboard that platform engineers used for debugging; the automated system simply freed up the folks who had been doing this median-movement localisation manually.
  2. Sales and Product Managers seldom have a clue about the limitations of computation. What is the point of being accurate to the nth degree when a fairly good approximation works just fine with a low margin of error? Contrast exact unique counting, which is highly overrated, with sketch-based approximation (see the sketch after this list). How would a 0.06% margin of error make such a difference?
  3. Beware of the overselling architect. Many architects are so invested in platforms and fancy interactive UIs for OLAP operations that they have shut their minds to analytical solutions emanating from data science and mining. The dharma (duty) of a software architect should be to design and build a system that sustains over time with minimal intervention and with projections of scale. People argue that this will kill our jobs, but the bright side is that we free ourselves to work on other things, with the credibility of the success stories we have already built.
  4. My inbox has often been overwhelmed with hundreds of automated reports each day that no one can possibly read. Data science and analytics can help turn reporting into alerts that wake stakeholders up only when there is a real problem. Do not cry wolf with every little alert; wake me up only when there is a real problem.
  5. No matter how much you automate, and no matter how many OLAP operations (drill-downs, drill-throughs, roll-ups and such) you provide, good analysts prefer raw data, and ad hoc queries will rule. So if you have Data Analysts, just put the data on something like Apache Hive and give them read permissions.
  6. Use data to bring value to stakeholders. Know the problem well and use the various data analysis and mining techniques to maximise the impact.
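
To illustrate point 2, here is a small sketch contrasting exact unique counting with a probabilistic sketch, using the open-source datasketch library's HyperLogLog (the achievable error depends on the precision parameter p; with p=14 the typical relative error is around 1%, and tighter error needs more registers):

```python
from datasketch import HyperLogLog

# Exact unique counting needs memory proportional to the number of
# distinct users; a HyperLogLog sketch uses a small fixed-size array
# of registers regardless of how many users it sees.
hll = HyperLogLog(p=14)
exact = set()
for i in range(100_000):
    user = f"user-{i}".encode("utf8")
    hll.update(user)
    exact.add(user)

print(len(exact))        # 100000: exact, but memory grows with users
print(int(hll.count()))  # close to 100000, from a fixed-size sketch
```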

ETL platforms are only 20% of the solution. As Data Scientists and Data Platform engineers, it is our duty to solve problems, not just build fancy interactive UIs that encourage idle curiosity in Data Analysts. There is so much more that the industry can do.
