*Guest blog post by Vijay Rajan.*

As catchy as the title of this post is (*a trick I often use to grab attention, win approvals or spark argument*), the real title should be "What moved the median?". This is a long read compared to my earlier posts because I drill down deep into both the problem and the solution.

This post is about how I, along with some very able colleagues and our director, built a system that tried to automatically determine (*with fairly high consistency*) the reason for an undesirable shift in the median page onload time of a web page (http://www.w3.org/TR/navigation-timing-2/). The project met with some great reviews and good success.

The CEO of one of the companies I worked at was very particular about the consistency of performance and experience of web page onload time. Various studies had shown that the onload time of a page has a direct relation to visits, unique visits and display ad revenue from the portal the page belongs to. The CEO wanted to know why the median moved for the worse and which dimensions (geo, browser etc.) were seen to have worsened onload time, so as to hold the concerned folks accountable, or to capture the learnings and prevent future issues. The CEO wanted a weekly report from my team on the area of concern, the cause and the corrective action.

As an example, let's say that a company's portal's median onload time was 2.3 seconds for a given week: 50% of the beacons observed had onload times faster than 2.3 seconds and 50% had onload times greater than 2.3 seconds. Assume that the week before, the onload time was 2.14 seconds. The team now had to report what, where and why this time changed to 2.3 seconds in the current week (a 7.5% increase).

Each beacon is characterised by a tuple of 8 attributes which are

- Experimental Bucket (used by Product Management for A/B Testing)
- Browser
- OS
- Device Type
- Netspeed
- City/Country
- PoD
- Colo

Using the attributes above, we need to localise the problem: which of these attributes (either by itself or in combination with other attributes) experienced delays?

It wouldn't be hard to notice that a person accessing the portal on an older machine running Windows XP with IE over a cable connection would most likely have an onload time far in excess of 2.3 seconds. However, such users are too few and negligible today to take notice of (*a case where fewer beacons would show this*). There will, however, be "Movers & Shakers", heavy weights among the beacon property aggregates, that likely played a bigger role in the shift of the median.

Let's begin by asking ourselves some simple questions:

- When does the median shift for the worse?
- How can we use frequent item sets to compute values of attributes whose mean onload time worsened the most?
- Can we automate extraction of the reasons for the move of the median & are we confident of our solution?

This can be illustrated with a very simple example. Let's take a list of numbers, much like the onload times for various page requests: {1,2,2,2,3,3,3,3,4,4,4,4,4,5,5,5,5,6,6,7,8,9,10,11,12}. This distribution would look like the diagram below; the shape resembles that of the onload time of pages. (*Knowledge of the consistency of the shape and other distribution characteristics is vital and could be domain specific.*)

As can be seen above, the median is 4. Put in the context of desirable onload time for a page, you'd like all users to have a good online experience, so you'd want to improve the onload times of pages slower than the median. The world is not perfect; not all users across the world experience the same onload times for your page. Hence the distribution above resembles a real-life one.
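As a quick check, the median of the sample list above can be computed directly (a minimal sketch in Python):

```python
import statistics

# Onload-time sample from the example above
onload_times = [1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4,
                5, 5, 5, 5, 6, 6, 7, 8, 9, 10, 11, 12]

# 25 values, so the median is the 13th element of the sorted list
print(statistics.median(onload_times))  # -> 4
```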

Now let's say that, based on the new week's data, you observed that the median in the data above moved to 5. What are three reasons why the median moved for the worse?

- Good beacons from the previous period disappeared. Imagine a few good beacons disappeared: from the list given above, the data points {2,3,4,5} disappeared (one occurrence each).
- Good beacons from the previous period turned bad. Imagine a few good beacons (that were below the median last period) turned bad this period: here one 2 became 5, one 3 became 4, and two 4s became 5.
- More beacons suddenly appeared on the wrong side of where the median was last week. Here, 4 new beacons with onload times {9,9,9,12} appeared, which shifted the median.
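All three scenarios can be verified directly against the sample list (a quick Python check using the numbers from the bullets above):

```python
import statistics

base = [1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4,
        5, 5, 5, 5, 6, 6, 7, 8, 9, 10, 11, 12]
assert statistics.median(base) == 4

# Case 1: good beacons {2, 3, 4, 5} disappear (one occurrence each)
case1 = base.copy()
for v in (2, 3, 4, 5):
    case1.remove(v)
assert statistics.median(case1) == 5

# Case 2: good beacons turn bad (one 2 -> 5, one 3 -> 4, two 4s -> 5)
case2 = base.copy()
for old, new in [(2, 5), (3, 4), (4, 5), (4, 5)]:
    case2.remove(old)
    case2.append(new)
assert statistics.median(case2) == 5

# Case 3: new slow beacons {9, 9, 9, 12} appear
case3 = base + [9, 9, 9, 12]
assert statistics.median(case3) == 5
```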

Notice that in the three cases above, we assumed we knew exactly which beacons went missing, changed or appeared. In reality, all we have is 500,000 (plus or minus a few) beacons for each period, typically kept aggregated in some database or Hadoop HDFS. How do we find out what disappeared, changed or appeared? This is where we use Frequent Item Sets (http://infolab.stanford.edu/~ullman/mmds/ch6.pdf).

This is a very interesting data mining technique with which we can separate the wheat from the chaff (*the original Apriori algorithm, https://en.wikipedia.org/wiki/Apriori_algorithm, is used in Market Basket Analysis*). Those who know RDBMS databases should be familiar with aggregations from aggregate tables using "group by"; Frequent Item Sets is the *granddaddy* of "group by". It takes a threshold (also referred to as support) and finds all 1-, 2-, 3-, ... up to n-attribute value combinations whose aggregate value (*sum or count*) is greater than the threshold.

With Frequent Item Sets from the beacon samples for the given period, containing beacon count and sum of onload time:

- we extract frequent item sets of attributes which account for at least 2.5% of the total beacon count, and
- we extract frequent item sets of attributes which account for at least 2.5% of the total onload time.
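To make the two support thresholds concrete, here is a toy sketch (not the production pipeline; the attribute names and beacon values are hypothetical). It enumerates attribute-value combinations per beacon and keeps those that clear either threshold; a real implementation would use Apriori-style pruning rather than brute-force enumeration, though with 8 attributes the 2^8 combinations per beacon are manageable:

```python
from itertools import combinations
from collections import defaultdict

# Toy beacons: ({attribute: value}, onload_ms). Names are hypothetical.
beacons = [
    ({"browser": "Chrome", "os": "Win", "geo": "US"}, 1800),
    ({"browser": "Chrome", "os": "Win", "geo": "US"}, 2100),
    ({"browser": "Chrome", "os": "Mac", "geo": "US"}, 1500),
    ({"browser": "IE", "os": "WinXP", "geo": "US"}, 9000),  # rare but slow
    ({"browser": "Safari", "os": "Mac", "geo": "UK"}, 2400),
]

SUPPORT = 0.025  # the 2.5% threshold from the post

count = defaultdict(int)       # beacons per item set
onload_sum = defaultdict(int)  # total onload time per item set

for attrs, onload in beacons:
    items = sorted(attrs.items())
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            count[combo] += 1
            onload_sum[combo] += onload

total_count = len(beacons)
total_onload = sum(t for _, t in beacons)

# Keep item sets above either support threshold (count or onload share)
frequent = {
    c for c in count
    if count[c] / total_count >= SUPPORT
    or onload_sum[c] / total_onload >= SUPPORT
}

# Mean onload time for the heaviest item sets
for combo in sorted(frequent, key=lambda c: -onload_sum[c])[:3]:
    print(combo, count[combo], onload_sum[combo] / count[combo])
```

With only five toy beacons every item set clears 2.5%; on real data with hundreds of thousands of beacons, the threshold is what separates the "Movers and Shakers" from the noise.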

We next do a full outer join of these datasets and fill in the values that may have been cropped off by the 2.5% threshold. With this we compute the mean onload time (*after outlier removal*) for these significant item sets.

This gives us the proverbial "wheat" of frequent item sets which constitute the "Movers and Shakers" in the sample beacon collected for a period.

Now, if we do a full outer join of the frequent item sets for two consecutive periods, we can take the difference and percentage change in mean onload time and sort by either to get the bad fellows that very likely made the median shift.
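A minimal sketch of this period-over-period comparison, with hypothetical item sets and mean onload times:

```python
# Hypothetical mean onload times (ms) per frequent item set, two periods
prev = {("geo=US",): 1800, ("browser=Chrome",): 1700, ("geo=UK",): 2400}
curr = {("geo=US",): 2300, ("browser=Chrome",): 1750, ("os=Mac",): 1600}

changes, appeared, disappeared = [], [], []
for s in prev.keys() | curr.keys():  # full outer join on item set
    before, after = prev.get(s), curr.get(s)
    if before is None:
        appeared.append(s)       # new item set this period
    elif after is None:
        disappeared.append(s)    # item set dropped out this period
    else:
        diff = after - before
        changes.append((s, diff, 100.0 * diff / before))

# Worst absolute regression first: the "Movers and Shakers"
changes.sort(key=lambda t: -t[1])
print(changes[0])
print("appeared:", appeared, "disappeared:", disappeared)
```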

As an example, the table below summarises what Frequent Item Sets is able to extract. You can see that the "Movers and Shakers" in the "Frequent Item Sets" column are the area of concern: traffic dropped and performance dropped too. Contrast these item sets with irrelevant ones, where only 10 users on Windows XP on IE faced really slow onload times. Frequent Item Sets with the full outer joins help set the focus on what matters.

As far as correctness goes, practical experiments show the accuracy with which we can identify and localise the shift in the median. Even a logical walk-through can help one reason for oneself why this technique works.

When the median shifts, it can be seen that the mean shifts in the same direction as the median. I don't have a citation for this, but I wouldn't be surprised if there is a formal argument for it. With Frequent Item Sets it was sufficient to compute just the mean; we did not require the median for each of the item sets.

Many people raise the issue of what happens if the data distribution across multiple dimensions does not permit extraction of frequent item sets at the 2.5% threshold. I counter them with questions like:

- Are you sure the distribution is uniform across all variables?
- Would there even be data mining or patterns if each variable (dimension) was uniformly distributed across values of all other variables?

One never knows what works out and brings value unless one tries.

This entire process can easily be automated using Java and libraries that derive frequent item sets from the beacon data set. It can be computationally hard, but sampling and Hadoop-like infrastructure make life tremendously easy.

As explained above, advanced analytics (rudimentary data mining) can hit the problematic nail on the head and automate a complex task. This not only freed people up to do better things but also gave much higher accuracy and conclusiveness, and avoided human error.

The above analytics solution is not something that a platform engineer could have envisioned. I was successful in applying data science to solve a platform team's problem. If there is data out there, good analysis and engineering can create wonders and cut costs in organisations.

There are bigger takeaways from this entire explanation. Many knowledge snippets that I have acquired in my career as a "Data Science & Platforms" person are summarised below as questions that Analysts, Data Scientists and Data Platform Engineers must ask of themselves and of Product Owners.

- When building systems for analysts, it's important to know not only what they want but also why they want it. The earlier approach to the solution I proposed above was an ETL-based interactive analytics dashboard that the platform engineers used for debugging. I simply freed up the folks who did this median-movement localisation manually.
- Sales and Product Managers seldom have a clue about the limitations of computation. What is the point in being accurate to the nth degree when a fairly good approximation works just fine with a low margin of error? Contrast the highly overrated exact unique counting with a sketch. Would a 0.06% margin of error make such a difference?
- Beware of the overselling architect. Many architects are so involved in platforms and fancy interactive UIs for OLAP operations that they have shut their minds to analytical solutions emanating from data science and mining. The dharma (duty) of a software engineering architect should be to design and build a system that sustains over time with minimal intervention, with projections of scale. People argue that this will kill our jobs, but the bright side is that we free ourselves to work on other things after building credibility with the success stories we have delivered.
- My inbox has often been overwhelmed with hundreds of automated reports each day that no one can possibly read. Data science and analytics can turn such reporting into an alert that wakes stakeholders up only when there is a real problem. Do not cry wolf with every little alert; wake me up only when there is a real problem.
- No matter how much you automate, and no matter how many OLAP operations like drill-downs, drill-throughs and rollups you provide, good analysts prefer raw data, and ad-hoc queries will rule. So if you have data analysts, just put the data on something like Apache Hive and give them read permissions.
- Use data to bring value to stakeholders. Know the problem well and use various data analysis and mining techniques to maximise the impact.
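On the unique-counting point above: a K-Minimum-Values sketch (one of the simplest cardinality sketches, chosen here for illustration and not necessarily the one alluded to in the post) shows how a small fixed-size summary gets within a few percent of the exact count:

```python
import hashlib

def kmv_estimate(values, k=1024):
    """K-Minimum-Values sketch: estimate the distinct count from the
    k-th smallest normalized hash. Standard error is roughly 1/sqrt(k)."""
    hashes = set()
    for v in values:
        h = int(hashlib.md5(str(v).encode()).hexdigest(), 16)
        hashes.add(h / 2**128)  # normalize the 128-bit hash to [0, 1)
    smallest = sorted(hashes)[:k]
    if len(smallest) < k:
        return len(smallest)  # fewer than k distinct values: exact count
    return int((k - 1) / smallest[-1])

data = [f"user{i % 50000}" for i in range(200000)]  # 50,000 distinct users
est = kmv_estimate(data)
print(est, abs(est - 50000) / 50000)  # estimate and its relative error
```

For clarity this toy keeps every hash in memory; a real sketch retains only the k smallest, so the memory footprint is constant regardless of how many beacons flow through.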

ETL platforms are only 20% of the solution. As Data Scientists and Data Platform engineers, it is our duty to solve problems, not just make fancy interactive UIs that encourage idle curiosity in data analysts. There is so much more that the industry can do.

*Originally posted here.*

© 2020 Data Science Central ®
