How can I diagnose whether there is a group-level problem?
Imagine an Internet network. As a user you have a router that is connected to a node of some Internet service provider. Usually more than one router (user) is connected to the same node, and more than one node is located in the same city. Relationally, we could describe this as: City -< Node -< Router (one city has many nodes, one node has many routers).
I have a script that collects the number of errors for each router every minute, so my time series would look like:
All the data is in an SQL database, so it is easy to find which city/node/router has the highest error rate (select sum(errors) from table group by city/node/router). The dataset is too big to fit in a Python pandas DataFrame (on a laptop with 16 GB of RAM).
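Since the data does not fit in memory, the aggregation can stay in the database and only the small summary needs to come back to Python. Here is a minimal sketch of that idea using the standard-library `sqlite3` module. The table name `router_errors` and its columns (`ts`, `city`, `node`, `router`, `errors`) are assumptions made for illustration, not the real schema, and the sample rows are made up.

```python
import sqlite3

# Hypothetical schema: one row per router per minute (names are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE router_errors "
    "(ts TEXT, city TEXT, node TEXT, router TEXT, errors INTEGER)"
)
rows = [
    ("2018-11-14 08:00:00", "C1", "N1", "R1", 12),
    ("2018-11-14 08:01:00", "C1", "N1", "R1", 15),
    ("2018-11-14 08:00:00", "C1", "N1", "R2", 1),
    ("2018-11-14 08:00:00", "C1", "N2", "R3", 0),
]
conn.executemany("INSERT INTO router_errors VALUES (?, ?, ?, ?, ?)", rows)


def errors_per_router(conn, start, end):
    """Sum errors per router inside [start, end), keeping the hierarchy columns."""
    query = """
        SELECT city, node, router, SUM(errors) AS total_errors
        FROM router_errors
        WHERE ts >= ? AND ts < ?
        GROUP BY city, node, router
        ORDER BY total_errors DESC
    """
    return conn.execute(query, (start, end)).fetchall()


summary = errors_per_router(conn, "2018-11-14 08:00:00", "2018-11-14 10:00:00")
for city, node, router, total in summary:
    print(city, node, router, total)
```

Only one row per router comes back for the chosen window, so the result should be small enough to load into pandas even when the raw table is not.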
Now, looking through time, let's say that on 2018-11-14 from 08:00 to 10:00 routers R1, R2 and R3 show increased errors. If all three routers are connected to the same node N1 and no other routers are connected to N1, I could conclude that there may be a problem in node N1 during that period.
If R1-3 as well as R4 and R5 are connected to N1, I can still conclude that the problem could be with N1, because the majority of routers (R1-3 against R4-5) showed an increased number of errors. But if R1-3 and R4-100 are connected to N1, then the errors are probably not related to N1, because only 3% of the routers on N1 are problematic.
If the majority of routers on N1 show increased errors, and the majority of routers on N2 as well, and N1 and N2 account for the majority of the routers in city C1, then maybe city C1 is problematic.
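The majority reasoning above can be written down as a small roll-up. This is only a sketch, assuming the set of routers with increased errors in the period is already known. The function names, the 0.5 threshold, and the sample topology (including node N3) are illustrative assumptions.

```python
from collections import defaultdict


def flag_nodes(router_node, bad_routers, threshold=0.5):
    """Return nodes where the share of problematic routers exceeds threshold."""
    total, bad = defaultdict(int), defaultdict(int)
    for router, node in router_node.items():
        total[node] += 1
        if router in bad_routers:
            bad[node] += 1
    return {n for n in total if bad[n] / total[n] > threshold}


def flag_cities(router_node, node_city, bad_nodes, threshold=0.5):
    """Return cities where flagged nodes hold more than threshold of the routers."""
    total, in_bad = defaultdict(int), defaultdict(int)
    for router, node in router_node.items():
        city = node_city[node]
        total[city] += 1
        if node in bad_nodes:
            in_bad[city] += 1
    return {c for c in total if in_bad[c] / total[c] > threshold}


# Made-up topology: N1 has R1-R5, N2 has R6-R8, N3 has R9-R12, all in C1.
router_node = {f"R{i}": "N1" for i in range(1, 6)}
router_node.update({f"R{i}": "N2" for i in range(6, 9)})
router_node.update({f"R{i}": "N3" for i in range(9, 13)})
node_city = {"N1": "C1", "N2": "C1", "N3": "C1"}

# Routers that showed increased errors in the period of interest.
bad_routers = {"R1", "R2", "R3", "R6", "R7"}

bad_nodes = flag_nodes(router_node, bad_routers)
bad_cities = flag_cities(router_node, node_city, bad_nodes)
print(bad_nodes, bad_cities)
```

Here N1 (3 of 5 routers) and N2 (2 of 3) are flagged, and because those two nodes hold 8 of the 12 routers in C1, the city is flagged as well. A fixed majority threshold is the simplest possible rule, and a statistical test on the proportions may be more appropriate for nodes with very few routers.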
I hope that I managed to describe the problem.
Questions that I would like to answer are:
Can someone point me in the right direction and suggest techniques that could be used to answer such questions?
Sounds like a very interesting problem. It reminds me of the Network Weather Service, but that was maybe more specific to high performance computing.
Do you know a priori what level of error is "problematic" and what is acceptable, or is this the kind of thing where it's not clear going in what level of error makes a router, node, and so forth problematic? You may want to start with the classification question: look at the errors and see if any clusters jump out at you. Maybe you will find that there is some kind of cliff you fall off that takes you from baseline random errors into problematic territory. Maybe it will be some kind of power law thing that will jump right out at you.
This looks to be time series data, in which case just use caution because of timestamp differences, routers that think it's 1970 until they get the time, and so forth.
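One cheap guard against the "1970" case is to set aside records whose timestamps fall outside a plausible collection window before doing any analysis. This is a rough sketch. The record layout `(timestamp, router, errors)` and the window bounds are assumptions chosen for illustration.

```python
from datetime import datetime


def split_by_plausibility(records, earliest, latest):
    """Separate (timestamp, router, errors) records into plausible and suspect lists."""
    good, suspect = [], []
    for rec in records:
        if earliest <= rec[0] <= latest:
            good.append(rec)
        else:
            suspect.append(rec)
    return good, suspect


records = [
    (datetime(2018, 11, 14, 8, 0), "R1", 12),
    (datetime(1970, 1, 1, 0, 3), "R2", 4),  # clock not yet synchronised
    (datetime(2018, 11, 14, 8, 1), "R3", 9),
]

good, suspect = split_by_plausibility(
    records, earliest=datetime(2018, 1, 1), latest=datetime(2019, 1, 1)
)
print(len(good), "plausible;", len(suspect), "suspect")
```

Keeping the suspect records in their own list, rather than dropping them, makes it possible to check later whether one router or node is producing most of the bad timestamps.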
After you have a working definition of "problematic" and confidence that the timestamps are aligned, you might also want to cross-reference your timestamps with a calendar table (5 PM on a holiday might be different from 5 PM on a workday).
What I have presented as a problem here is a very simplified version of the real problem.
Let's say that I have a clue about which values are problematic.
When you said that you would start with the classification question, how would you classify here?
How would you form clusters?
It would depend on the details, but say for example this was at the TCP/IP level, and we were looking at connection time-out errors, and for each error we have a timestamp. Then a cluster is formed when we have a "problematic number of errors close together in time", meaning we are going to cluster on one dimension.

Now it depends on the data, but maybe we first try natural breaks to cluster. We want to try clusters on the timestamps, clusters on the time of day (e.g., 5 PM one day is the same as 5 PM another day), and maybe on the time of day for work days specifically. Laying it out on the time dimension like that may lend itself nicely to a probability distribution function, too.

Assuming clusters, though, if we have a good idea of what makes for a problematic cluster, we apply those rules to the clusters to create problem events. A problem is some clustering of errors, and a problem event has a time, a duration, and a magnitude: here, an event with such and such a number of connection time-outs over a period of time.
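To make the "problem event" idea concrete, here is a minimal sketch of one-dimensional clustering on error timestamps. It uses simple gap-based grouping rather than natural breaks: a new cluster starts whenever two consecutive errors are further apart than `max_gap`. The function names, the 5-minute gap, and the minimum of 3 errors per event are all assumptions for illustration.

```python
from datetime import datetime, timedelta


def to_event(group):
    """Summarise a cluster of timestamps as start time, duration and magnitude."""
    return {
        "start": group[0],
        "duration": group[-1] - group[0],
        "magnitude": len(group),
    }


def cluster_errors(timestamps, max_gap, min_count):
    """Group timestamps separated by at most max_gap; keep groups of >= min_count."""
    events, current = [], []
    for ts in sorted(timestamps):
        # A gap larger than max_gap closes the current cluster.
        if current and ts - current[-1] > max_gap:
            if len(current) >= min_count:
                events.append(to_event(current))
            current = []
        current.append(ts)
    if len(current) >= min_count:
        events.append(to_event(current))
    return events


day = datetime(2018, 11, 14)
timestamps = [
    day.replace(hour=8, minute=0),
    day.replace(hour=8, minute=1),
    day.replace(hour=8, minute=2),
    day.replace(hour=8, minute=3),
    day.replace(hour=9, minute=30),  # isolated error, not an event
    day.replace(hour=12, minute=0),
    day.replace(hour=12, minute=1),
    day.replace(hour=12, minute=2),
]

events = cluster_errors(timestamps, max_gap=timedelta(minutes=5), min_count=3)
for event in events:
    print(event["start"], event["duration"], event["magnitude"])
```

On this made-up data the sketch yields two events, one starting at 08:00 with four errors and one starting at 12:00 with three, while the single error at 09:30 is treated as baseline noise. The same function could be run on time-of-day values instead of full timestamps to look for recurring daily patterns.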
This sounds like a very fun problem. Best of luck, Glupe!