
Should Sunday, March 8, be declared the data glitch day of the year? Millions of analysts, data scientists, and webmasters checking their data yesterday found a chart like the one below. I was one of them, and I initially thought we had experienced a severe website failure around 3 am.

Statistics from Google Analytics, March 8, 2015

Usually failures are not this drastic: traffic does not suddenly drop all the way to zero like a falling plane, which is what made this one so unusual. I was also expecting this Sunday to be our greatest Sunday ever on our network (in terms of web traffic), and was really disappointed when I first noticed the dip. We eventually did make it the best Sunday ever, despite the glitch.

After trying to understand what happened, I realized that 3 am never occurred on March 8, because of daylight savings time. Indeed, that is how I learned that the daylight savings time change had taken place.
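As a quick check, here is a minimal sketch in Python (assuming the reports are rendered in US Eastern time via the standard zoneinfo database; the timezone is an assumption for illustration). It verifies that March 8, 2015 contained only 23 wall-clock hours, and that the apparent 61-minute gap between 1:59 am and 3:00 am was really one minute of elapsed time:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

eastern = ZoneInfo("America/New_York")  # assumed reporting timezone, for illustration

# March 8, 2015 had only 23 real hours in US Eastern time:
start = datetime(2015, 3, 8, 0, 0, tzinfo=eastern).astimezone(timezone.utc)
end = datetime(2015, 3, 9, 0, 0, tzinfo=eastern).astimezone(timezone.utc)
print(end - start)  # 23:00:00

# The wall clock jumped from 1:59 am EST straight to 3:00 am EDT,
# so only one minute of real time separates these two local times:
before = datetime(2015, 3, 8, 1, 59, tzinfo=eastern)
after = datetime(2015, 3, 8, 3, 0, tzinfo=eastern)
print(after - before)  # 0:01:00
```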

I propose that we make this Sunday the data glitch day of the year, as a reminder that all data sets are subject to glitches. Ironically, in this case it was not a data glitch (such as a failure to load into the database, or a server going down), but a reality glitch.

We all know that the path from reality to predictions consists of four steps:

  1. The real world
  2. Data representing elements of the real world
  3. Predictive models relying on data
  4. Predictions derived from models applied to data

Many times, data is messy and can be the weak element in this chain. Not here: it was indeed a real-world glitch!

This brings up a few interesting issues:

  • Did the reality glitch (daylight savings) trigger millions of alarm systems into issuing false alarms?
  • Is daylight savings really saving us money or not? Did it ever serve its purpose? Are there non-monetary side effects?
  • What if you were running automated statistical tests when it happened, such as an A/B test with even hours assigned to control and odd hours assigned to test? (See the sketch after this list.)
  • Impact on extreme values and records: we almost failed to make this day the best Sunday ever, because the day was artificially missing a whole 60 minutes.
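
To see how such a test gets skewed, here is a minimal sketch (in Python, again assuming US Eastern time and the hypothetical even/odd-hour assignment described above) that counts how many control and test hours actually occurred on March 8, 2015:

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

eastern = ZoneInfo("America/New_York")  # assumed timezone, for illustration

# Walk March 8, 2015 in real (UTC) time and bucket each local hour
# into control (even local hour) or test (odd local hour).
control_hours, test_hours = 0, 0
t = datetime(2015, 3, 8, 0, 0, tzinfo=eastern).astimezone(timezone.utc)
end = datetime(2015, 3, 9, 0, 0, tzinfo=eastern).astimezone(timezone.utc)
while t < end:
    if t.astimezone(eastern).hour % 2 == 0:
        control_hours += 1
    else:
        test_hours += 1
    t += timedelta(hours=1)

print(control_hours, test_hours)  # 11 12 -- the skipped 2 am hour was a "control" hour
```

The control group silently loses an hour of exposure relative to the test group, which biases any per-day comparison of the two.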



Replies to This Discussion

Hi Vincent,

This is a great article that highlights a common mistake people make with regard to logs and time stamps. I would, however, disagree with you and state that the problem is indeed in the data, or at least in the interpretation of the data. In reality, time has not changed, only the representation of time in your local timezone. Time stamps in log files are (or should be) recorded in UTC, or as local time plus an offset from UTC. Ignoring the offset is ignoring an important piece of the data, and that appears to be the case in the graph shown.

Correlating the data to real-world events is no different than comparing lunchtime in New York vs. lunchtime in LA. They happen at the same time. Right?
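
To illustrate the point about UTC versus local representation, here is a minimal sketch in Python (the log timestamp and the America/New_York zone are hypothetical, chosen purely for illustration). The underlying instant is unaffected by the transition; only its local rendering changes:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Hypothetical log entry, recorded in UTC as it should be.
utc_stamp = datetime(2015, 3, 8, 7, 30, tzinfo=timezone.utc)

# Converting to a viewer's local zone only changes the representation:
local = utc_stamp.astimezone(ZoneInfo("America/New_York"))
print(utc_stamp.isoformat())  # 2015-03-08T07:30:00+00:00
print(local.isoformat())      # 2015-03-08T03:30:00-04:00

# The "missing" 2 am hour is an artifact of the local rendering;
# in UTC the stream of events is continuous.
```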
