Scraping 101

Gauging the quality of a tennis player by analyzing the aces they have notched up, or predicting the next Pele or Cristiano Ronaldo by training a machine on the goals footballers have scored: problem statements like these, if truly answered, would win acclaim the world over. They all have one thing in common: every one of them requires huge (read: humongous) amounts of data.

As you enter this world of machine learning or data science, you realize –

What numbers are to mathematicians, data is to data scientists.

One just cannot live without the other! The very existence of this realm of computer science rests on one eternal truth, no matter what one feels or what one’s opinion is: data does not lie. It just can’t!

Data (a funny English plural of the often-forgotten singular, datum) is the only truth.
So what is data? “Facts and statistics collected together for reference or analysis,” as the dictionary says. Simply stated, it is fodder for analysis.

Miracles can be made by intuition, gut feeling, notions and beliefs. However, the fact of the matter is that in this day and age, unless you have data to back up your standpoint, you might as well not have a standpoint. One cannot go ahead and make crucial decisions without data to support them. This massive potential that data holds creates the need to get the data. By that I mean data extracted from the right sources, in the right manner, to the right extent, by the right people, at the right time, serving the right purpose and reaching the right conclusion!

In comes the problem of data extraction, for one cannot always rely on the open-source datasets available. Sure, if you want to crunch data about cancer, or American football league numbers, you can go right ahead with the well-kept datasets provided online. Often, however, we come up with requirements that are specific, and data that meets those specifications is hard to find. This gives rise to the need for collecting “relevant” data. While data extracted from reliable and pertinent sources is nothing short of gold, many a time random, adulterated and incomplete streams or packets make their way into our data lake. Cleaning up after such a mishandled and poorly organized extraction is like fixing an arcane program that runs to 1000+ lines of code. One would rather build a new mechanism than get into the mess involved in such a cleaning exercise (data or code). All of this highlights that collecting data is more of an art than a science. It took me some time trying my hand at scraping for this realization to sink in.
