Among everything else going on in the world, big data is another controversial topic, and the conversations are all over the place: forums, social media networks, articles, and blogs.
That is because big data is really important.
I say this not only as someone who works in the industry, but as someone who understands the disconnect between what goes on behind the scenes and what is reported in the media. It’s no secret that big data often has a bad reputation, but I don’t think that is the fault of the data so much as of how it’s being used.
The internet is the biggest source of data, and what organizations do with it is what matters most. While data can be analyzed for insights that lead to more strategic business decisions, it can also be stolen from social networks and used for political purposes. Among its almost infinite uses, big data can make our world a better place. This article aims to clear up some misconceptions and, hopefully, convince you that big data is a force for good.
Most of us know what big data is, but I think a quick summary is essential here. We’ve all observed how industry pundits and business leaders have demonized big data, but that’s like demonizing a knife. A minority of people may use a knife for nefarious purposes while the overwhelming majority of people would have a hard time feeding themselves without one.
It’s all about context.
A simple explanation I would give anyone outside the industry is that big data refers to data whose size, speed, and complexity make it difficult or even impossible to process using traditional methods.
Doug Laney, a consultant and author, initially expressed the term as a function of three concepts referred to as “the three V’s”: volume, velocity, and variety.
Social networks, government bodies, corporations, app developers, and a plethora of other organizations are interested in what you do, whether you are asleep or awake.
Everything is being monitored and collected, and an entire business has sprung up around the collection of big data. It is referred to as surveillance capitalism.
I think this is the aspect of big data that concerns everyone. The concern runs so deep, in fact, that many people use the two terms interchangeably.
The term surveillance capitalism, originally coined by Harvard professor Shoshana Zuboff, describes the business of purchasing data from companies that offer “free” services via applications. Users willingly use these services while the companies collect their data, and access to that data is then sold to third parties.
In essence, it’s the commodification of a person’s data with the sole purpose of selling it for a profit, which according to some analysts makes data the most valuable resource on earth. The data collected and sold enables advertising companies, political parties, and other players to target people for the sale of goods and services, improve existing products or services, or gauge opinion for political purposes, among many other uses.
Data collection can have various advantages for individuals and for society as a whole. Consider sites like Skyscanner, Google Shopping, Expedia, and Amazon Sponsored Products.
Just a few short years ago, comparison shopping required clicking between several sites. Today, with a visit to a single site, we can get price comparisons on almost every type of product or service. All these sites were built around data collection, and they are an example of a service some would say is essential to the ecommerce experience.
Data can be obtained in many ways. One common method is to purchase it from developers of applications or to collect it from a social network. The latter is usually restricted to the owners or stakeholders of the application.
Another way is called “web scraping”. This involves creating a script that analyzes a page and collects the public information on it. The scraped data is then compiled and delivered to the end user, often in a spreadsheet format, for analysis. That analysis stage is referred to as the mining process: valuable information is extracted from the raw data, similar to panning for gold among rocks.
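To make the scraping step described above a little more concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under stated assumptions, not how any particular provider does it: the page markup in `SAMPLE_HTML`, the `name` and `price` class names, and the helper names `ProductParser` and `scrape_to_csv` are all hypothetical, and the HTML is embedded as a string where a real script would download a public page.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page markup; a real script would fetch a public page over HTTP.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Desk lamp</span><span class="price">19.99</span></li>
  <li class="product"><span class="name">Office chair</span><span class="price">89.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects the text of <span class="name"> and <span class="price"> elements."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished (name, price) records
        self._field = None   # which field we are currently inside, if any
        self._current = {}   # fields gathered for the current <li>

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        # Text outside the spans of interest is ignored.
        if self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._current:
            self.rows.append((self._current.get("name"), self._current.get("price")))
            self._current = {}


def scrape_to_csv(html):
    """Parse the page and return the collected records as CSV text."""
    parser = ProductParser()
    parser.feed(html)
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["name", "price"])
    writer.writerows(parser.rows)
    return buffer.getvalue()


print(scrape_to_csv(SAMPLE_HTML))
```

The CSV text this produces is the kind of spreadsheet-style output an analyst would then mine for insights, such as which seller lists the lowest price.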
Just about any website with publicly available data can be scraped. Some of the most beneficial uses people may be familiar with include:
Whether it’s to book flights and hotel rooms or to buy cars and other consumer goods, web scraping is a useful tool for businesses that want to stay price-competitive. The largest benefits accrue to the end users, who are able to find the lowest prices.
Web scraping can be used to extract information and statistics on a variety of subjects, including the news, financial markets, and the spread of communicable diseases.
My company partnered with university students in the United States and Switzerland to support the TrackCorona and CoronaMapper websites that used scraped information from various sources to provide COVID-related statistics.
“Fake News” seems to be everywhere and can spread like wildfire on social networks. Several startups are working to combat the problem of misinformation in the news through the use of machine learning algorithms.
By analyzing and comparing large amounts of data, these processes can assess how accurate a story is. While many of these projects are still in development, they represent innovative solutions to the issue of false information by tracking it back to its source.
Small businesses and new startups looking to get ranked in search engines are in for an uphill battle, with the major players dominating page one. Since SEO can be very challenging, web scraping can be leveraged to research specific search terms, title tags, targeted keywords, and backlinks, and the results can feed an effective strategy that helps smaller players beat the competition.
The internet provides an almost unlimited source of data that can be used by research professionals, academics, and students for papers and studies. Web scraping can be a useful tool to obtain data from public sites in a wide array of areas, providing timely, accurate data on almost any subject.
Cybersecurity is a growing field that covers the security of computer systems, networking systems, and online surveillance. Besides corporate and government concerns, cybersecurity also spans email security, social network monitoring and listening, and other forms of tracking that keep systems safe and intact.
Big data is always changing as it grows and evolves, and part of the evolution should include the formation of some generally accepted ethical practices to keep the space free of corruption and mismanagement.
At Oxylabs, we feel that there are ethical ways to scrape data off the web that don’t compromise the interests of users or of the website servers providing them services.
The guidelines for scraping publicly available data should be based on respect for the intellectual property of third parties and sensitivity to privacy issues. It is equally important to employ practices that protect servers from being overloaded with requests.
Another suggestion is to scrape publicly available data with the intent to add value, which can enrich both the data landscape and the end user’s experience.
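As one possible illustration of the server-protection guideline above, the sketch below checks a site’s robots.txt rules before requesting a page and pauses between requests. It uses Python’s standard `urllib.robotparser` module. The robots.txt content, the `example-research-bot` user agent, the URLs, and the helper names `build_rules`, `polite_schedule`, and `crawl` are all assumptions made up for this example; it is a sketch of the general idea, not a description of Oxylabs’ own tooling.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "example-research-bot"  # hypothetical crawler name


def build_rules(robots_text):
    """Load robots.txt rules from text instead of fetching them over the network."""
    rules = RobotFileParser()
    rules.parse(robots_text.splitlines())
    return rules


def polite_schedule(rules, urls, default_delay=1.0):
    """Return (url, wait_seconds) pairs for only the URLs the rules allow."""
    # Use the site's requested crawl delay if it states one, else a default pause.
    delay = rules.crawl_delay(USER_AGENT) or default_delay
    return [(url, delay) for url in urls if rules.can_fetch(USER_AGENT, url)]


def crawl(rules, urls, fetch, sleep=time.sleep):
    """Fetch each allowed URL, pausing after every request to limit server load."""
    for url, wait in polite_schedule(rules, urls):
        fetch(url)
        sleep(wait)


rules = build_rules(SAMPLE_ROBOTS)
candidates = [
    "https://example.com/products",
    "https://example.com/private/data",
]
print(polite_schedule(rules, candidates))
```

Passing `fetch` and `sleep` in as parameters keeps the sketch free of real network calls. In practice `fetch` would be whatever HTTP client the scraper uses, and the delay would be tuned to what the target site can comfortably handle.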
Big data has received a terrible reputation thanks to negative perceptions created by media coverage of recent scandals. The truth is that those scandals reflect a very narrow view of what big data is all about. Big data simply refers to the handling of large streams of diverse data that traditional systems could not process.
Big data has almost unlimited uses, and some of the most positive involve optimization strategies that can improve our own lives and society as a whole. For this reason, factual information should be open and available to everyone.
At the end of the day, it’s about how the data is used, and as an executive of one of the largest proxy providers in the world, I can attest to the fact that there are many innovative players in the world today that are using big data as a force for good.