Defining Big Data

Defining big data is now a hot topic. UC Berkeley posted 40 very short definitions by thought leaders (including me). Our goal here is to offer a detailed, comprehensive definition that (hopefully) suits everyone.

First, there are three levels of data:

Level 1

This is about collecting data via sensors, log files, or any other data/signal capture mechanism. It involves raw internal data (for example, NASA videos of the night sky used to identify exoplanets) and external data (vendor or third-party data, or Internet data such as tweets about your company).

Typically, Level 1 data is big data. But it can be very sparse, easily compressed (with low content value), or rather static (not flowing very fast), which makes the data look large - but possibly shallow - rather than big.
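One rough way to tell "large but shallow" from genuinely big is to check how well a sample of the raw data compresses. The sketch below is only an illustration, not a method from this article: it uses Python's standard zlib module, and the function name and both sample payloads are made up for the example. A very high ratio suggests sparse, low-content data.

```python
import os
import zlib


def compression_ratio(raw: bytes) -> float:
    """Return the original size divided by the zlib-compressed size."""
    if not raw:
        raise ValueError("raw must contain at least one byte")
    return len(raw) / len(zlib.compress(raw, 9))


# Two hypothetical one-megabyte payloads (not taken from the article):
sparse_payload = bytes(1_000_000)        # all zeros, like an idle sensor feed
dense_payload = os.urandom(1_000_000)    # random bytes, essentially incompressible

print(f"sparse: {compression_ratio(sparse_payload):.0f}x")
print(f"dense:  {compression_ratio(dense_payload):.2f}x")
```

The ratio is only a proxy: highly compressible data can still carry value, so treat it as a first screening step rather than a verdict.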

Also note that there are two ways for data to qualify as big data:

  • Absolute big data: more than 10 terabytes per year, or more than 1 terabyte for specific analyses (such as root cause analyses). Few companies deal with this volume of data; after all, Census data covering about 300 million Americans takes much less space.
  • Relative big data: 10 times bigger (per time unit) than anything you've dealt with in the past, requiring new tools, employees, and methodology for your company.
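The two criteria above can be written down as simple threshold checks. This is a minimal sketch under stated assumptions: a terabyte is taken as 10^12 bytes, "10 times bigger" is read as "at least 10 times", and the function names and the 1 KB-per-person record size are hypothetical choices made for illustration.

```python
TB = 10**12  # assumption: decimal terabyte, in bytes


def is_absolute_big_data(bytes_per_year: float, bytes_per_analysis: float = 0) -> bool:
    """More than 10 TB per year, or more than 1 TB for a specific analysis."""
    return bytes_per_year > 10 * TB or bytes_per_analysis > 1 * TB


def is_relative_big_data(current_rate: float, past_peak_rate: float) -> bool:
    """At least 10 times the largest rate (per time unit) handled in the past."""
    if past_peak_rate <= 0:
        raise ValueError("past_peak_rate must be positive")
    return current_rate >= 10 * past_peak_rate


# Illustration only: 300 million records at an ASSUMED 1 KB each is 0.3 TB,
# well under the absolute threshold.
census_bytes = 300_000_000 * 1_000
print(is_absolute_big_data(census_bytes))

# A company moving from 50 GB/month to 600 GB/month is not big in absolute
# terms, but it qualifies as relative big data.
print(is_relative_big_data(600 * 10**9, 50 * 10**9))
```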

Collecting level 1 data is mostly a question of data plumbing.

Level 2

Here we are dealing with data summarization: deciding which metrics to track (raw and compound metrics), and how to store and blend the somewhat cleaned, curated data (using a database architecture: SQL, NoSQL, NewSQL, Hadoop, graph databases). This is the data organizing step.
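To make the distinction between raw and compound metrics concrete, here is a small sketch that rolls hypothetical level 1 events up into daily counts (raw metrics) and a click-through rate derived from them (a compound metric). The event fields, the metric names, and the summarize function are assumptions chosen for the example, not part of any particular platform.

```python
from collections import defaultdict


def summarize(events):
    """Aggregate raw events into per-day raw and compound metrics."""
    counts = defaultdict(lambda: {"impressions": 0, "clicks": 0})
    for event in events:
        day = counts[event["day"]]
        if event["type"] == "impression":
            day["impressions"] += 1
        elif event["type"] == "click":
            day["clicks"] += 1
        # Any other event type is ignored in this sketch.

    summary = {}
    for day, c in counts.items():
        # Compound metric: click-through rate, undefined without impressions.
        ctr = c["clicks"] / c["impressions"] if c["impressions"] else None
        summary[day] = {**c, "ctr": ctr}
    return summary


# Hypothetical log records standing in for level 1 data.
events = [
    {"day": "2014-09-01", "type": "impression"},
    {"day": "2014-09-01", "type": "impression"},
    {"day": "2014-09-01", "type": "impression"},
    {"day": "2014-09-01", "type": "impression"},
    {"day": "2014-09-01", "type": "click"},
    {"day": "2014-09-02", "type": "click"},
]
print(summarize(events))
```

In practice this aggregation would usually run inside the storage layer (for example, as a SQL GROUP BY or a Hadoop job) rather than in application code; the logic is the same.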

Level 3

At this stage, we are dealing with highly refined data, possibly no longer big data, available as reports, visuals, email alerts, automated (machine-to-machine) bidding, or detection of taxpayer accounts worth an IRS audit. In short, actionable data and insights. In some cases, it is still big data, e.g. when scoring all credit card transactions, predicting the value of every home in the US (including trends), or producing a LinkedIn profile connection graph. Most of the time, however, it is not big data. Whatever data it is, this is the data summarization step.

Hierarchical data level interactions

Data from level 1 is rarely accessed by level 3 data scientists. But sometimes it is, for instance when accessing raw log files (level 1) to analyze a fraud case (level 3). So there are bridges - and feedback loops - between the three levels.

We will also publish an article about the uplift provided by big data - broken down by industry - over a baseline that leverages small data only. It is our belief that big data is cheap and easy, compared with small data.

Additional Reading

  • Big Data Queen

    Vincent, Big Data is surely a big deal. We are definitely seeing an increase in activity, with companies responding to the impact big data has made on their business. For companies of any size, getting meaningful insights from data analytics is an important priority. LexisNexis has open sourced its HPCC Systems big data platform, which represents more than a decade of internal research and development in the big data analytics field. Designed by data scientists, its built-in libraries for Machine Learning and BI integration provide a complete integrated solution from data ingestion and data processing to data delivery. More at http://hpccsystems.com

  • Vincent Granville

    Examples of companies that truly need big data technology are:

    • Zillow: to estimate the value of all homes in the US, every week
    • LinkedIn: to produce a clustering of all members based on their connections
    • Visa: to process each credit card transaction in real time (fraud detection)