# Growth in Insights vs. Growth in Big Data

I randomly ran into this old thread of Vincent's on DSC regarding Moore's law and its applicability to how we might think about the growth of insights relative to the growth of our data. I took a humble crack at it. No idea if I'm right, but sometimes guessing at the solution can be fun and instructive. I'd love to hear your thoughts and insights below.

The original question is reproduced here for reference...

In a nutshell, Moore's law says that every two years, computer capacity (memory, speed, and so on) increases by a factor of 2. How does this apply to big data? Big data also seems to be growing exponentially; nobody will contest this statement.

Although data is growing exponentially, does information follow the same path? Information is extracted from data: it is the essence of data, what makes data valuable. If Information = F(data), where data is measured in petabytes, and information (say) in entropy units, what is the shape of the F function? If it is linear, it means that information is also growing exponentially. If it is a logarithm, then information is growing only linearly.

By information, I mean information that (1) has been found in big data (sometimes called insights), as opposed to invisible or undetected information and (2) used to provide added value.

My guess is that information is growing far slower than big data, but faster than linear over time (that is, super-linear). It's a bit like the growth of the Windows operating system. Windows might be 1,000 times bulkier than it was 30 years ago, but most of the new stuff added to Windows is rarely used, it still feels slow at times, and Excel spreadsheets are still limited to about 1,000,000 rows. Sure, your machine is much faster and has far more memory, but that's not thanks to Windows. In short, multiplying data size by 2 does not multiply useful information by 2.

How would you measure information growth?

First, I would guess that the distribution of insights at any given point in time would resemble attractor(s) confined in a space, bounded by the Data, {X1,...,Xn}.

Second, I would guess that the growth in the total number of "insights" would resemble a sigmoid curve, shifted so that the curve starts at zero, and plateaus towards a definite maximum, proportional to the total insight 'content' of the Data.

Why?

Insights have consequences. Insights don't exist in a vacuum: people exploit insights to take action, and those actions have positive and negative consequences. Some insights lead to actions that reinforce the insight; other insights lead to actions that diminish it. So, I believe that insights are "self-susceptible" in proportion to the opportunities and/or incentives they create.

• Case 1 (Positive Feedback): A thread discusses the famous case of beer and diapers. In a particular store, it was (supposedly) found that people who bought diapers also happened to buy beer. So the store apparently exploited this by shelving the beer closer to the diapers, driving additional beer sales. Here, they discover a prior correlation between two products and exploit that correlation to drive sales by providing easy access. So, an insight led to actions that strengthened the original relationship (beer and diapers) and produced a desirable outcome (product sales).

• Case 2 (Negative Feedback): A trader discovers that a certain stock is highly correlated with another stock and lags its movement by one day. This trader decides to exploit this and happens to make a lot of money in the process. As time goes on, more and more traders discover the correlation and jump on the same bandwagon, ultimately causing the correlation to disappear. Here, an insight (correlation between two stocks) led to an action (trading) that diminished the original relationship (as other people caught on) to the point where it disappeared entirely (the trade is no longer profitable).
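The two cases can be sketched as a toy simulation. The update rule and all parameter values below are my own illustrative assumptions, not anything from the thread; the only point is the qualitative behavior of the two feedback signs.

```python
# Toy simulation of how acting on an insight feeds back on its strength.

def simulate(strength, feedback, steps=50, rate=0.1):
    """Evolve an insight's strength (e.g., a correlation in [-1, 1]) as
    actors exploit it.

    feedback=+1: exploitation reinforces the relationship (beer & diapers).
    feedback=-1: exploitation erodes it (the crowded stock-lag trade).
    """
    history = [strength]
    for _ in range(steps):
        exploitation = rate * strength                # action scales with the insight
        strength += feedback * exploitation * (1 - abs(strength))
        strength = max(-1.0, min(1.0, strength))      # stay within correlation bounds
        history.append(strength)
    return history

reinforced = simulate(0.3, feedback=+1)   # climbs toward 1 (self-strengthening)
eroded = simulate(0.3, feedback=-1)       # decays toward 0 (self-defeating)
```

With positive feedback the strength saturates near its bound; with negative feedback it asymptotically vanishes, matching the "trade is no longer profitable" endpoint.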

As insights are 'discovered', they cause changes to the system that produced the insights (i.e., relationships among the Data) that either strengthen or weaken the original relationships leading up to the insight.  If we were to "visualize" these insights over time, we might see "regions" of "correlated" behavior (i.e., the insights) that would undergo birth, growth, and decay in response to their dynamically changing environment in which feedback plays a role.  I imagine this looks a lot like Conway's Game of Life.

In the example, we find patterns of emergence, sustenance, and decay. Much like these attractors in their 2-D 'box', I would expect "Insights" to be confined to their N-dimensional box, bounded by the Data, e.g., {X1,...,Xn}. I would also expect that "insights" would grow in proportion to the size of the enclosing space (the dimensionality of the data).

We might formalize our notions using the language of Information Theory. So, instead of correlation, we view the relationship in terms of mutual information (MI is more general, and also captures nonlinear influences).

• MI(Xi,Xj) is defined as Sum over all (xi,xj) of P(Xi,Xj)*[ln(P(Xi,Xj)) - ln(P(Xi)P(Xj))]
• The term [ln(P(Xi,Xj)) - ln(P(Xi)P(Xj))] is the interesting part: it describes the information gain, or how knowing Xj reduces our uncertainty about Xi
• MI(Xi,Xj) is really just a form of the KL divergence between two distributions, A and B, defined as the expectation of [ln(A) - ln(B)]
• For MI(Xi,Xj), the A and B distributions are simply the joint, P(Xi,Xj), and the product of the marginals, P(Xi)P(Xj)
• So, the "gain" in information about Xi, as a result of knowing additional information about Xj, can be defined as an "Insight", and this can be generalized to the multivariate case
• For instance, we can extend this to n such variables, {X1,...,Xn}, expressed as MI(X1,...,Xn)
• When we consider the entire set of MI()s from 2-way, 3-way, 4-way, ... up to n-way, we can then accumulate all the possible insights realizable from a set of multivariate data, {X1,...,Xn}
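The pairwise definition above translates directly into code. This is a minimal plug-in estimator over discrete samples (function name and example data are mine):

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """MI(X, Y) = sum over (x, y) of P(x,y) * [ln P(x,y) - ln P(x)P(y)], in nats."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))   # empirical joint distribution
    px = Counter(xs)             # empirical marginals
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in pxy.items():
        p_joint = count / n
        p_indep = (px[x] / n) * (py[y] / n)
        mi += p_joint * (log(p_joint) - log(p_indep))
    return mi

# Sanity checks: a variable shares ln(2) nats with itself (two equally
# likely values), and shares nothing with a constant.
xs = [0, 0, 1, 1] * 25
print(mutual_information(xs, xs))         # ≈ ln 2 ≈ 0.693
print(mutual_information(xs, [0] * 100))  # 0.0
```

Extending this to the 3-way, 4-way, ..., n-way terms mentioned above means replacing the pairwise joint with higher-order joints, which quickly becomes data-hungry; the pairwise case is just the simplest member of that family.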

All that is left is to find a way to subtract out the "hidden" insights leaving only the "observed" insights, per the original formulation of the question. For simplicity, we can assume that "observed" insights, N(o), are some fraction of all possible insights, N(i), but over time they converge to all possible insights given the data (i.e., eventually N(o)/N(i) → 1), reflecting our ever-improving efficiency as insight generators, given the help of technology.
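The assumption that N(o)/N(i) → 1 can be sketched as a saturating efficiency curve; the functional form and rate constant here are arbitrary illustrations, not estimates:

```python
from math import exp

def observed_fraction(t, a=0.05):
    """Assumed efficiency curve: the fraction N(o)/N(i) of all possible
    insights actually observed by time t. Starts at 0 and rises toward 1
    as tooling improves; the rate a is purely illustrative."""
    return 1.0 - exp(-a * t)

for t in (0, 20, 100):
    print(t, round(observed_fraction(t), 3))   # 0.0, then 0.632, then 0.993
```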

As an aside, I would be curious whether it makes a difference if we distinguish between horizontal data growth (more variables, and thus more associations among more variables) and vertical data growth (growth of longitudinal data, i.e., the length of time, or history, of any given variable).

In a finite universe, I would expect an upper limit on the total number of insights, since the number of possible associations is itself bounded by the size of the universe, with some subset of those associations providing no insight. This suggests an initial explosive growth in the generation of insights that gradually becomes linear and eventually plateaus. As we approach that plateau, each new insight tells us successively less than the previous one, leading to a sigmoidal shape.
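That growth curve can be written down as a shifted logistic; K, r, and t0 below are illustrative parameters, not estimates of anything:

```python
from math import exp

def insights(t, K=1000.0, r=0.15, t0=40.0):
    """Cumulative insights over time as a logistic (sigmoid) curve:
    near-zero start, explosive early growth, roughly linear middle,
    plateau at K, the total insight content of the data."""
    return K / (1.0 + exp(-r * (t - t0)))

# The marginal yield of a fixed effort window shrinks near the plateau:
early = insights(30) - insights(20)
late = insights(90) - insights(80)
print(early > late)  # True
```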

This makes sense as I would expect that we start out as very poor insight generators, but with time and technology, we accelerate to become super-efficient insight generators. Eventually, we reach the limit of total possible insights in a finite universe.

I would also suggest that there may be a threshold point that results in a Malthusian catastrophe as we "know too much" (i.e., loss of privacy, TMI, etc.), leading to strife, war, and eventual death.  To identify this point, we would have to have a measure of the social consequences of insights and insight generation.  My expectation is that this would happen well before we ever reached the limit of insight generation suggested by the size of the Data.

...I apologize that I started this intuitive exercise with Conway's Game of Life, and ended it with a Malthusian vision of Death.

Comment by Cristian Vava on February 26, 2015 at 8:34am

Plato’s original definition of knowledge was “justified true belief”. The first thing to notice is that knowledge doesn’t exist outside the human mind (belief) and its byproducts, like algorithms. Not every belief qualifies as knowledge; for example, “I believe it is raining outside” doesn’t include any justification. However, “I believe it is raining outside because I can hear it” includes a justification and MAY become knowledge. The reason for “may” is that I could be wrong; for example, I may be hearing a recording of rain, or someone washing the office windows with a hose (hoping I don’t have an early form of tinnitus). If it happens to be raining, then it was knowledge (true belief). The requirement to verify it has historically been the most challenging part of building knowledge. Search the web for the Gettier conundrum for an interesting intellectual exercise in infallibility.

Let me give you an (overly) simple example. In the hard sciences we take a so-called dependent variable and search for (hopefully) independent variables that may predict the dependent one (experimental design). Then we do some experiments by controlling the independent variables, accumulate enough data, and run a regression to build a model. If the model describes a large part of the variance of the dependent variable, bingo, we have knowledge. Otherwise we start over by changing the variables, their ranges, etc. In other words, even if we don’t think of it explicitly, we verify the justification before declaring it knowledge. In real life few of us do much pure science (anymore), so we use a shortcut: we build on what scientists have discovered by skillfully combining pieces of knowledge. In my previous example, I’m confident the temperature outside was -14C because scientists have verified the relation between temperature and the electrical resistance of a pure platinum sensor, and my thermometer is calibrated.

As if the previous process were not complex enough, the real challenges start with the “soft sciences”. Here we cannot control at least some of the independent variables, and sometimes we cannot even measure the independent variable directly. Let’s also not forget that we don’t actually verify the model; at best we fail to falsify it, which compounds the difficulty. The resulting model becomes a speculation of the second degree.

In Data Science the justification is the data. The degree to which we have verified the algorithm used to build a belief from the data places Data Science anywhere from miracle to wild speculation.

My response was triggered primarily by these two statements you included from the original post:

Information is extracted from data: it is the essence of data, what makes data valuable.

By information, I mean information that (1) has been found in big data (sometimes called insights).

Based on the implicit taxonomy derived from these statements, if data is infinite then insights must be infinite. Think only of what makes the mostly unstructured data available on the web valuable to me, to you, or to the other seven billion people. Should we include other forms of intelligent life in the universe and their unstructured data, or knowledge created by AI?

I mentioned AI as a joke but also as an example. We can formulate any hypothesis, but the more precise our language is, the more likely the hypothesis is to be falsifiable (or not). When we start to dig into the “AI takes over” hypothesis, we may immediately find major issues with it, from “how does it get powered” to “how does it know what this is”.

We may disagree about the definitions of knowledge and insight, but without a definition we don’t know what we are talking about. Without a precise language we may end up with a war of cults. It is better to make sure the science in Data Science is not false advertising.

I agree that a strong taxonomy is needed, and that was part of the challenge in this problem. Mathematical language tends to be more precise and limited in use than common usage of words.

I have many questions, but let's keep this simple...

Q: What do you mean by justified belief? How does a justified belief differ from a verified justified belief?

I do agree that beliefs should be part of this discussion. But are your justified beliefs necessarily the same as data-driven beliefs? A data-driven belief has a definite representation in the form of belief networks, which are defined by a set of hierarchical and conditional probabilistic relationships in the form of directed acyclic graphs. At the heart of what I'm saying is that the relationships within the data form the set of knowable insights, with my assumption being that we will eventually know whatever insights are available to be known (as 100%-efficient insight extractors). There are multiple ways to mathematically define a relationship, and they are certainly subject to lax or stringent criteria depending on one's particular definition. For instance, mutual information is more complete than second-order correlation, as it captures higher-order properties of a signal.
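That last point can be made concrete: for y = x² with x symmetric about zero, the (second-order) Pearson correlation is exactly zero while the mutual information is strictly positive. The sketch below uses only the standard library; the helper names and data are mine.

```python
from collections import Counter
from math import log, sqrt

def pearson(xs, ys):
    """Sample Pearson correlation: captures only linear (2nd-order) relationships."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

def mi(xs, ys):
    """Plug-in mutual information estimate (nats) over discrete samples."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(c / n * (log(c / n) - log(px[x] * py[y] / n**2))
               for (x, y), c in pxy.items())

xs = [-1, 0, 1] * 100
ys = [x * x for x in xs]   # y is perfectly determined by x, but not linearly

print(round(pearson(xs, ys), 6))  # 0.0  -- correlation misses the relationship
print(mi(xs, ys) > 0.5)           # True -- MI detects the dependence
```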

I don't think the lack of a clear taxonomy should stop us from guessing at a solution, even if only to find the holes in our logic or thinking, or to suggest a possible taxonomy. Would my post, for instance, be classified as a justified belief, even if it holds no water?

I think we are far from any kind of AI takeover. The sensational notion assumes we understand more than we actually do about the nature of cognition, and that it is somehow replicable in silico. Besides, much like the lysine contingency in Jurassic Park, our AIs have the natural limitation that they require power, and without it they would dim rather quickly. Even the Energizer Bunny's backpack would be insufficient to let their world domination proceed without incident.

Thanks!

Comment by Cristian Vava on February 25, 2015 at 5:31am

Although I tend to agree with the main idea that in a bounded and separable space of variables the insights can be refined to a very small set, I have trouble with the juggling of terms inherited from the original article and with the fundamental assumptions left unsupported until the end.

To give you an example: what do you mean by data, information, or insight, and where’s the knowledge? To help you understand the difference I would use the following taxonomy.

Data – is the result of sampling the environment. It is only a set of numbers, structured or not. This is the highest level our automatic processing may reach safely without any human intervention.

Information – is data that has a meaning attached to it through a process called codification. For example, one column of data represents the temperature, the next column the price of oil, etc. A computer program is unable to distinguish between sets of data; without “help,” all data is just a set of numbers. Until now even our smartest algorithms, including the famous Watson machine, have been unable to decide what a set of data represents, and this might put a brake on the Artificial Intelligence takeover nightmare.

Knowledge (factual) – is justified true belief that happens to be correct. For those with epistemology at heart, I have used Plato’s definition, enhanced to answer the Gettier conundrum. Knowledge results from a confirmation process: over and over again our justification was proven correct. For example: I believe the water on my patio is frozen. This is knowledge because my thermometer shows -14C; science has taught us that a thermometer reading is a reliable measure of temperature; my thermometer was properly calibrated; and at this atmospheric pressure water starts freezing well above this point. On the other hand, “if it looks like a duck, it quacks like a duck, and it walks like a duck, it MIGHT be a duck”. Without insight, building knowledge could be a painfully slow process.

Insight – is the verified justification of a belief. This follows from an abstraction process in which we summarize the verified justification in such a way as to support future knowledge. All the fundamental laws of physics, chemistry, and many other sciences are examples of insights, though minor insights are more common. With insight we can limit the data to a small set and still be able to describe the past and predict the future.

Based on this taxonomy, even in a bounded space of variables, data and information can grow infinitely, at least by changing the sampling rate. Knowledge can also grow infinitely, although not at the same rate or as fast as data, if only because it involves humans. To give you an example, let’s say I have a digital thermometer able to sample temperature once per minute. By upgrading it I can get one sample, or one million samples, per second, which is the exponential (and mostly gratuitous) increase in data. Since in this case we have established that the result of this measurement is temperature, the growth in data is identical to the growth in information. However, in many cases our sampling is poorly managed, providing data that doesn’t represent the information we have assumed; sometimes we don’t yet know the meaning of what we are sampling; and, last but not least, let’s not forget the massive unstructured data that may have meanings well beyond what anyone suspects now.
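The thermometer point can be sketched numerically: multiplying the sampling rate multiplies the data volume, but the empirical distribution, and hence its Shannon entropy, does not change. The readings below are made-up illustrative values.

```python
from collections import Counter
from math import log

def entropy(seq):
    """Shannon entropy (nats) of a sequence's empirical distribution."""
    n = len(seq)
    return -sum(c / n * log(c / n) for c in Counter(seq).values())

readings = [12, 13, 13, 14, 12, 13]                   # one sample per minute
upsampled = [r for r in readings for _ in range(10)]  # 10x the data volume

print(len(upsampled) / len(readings))            # 10.0 -- ten times the data
print(entropy(upsampled) == entropy(readings))   # True -- no new information
```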

Insights are very different. For example, the laws of mechanics governing the movement of bodies are very few, yet they describe ALL past, present, and future movements of ANY body in ANY environment. These fundamental laws represent the essence of insights. Many more insights are possible, describing subspaces of variables or ranges of values. In a poetic representation we may imagine information as a very shaky plot of data, with insights as local extremes. For any set of bounded and separable variables in a clearly defined range there is a single global maximum, which may be seen as the fundamental law governing that set. And this may indeed be finite.

In the end many readers may disagree with my views. The fundamental idea I wanted to express is that without a very clear and consistent taxonomy the whole argument has very little chance of holding water.