Not All Data is Useful. An Insight into Data Fitment Analysis.

There is a tendency, even among people who should know better, to view the data that one has access to in an organization as being of perfect quality and utility. In reality, the data that any organization collects over time can range from highly useful to a waste of computer cycles and processing effort, and an essential part of any data strategy is understanding what is a treasure and what is, to put it simply, an eyesore.

1. Entropy

Entropy is a measure of the uncertainty associated with a random variable.

Example: The meteorology department wants to tell whether it is going to rain today. It has weather data collected from various devices, with attributes for wind, pressure, humidity, and precipitation.

If you pick one value from the series of Humidity values, how confidently does it tell you whether it is going to rain? The uncertainty that remains is the entropy associated with the Humidity variable (strictly speaking, the entropy of the rain outcome given Humidity).

If the entropy is too high, the Humidity variable has little potential to tell whether it is going to rain. If the entropy is low, Humidity is a good variable to consider in further analysis.
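The entropy test can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function names `entropy` and `conditional_entropy`, the "low"/"high" binning of Humidity, and the eight readings are all hypothetical and made up for this example. It compares the uncertainty about rain before and after Humidity is known.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def conditional_entropy(feature, target):
    """Uncertainty left in `target` once the value of `feature` is known."""
    total = len(feature)
    groups = {}
    for f, t in zip(feature, target):
        groups.setdefault(f, []).append(t)
    # Weighted average of the entropy of the target within each feature value.
    return sum(len(ts) / total * entropy(ts) for ts in groups.values())

# Hypothetical readings: Humidity is assumed to be binned into "low"/"high".
humidity = ["high", "high", "high", "low", "low", "low", "high", "low"]
rain     = ["yes",  "yes",  "yes",  "no",  "no",  "no",  "no",   "no"]

h_rain = entropy(rain)
h_rain_given_humidity = conditional_entropy(humidity, rain)

print(f"Uncertainty about rain alone:          {h_rain:.3f} bits")
print(f"Uncertainty about rain given Humidity: {h_rain_given_humidity:.3f} bits")
```

The larger the drop from the first number to the second, the more Humidity tells you about rain. How small the remaining entropy must be for the variable to count as "good" is a judgment call that depends on the problem.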

2. Outliers

An outlier is a value that is unusual compared with the other values of a random variable.

Though Humidity has good potential to solve the problem, not all of its values may be useful to the calculation. Create a boxplot and count the outliers.

If a large percentage of values lie beyond the whiskers of the boxplot, the final outcome will be less accurate. In that case, discard the Humidity variable, pick another variable, and start again with the entropy test.
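The boxplot check can also be done numerically. The sketch below assumes the common boxplot convention that a value is an outlier when it lies more than 1.5 times the interquartile range beyond the first or third quartile. The function name `outlier_fraction` and the humidity readings are hypothetical, and the acceptable percentage of outliers is left for you to decide.

```python
import statistics

def outlier_fraction(values, k=1.5):
    """Return (fraction of outliers, list of outliers) using the k * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # first, second, third quartile
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr      # whisker limits of the boxplot
    outliers = [v for v in values if v < lower or v > upper]
    return len(outliers) / len(values), outliers

# Hypothetical humidity readings in percent, with two suspicious values.
humidity_pct = [61, 63, 64, 65, 66, 67, 68, 70, 71, 72, 5, 99]

fraction, outliers = outlier_fraction(humidity_pct)
print(f"{fraction:.0%} of the readings are outliers: {outliers}")
```

Note that `statistics.quantiles` offers more than one interpolation method, so the quartiles, and therefore the outlier count, can differ slightly from what a plotting library draws.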

3. Covariance

Covariance is a measure of the relationship between two variables: how variable X changes when variable Y changes. X and Y may have different units of measurement.

For example, if Humidity decreases as Wind increases, there is a relationship between Humidity and Wind. This relationship adds value in solving the problem.

The count we need is the number of variables that have covariance with at least one other variable. The higher this count, the more evidence we can derive towards the final outcome.
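Counting the related variables can be sketched as follows. Because covariance depends on the units of X and Y, this example makes an assumption that is not part of the text above: it divides the covariance by the two standard deviations (the Pearson correlation) so that "strong" can be judged on a common scale from -1 to 1. The 0.7 threshold, the function names, and the weather readings are hypothetical.

```python
import math
from itertools import combinations

def covariance(x, y):
    """Sample covariance of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Covariance rescaled to the range [-1, 1] (Pearson correlation)."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

def related_variables(columns, threshold=0.7):
    """Names of variables strongly related to at least one other variable."""
    related = set()
    for a, b in combinations(columns, 2):
        if abs(correlation(columns[a], columns[b])) >= threshold:
            related.update((a, b))
    return related

# Hypothetical readings: Humidity falls as Wind rises, Pressure barely moves.
weather = {
    "humidity": [80, 75, 70, 60, 55, 50],
    "wind":     [5, 8, 10, 15, 18, 20],
    "pressure": [1012, 1009, 1013, 1010, 1012, 1010],
}

related = related_variables(weather)
print(f"{len(related)} of {len(weather)} variables are related: {sorted(related)}")
```

With these readings, Humidity and Wind have a negative covariance and pass the threshold together, while Pressure does not relate strongly to either.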

Good Dataset:

Many variables have strong covariance with one or more other variables.

Bad Dataset:

Few variables have strong covariance with other variables.
Many variables have only weak covariance with many other variables.

A possible outcome of this assessment could look like this:

Humidity has the potential to tell with certainty whether it rains or not.
Wind has the potential to tell with certainty whether it rains or not.
Most of the values of Humidity and Wind can participate in the calculation. The accuracy is within acceptable limits.
Humidity and Wind together have more potential to drive the decision on whether it rains or not.

Finally, ask yourself these questions, and make sure you are satisfied with the answers:

How certain are the variables?
How much of the data is useful?
How many variables are related?

Originally published at https://www.meritedin.com.
