Not All Data is Useful. An Insight into Data Fitment Analysis.

There is a tendency, even among people who should know better, to treat the data an organization has access to as being of perfect quality and utility. In reality, the data that any organization collects over time can range from highly useful to a waste of computer cycles and processing effort. An important part of any data strategy is understanding what is a treasure and what is, to put it simply, an eyesore.

1. Entropy

Entropy is a measure of the uncertainty associated with a random variable.

Example: The meteorology department wants to tell whether it is going to rain today. It has weather data collected from various devices, with attributes for wind, pressure, humidity and precipitation.

If you pick one value from the series of Humidity values, how certainly can it tell you whether it is going to rain? That degree of uncertainty is the entropy associated with the Humidity random variable.

If the entropy is too high, the Humidity variable has little potential to tell whether it is going to rain. If the entropy is low, Humidity is a good variable to consider in further analysis.
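One way to put a number on this is to compare the entropy of the rain outcome on its own with the entropy that remains once the humidity reading is known. The sketch below is only an illustration: the toy readings, the "low"/"high" binning of humidity, and the function names are all assumptions, not part of the original assessment.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def conditional_entropy(feature_bins, labels):
    """Average entropy of the labels within each feature bin, weighted by bin size."""
    groups = defaultdict(list)
    for b, y in zip(feature_bins, labels):
        groups[b].append(y)
    total = len(labels)
    return sum(len(ys) / total * entropy(ys) for ys in groups.values())

# Hypothetical toy data: humidity readings already binned into "low"/"high".
humidity = ["high", "high", "high", "high", "low", "low", "low", "low"]
rain     = ["yes",  "yes",  "yes",  "no",   "no",  "no",  "no",  "yes"]

h_rain = entropy(rain)
h_rain_given_humidity = conditional_entropy(humidity, rain)

print(f"Entropy of rain alone:          {h_rain:.3f} bits")
print(f"Entropy of rain given humidity: {h_rain_given_humidity:.3f} bits")
```

The lower the second number is compared with the first, the more the humidity reading reduces the uncertainty about rain.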

2. Outliers

An outlier is a value that is unusual compared with the other values of a random variable.

Though Humidity has good potential to solve the problem, not all of its values may be useful to the calculation. Create a boxplot and determine the number of outliers.

If a large percentage of the values lie beyond the whiskers of the boxplot, the final outcome will be less accurate. In such a case, we need to discard the Humidity variable. Pick another variable and start again with the entropy test.
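The boxplot check can also be done numerically. The sketch below assumes the common boxplot convention that a value is an outlier when it lies more than 1.5 times the interquartile range beyond the quartiles; the readings and the function name are hypothetical, and what share of outliers counts as "too many" is left for you to decide.

```python
import statistics

def outlier_fraction(values, k=1.5):
    """Return the share of values beyond the boxplot whiskers, and those values."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr      # whisker limits
    outliers = [v for v in values if v < lower or v > upper]
    return len(outliers) / len(values), outliers

# Hypothetical humidity readings (%); the value 5 looks like a faulty measurement.
humidity = [62, 64, 65, 66, 67, 68, 70, 71, 72, 5]

fraction, outliers = outlier_fraction(humidity)
print(f"{fraction:.0%} of the readings are outliers: {outliers}")
```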

3. Covariance

Covariance is a measure of the relationship between two variables: how variable X changes when variable Y changes. X and Y may have different units of measurement.

For example, if Humidity decreases as Wind increases, then there is a relationship between Humidity and Wind. This relationship adds more value in solving the problem.

The count we need to measure is the number of variables that have covariance with at least one other variable. The higher this count, the more evidence we can derive towards the final outcome.
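A rough sketch of this count follows. Because raw covariance depends on the units of each variable, the sketch swaps in the Pearson correlation coefficient (covariance divided by the two standard deviations) so that "strong" can be compared across pairs. The readings, the 0.7 threshold and the function names are assumptions made for illustration only.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation: covariance of xs and ys scaled by their spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def related_variables(data, threshold=0.7):
    """Names of variables strongly related to at least one other variable."""
    related = set()
    for a, b in combinations(data, 2):
        if abs(pearson(data[a], data[b])) >= threshold:
            related.update((a, b))
    return related

# Hypothetical readings: humidity falls as wind rises; pressure moves independently.
data = {
    "humidity": [80, 75, 70, 65, 60, 55],
    "wind":     [5, 8, 10, 13, 15, 18],
    "pressure": [1012, 1009, 1013, 1010, 1012, 1010],
}

related = related_variables(data)
print(f"{len(related)} of {len(data)} variables are related: {sorted(related)}")
```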

Good Dataset:

  • Many variables that have strong covariance with a few or more other variables.

Bad Dataset:

  • Few variables that have strong covariance with a few other variables.
  • Many variables that have only weak covariance with many other variables.

A possible outcome of this assessment could look like this:

  • Humidity has the potential to tell with reasonable certainty whether it will rain.
  • Wind has the potential to tell with reasonable certainty whether it will rain.
  • Most of the values of Humidity and Wind can participate in the calculation. The accuracy is within acceptable limits.
  • Humidity and Wind together have more potential to drive the decision on whether it will rain.

Finally, you need to ask yourself these questions and be satisfied with the answers:

  1. How certain are the variables?
  2. How much of the data is useful?
  3. How many variables are related?

Originally published at https://www.meritedin.com.

Tags: #datascience, dsc_dataquality


Comment by Rene Eekelder on Tuesday

Great to focus on data quality and you make some strong points!

I would like to add some points here. Perfectly clean data are only to be expected in laboratory cases. In the real world, the question is often: is the data quality good enough? And is it possible to process the data so that its quality becomes acceptable?

In the case of outliers, it’s often necessary to know the reason for the outliers. The reason might be associated with the data itself or with the measurement.

In the case of outliers, problems may arise in the measurement chain: a connection has been broken, a battery is flat, contaminants like dust sit between the object of measurement and the device or on the device itself, or humans have made mistakes (wrong data entry, copying, etc.).

In the case of a flat battery, we might easily see the result, as the measurements have no values or a constant value of zero during the time the battery of the device was flat. We might discard this period in our dataset and keep the rest. And we might need to filter the data, because measured values consist of wanted data (humidity) and unwanted data (due to interference and other contaminants).

It’s also possible that the values are real outliers. In your example of humidity, one day or a period of days might be irregularly humid compared to other periods in the time of year and/or previous years. Maybe you want to exclude these values from your modelling. But in some cases, you have to deal with these outliers in your final analysis. Say that you measure weather conditions for crop growth: some crops can’t stand high or low levels of humidity.

Comment by Tom Wolfer on September 9, 2021 at 8:12am

Topic: Statistics - KDnuggets contains a free outlier analysis boxplot tool that analysts can use to isolate and provide insights into unusual behaviour based on a key measure.

© 2021   TechTarget, Inc.