How to Tell Whether a Dataset Has a Normal Distribution

The distribution of data describes how the data is spread out. This article covers some essential concepts of the normal distribution:

  • How to measure normality
  • Ways to transform a dataset to fit the normal distribution
  • How to use the normal distribution to showcase naturally distributed phenomena and provide statistical insights

Let's get started!

If you work in statistics, you know how vital the distribution of data is: we always sample from a population whose full distribution is unknown to us. As a result, the distribution of our sample might limit the statistical techniques available to us.

The normal distribution is a frequently observed continuous probability distribution.

When a dataset follows the normal distribution, you can employ additional techniques to explore the data further, such as:

  • Knowledge about the percentage of data in each standard deviation
  • Linear least-squares regression
  • Inference based on the sample mean

In some cases, it can be beneficial to transform a skewed dataset so that it follows the normal distribution. This is most relevant when your data would be normally distributed were it not for some distortion.

Here are the basic features of the normal distribution:

  • Symmetric bell shape
  • Mean and median equal, both at the center of the distribution
  • ≈68% of the data falls within 1 standard deviation of the mean
  • ≈95% of the data falls within 2 standard deviations of the mean
  • ≈99.7% of the data falls within 3 standard deviations of the mean

M.W. Toews via Wikipedia
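
The percentages listed above can be checked against the normal cumulative distribution function. The following is a minimal sketch that assumes SciPy is installed; the helper name share_within is hypothetical and chosen only for this illustration.

```python
from scipy.stats import norm


def share_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean.

    Hypothetical helper for illustration; it uses the standard normal CDF.
    """
    return norm.cdf(k) - norm.cdf(-k)


for k in (1, 2, 3):
    # Prints roughly 0.6827, 0.9545 and 0.9973
    print(f"within {k} SD: {share_within(k):.4f}")
```

Because every normal distribution is a shifted and scaled copy of the standard normal, these shares do not depend on the particular mean or standard deviation.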

Important terms you need to know as a general overview of the normal distribution:

  • Normal Distribution: A symmetric probability distribution frequently used to represent real-valued random variables. Also called the bell curve or Gaussian distribution.
  • Standard Deviation: A measure of the amount of variation or dispersion in a set of values. It is calculated as the square root of the variance.
  • Variance: The average of the squared distances of each data point from the mean.
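
To make the relationship between variance and standard deviation concrete, here is a small sketch using NumPy. The data values are made up for illustration, and the population form of the variance (dividing by n) is assumed.

```python
import numpy as np

# Illustrative data, not taken from any real dataset
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = values.mean()                      # 5.0
variance = np.mean((values - mean) ** 2)  # average squared distance from the mean: 4.0
std_dev = np.sqrt(variance)               # square root of the variance: 2.0

print(mean, variance, std_dev)
```

NumPy's np.var and np.std compute the same quantities directly; passing ddof=1 switches to the sample form that divides by n - 1.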

Ways to Use Normal Distribution

If the dataset you have does not conform to the normal distribution, you could apply these tips.

  • Collect more data: A small or low-quality sample can distort a dataset that would otherwise be normally distributed. Collecting more data is often the remedy.
  • Reduce sources of variance: Removing outliers can bring the data closer to the normal distribution.
  • Apply a power transform: For skewed data, you can apply the Box-Cox method, a family of power transforms that includes taking the square root or the log of each observation as special cases.
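
As a rough illustration of the power-transform tip, the sketch below applies scipy.stats.boxcox to synthetic right-skewed data. The lognormal sample, the seed, and the variable names are assumptions made for this example; note that Box-Cox requires strictly positive values.

```python
import numpy as np
from scipy.stats import boxcox, skew

# Synthetic, strictly positive, right-skewed data (assumed for illustration)
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# With no lambda given, boxcox estimates it by maximum likelihood
# and returns the transformed data together with that estimate.
transformed, fitted_lambda = boxcox(skewed)

skew_before = skew(skewed)
skew_after = skew(transformed)

print(f"lambda: {fitted_lambda:.3f}")
print(f"skewness before: {skew_before:.3f}, after: {skew_after:.3f}")
```

A fitted lambda near 0 corresponds to a log transform and a lambda near 0.5 to a square root, which is how the two familiar transforms fit into the Box-Cox family.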

Let's also look at some normality measures and how you would use them in a data science project.

Skewness

It is a measure of asymmetry relative to the mean.

Source: Rodolfo Hermans via Wikipedia

The above graph has negative skewness, which means that the tail of the distribution is longer on the left side. Counterintuitively, most of the data points are clustered on the right side. Do not confuse this with right (positive) skewness, which would be represented by this graph's mirror image.

A Brief on How to Use Skewness

Skewness is a significant factor in model performance. You can use the skew function from the scipy.stats module to measure it.

Source:  SciPy

The skewness measure can point us to potential differences in model performance across the range of feature values. A positively skewed feature, such as the second array in the above image, can lead to better performance on lower values, where most of the data lies.
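
The sign convention for skewness can be seen in a short sketch with scipy.stats.skew. The three samples are synthetic and the seed is arbitrary; they are assumptions for this illustration only.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)

symmetric = rng.normal(size=5000)          # no long tail on either side
right_skewed = rng.exponential(size=5000)  # long tail on the right
left_skewed = -right_skewed                # mirror image: long tail on the left

print(f"symmetric:    {skew(symmetric):+.2f}")     # close to 0
print(f"right-skewed: {skew(right_skewed):+.2f}")  # positive
print(f"left-skewed:  {skew(left_skewed):+.2f}")   # negative
```

Negating the data flips the sign of the skewness, which matches the mirror-image description above.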

Kurtosis

Kurtosis is a measure of the tailedness of a distribution. It is typically measured relative to 0, which is the kurtosis of the normal distribution under Fisher's definition. A positive kurtosis value indicates "fatter" tails.

The Laplace distribution has kurtosis > 0.

Via John D. Cook Consulting.

A Guide to Using Kurtosis

Understanding kurtosis provides a lens on the presence of outliers in a dataset. To measure kurtosis, you can use the kurtosis function from the scipy.stats module. Negative kurtosis indicates data with thinner tails than the normal distribution, and therefore fewer outliers.

Via SciPy
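
A short sketch of how scipy.stats.kurtosis behaves on three synthetic samples follows. The sample sizes and seed are arbitrary assumptions; by default the function uses Fisher's definition, so a normal sample should score near 0.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(7)

normal_sample = rng.normal(size=20000)
laplace_sample = rng.laplace(size=20000)  # fatter tails than the normal
uniform_sample = rng.uniform(size=20000)  # no tails beyond its bounds

k_normal = kurtosis(normal_sample)    # near 0 under Fisher's definition
k_laplace = kurtosis(laplace_sample)  # positive
k_uniform = kurtosis(uniform_sample)  # negative

print(f"normal: {k_normal:+.2f}, Laplace: {k_laplace:+.2f}, uniform: {k_uniform:+.2f}")
```

Passing fisher=False returns Pearson's definition instead, under which the normal distribution has kurtosis 3.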

A Caution about the Normal Distribution

Many naturally occurring datasets are said to conform to the normal distribution. This claim has been made for everything from IQ to human heights. While it is true that the normal distribution is drawn from observations of nature and occurs frequently, we risk oversimplification by applying this assumption too liberally.

Often the normal model won't fit well in the extremes. It also underestimates the probability of rare events.
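
To give a sense of scale for the point about rare events, the sketch below compares the probability of a value more than 4 standard deviations above the mean under a normal model and under a Laplace model with the same variance. The choice of the Laplace distribution and of the 4 SD threshold are assumptions made for this illustration, not a claim about any particular dataset.

```python
import numpy as np
from scipy.stats import laplace, norm

threshold = 4.0  # number of standard deviations above the mean (illustrative)

# Standard normal: mean 0, variance 1
normal_tail = norm.sf(threshold)

# A Laplace distribution has variance 2 * scale**2, so scale = 1/sqrt(2) gives variance 1
laplace_tail = laplace.sf(threshold, scale=1 / np.sqrt(2))

print(f"normal tail:  {normal_tail:.2e}")
print(f"Laplace tail: {laplace_tail:.2e}")
print(f"ratio: {laplace_tail / normal_tail:.0f}x")
```

If the real data has tails closer to the fatter-tailed model, a normal assumption would understate how often such extreme values occur.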

Calculate the Share of Values within SD

As a dataset grows larger, calculating the standard deviation (SD) and the number of values falling within each band of the bell-shaped curve by hand becomes difficult. An empirical rule calculator can make the process faster: it computes the share of values that fall within a given number of SDs of the mean, or dataset average. To calculate these percentages, we only need the mean and SD.
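
The same calculation can be sketched in a few lines of NumPy. The function name share_within_sd is hypothetical, the sample is synthetic, and this is only an illustration of the idea rather than how any particular calculator is implemented.

```python
import numpy as np


def share_within_sd(values, k):
    """Fraction of values lying within k standard deviations of the mean.

    Hypothetical helper for illustration; uses the population SD (ddof=0).
    """
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    sd = values.std()
    return float(np.mean(np.abs(values - mean) <= k * sd))


# Synthetic sample with an assumed mean of 100 and SD of 15
rng = np.random.default_rng(1)
sample = rng.normal(loc=100.0, scale=15.0, size=100_000)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within_sd(sample, k):.3f}")
```

For data that is close to normal, the printed shares should land near 68%, 95% and 99.7%; large departures from those figures are a hint that the normal model does not fit.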

Summary

This brief article covered the basics of the normal distribution: some fundamental concepts, how to measure normality, and how to use it. Make sure not to over-apply the normal distribution, or you risk discounting the chances of outliers. Let us know how it helped you in understanding the concepts.