
A Gentle Introduction to Non-Parametric Tests


What are Non-parametric tests?

Most statistical tests are optimal under assumptions such as independence, homoscedasticity, or normality. However, it is not always possible to guarantee that the data satisfy all of these assumptions. Non-parametric tests are statistical methods that don't need the normality assumption; it is replaced by a more general assumption about the distribution function.

Non-parametric and Distribution-free

Often the terms non-parametric and distribution-free are used interchangeably. However, the two terms are not exactly synonymous. A problem is parametric or non-parametric depending on whether we allow the parent distribution of the data to depend on a finite number of parameters or keep it more general (e.g. just continuous). Thus, it depends on how we formulate the problem. A procedure is distribution-free, on the other hand, if it does not depend on the parent distribution or its parameters. Hence, both parametric and non-parametric methods may or may not be distribution-free. However, distribution-free procedures were developed primarily for non-parametric problems, which is why the two terms are used interchangeably.

When to use Non-parametric tests:

1. When the data do not satisfy the necessary assumptions, such as normality
2. When the sample size is too small, since the assumptions then become difficult to justify
3. When the data are nominal or ordinal. For example, customer feedback in the form "Strongly disagree, Disagree, Neutral, Agree, Strongly agree"
4. When the data are ranked. For example, customers rank a list of products
5. When the data contain outliers
6. When the measurement process has a lower or upper bound beyond which it just reports "Not measured" or "Not detected"
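To illustrate the point about outliers, here is a minimal sketch, assuming SciPy is installed. The numbers are made up for illustration. It compares a two-sample t-test with the rank-based Mann-Whitney U test before and after one value is replaced by an extreme outlier.

```python
from scipy.stats import mannwhitneyu, ttest_ind

# Hypothetical measurements for two independent groups.
group_a = [10, 11, 12, 13, 14]
group_b = [15, 16, 17, 18, 19]

# Same data, but the largest value in group_b becomes an extreme outlier.
group_b_outlier = [15, 16, 17, 18, 190]

# Parametric test: depends on means and variances, so the outlier matters.
t_clean = ttest_ind(group_a, group_b).pvalue
t_outlier = ttest_ind(group_a, group_b_outlier).pvalue

# Rank-based test: the outlier is still just "the largest value".
u_clean = mannwhitneyu(group_a, group_b, alternative="two-sided").pvalue
u_outlier = mannwhitneyu(group_a, group_b_outlier, alternative="two-sided").pvalue

print("t-test p-values:      ", t_clean, t_outlier)
print("Mann-Whitney p-values:", u_clean, u_outlier)
```

Because the ranks of the pooled data are unchanged, the Mann-Whitney result should stay the same, while the inflated variance weakens the t-test even though the groups are further apart.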

Advantages and disadvantages of Non-parametric tests:

Advantages:

1. They need fewer assumptions and hence can be used in a broader range of situations
2. A wide range of data types, and even small samples, can be analyzed
3. They have more statistical power than their parametric counterparts when the assumptions are violated in the data

Disadvantages:

1. If the assumptions are not violated, the statistical power of a non-parametric test is lower than that of the analogous parametric test. In a way, if the assumptions are not violated, using a non-parametric test wastes information in the data
2. For large samples, they can be computationally expensive

Note that if the data follow the assumptions (mainly the normality assumption), it is generally wise to apply parametric tests. Even when the normality assumption is not met, parametric tests can sometimes be applied if the sample size is large enough.

Below, we introduce some of the most useful non-parametric tests, each with brief Python code.

Each entry below lists, in order: nature of hypothesis, non-parametric test, parametric counterpart, when to use, and simple Python code*.

Nature of hypothesis: One sample's location (median)

Non-parametric test: Simple sign; Wilcoxon signed rank

Parametric counterpart: Student's t

When to use: Whether the median of the sample is equal to an assumed value in the population

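# Simple sign test (data1 is assumed to be an existing 1-D numeric sample)
from statsmodels.stats import descriptivestats

# mu0 is the hypothesized median (0 is the default)
stat, p = descriptivestats.sign_test(data1, mu0=0)
print("single sample sign test p-value", p)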
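To show what the sign test computes, the following is a minimal pure-Python sketch, not the statsmodels implementation. The helper name `sign_test_p_value` and the sample values are hypothetical. It counts observations above and below the assumed median and compares the smaller count against a Binomial(n, 0.5) distribution.

```python
from math import comb

def sign_test_p_value(sample, mu0):
    """Two-sided sign test p-value for H0: median == mu0."""
    # Count observations above and below mu0; values equal to mu0 are dropped.
    n_pos = sum(1 for x in sample if x > mu0)
    n_neg = sum(1 for x in sample if x < mu0)
    n = n_pos + n_neg
    k = min(n_pos, n_neg)
    # Under H0 each sign is a fair coin flip, so the count is Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical sample: 8 values lie above 2.0 and 2 lie below it.
data1 = [1.2, 2.5, 3.1, 0.8, 2.9, 3.4, 2.2, 3.8, 4.1, 2.7]
p_sign = sign_test_p_value(data1, mu0=2.0)
print("sign test p-value:", p_sign)
```

Only the signs of the deviations enter the calculation, which is why the test needs no assumption about the shape of the distribution.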

Nature of hypothesis: Paired samples' location (median)

Non-parametric test: Simple sign; Wilcoxon signed rank

Parametric counterpart: Paired t

When to use: Whether the medians of the two paired samples are equal to each other

# Paired-sample Wilcoxon signed rank test
# (data1 and data2 are assumed to be paired samples of equal length)
from scipy.stats import wilcoxon

stat, p = wilcoxon(data1, data2)
print("Paired sample wilcoxon signed rank test p-value", p)
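As a rough sketch of what the signed rank statistic measures, the code below ranks the absolute paired differences and sums the ranks separately for positive and negative differences. The before/after values are hypothetical, and the manual steps are an illustration, not SciPy's internal code.

```python
from scipy.stats import rankdata, wilcoxon

# Hypothetical paired measurements (same subjects, two occasions).
before = [125, 115, 130, 140, 140, 115, 140, 125, 140, 135]
after = [110, 122, 125, 120, 140, 124, 123, 137, 135, 145]

# Paired differences; zero differences carry no sign and are dropped.
diffs = [b - a for b, a in zip(before, after) if b != a]

# Rank the absolute differences (tied values share the average rank).
ranks = rankdata([abs(d) for d in diffs])

# Sum the ranks belonging to positive and to negative differences.
w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
print("W+ =", w_plus, "W- =", w_minus)

# For the default two-sided test, SciPy reports the smaller of the two sums.
stat, p = wilcoxon(before, after)
print("scipy statistic:", stat, "p-value:", p)
```

If the two rank sums are far apart, the differences lean consistently in one direction, which is evidence against equal medians.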

Nature of hypothesis: Two independent samples' location (median)

Non-parametric test: Wilcoxon rank sum; Mann-Whitney U

Parametric counterpart: Fisher's t (two-sample t)

When to use: Whether the medians of two independent samples are equal

# Two independent sample Mann-Whitney U test
# (data1 and data2 are assumed to be independent samples; lengths may differ)
from scipy.stats import mannwhitneyu

stat, p = mannwhitneyu(data1, data2)
print("two sample mann whitney test p-value", p)
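The U statistic itself can be sketched in a few lines of plain Python: for every pair made of one value from each sample, count how often the first sample's value is larger. The helper name `mann_whitney_u` and the data are hypothetical, and this illustrates the definition only, not an efficient implementation.

```python
def mann_whitney_u(x, y):
    """U statistic for x: count of (x_i, y_j) pairs with x_i > y_j; ties count 0.5."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u

# Hypothetical independent samples of different sizes.
data1 = [19, 22, 16, 29, 24]
data2 = [20, 11, 17, 12]

u1 = mann_whitney_u(data1, data2)
u2 = mann_whitney_u(data2, data1)
# The two statistics always add up to len(data1) * len(data2).
print("U1 =", u1, "U2 =", u2)
```

A value of U1 close to 0 or close to the number of pairs suggests that one sample tends to take larger values than the other.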

Nature of hypothesis: General two independent samples

Non-parametric test: Wald-Wolfowitz run

When to use: Whether two independent samples have been drawn from the same distribution

# Wald-Wolfowitz run test for two independent samples
# (data1 and data2 are assumed to be existing 1-D numeric samples)
from statsmodels.sandbox.stats.runs import runstest_2samp

stat, p = runstest_2samp(data1, data2)
print("two sample Wald Wolfowitz run test p-value", p)
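The idea behind the run test can be sketched without any library: pool the two samples, sort them, and count the runs of consecutive values from the same sample. The helper name `count_runs` and the data are hypothetical, and the sketch assumes no value appears in both samples.

```python
def count_runs(x, y):
    """Number of runs of same-sample labels in the pooled, sorted data."""
    # Tag each value with its sample of origin, then sort the pooled values.
    pooled = sorted([(v, "x") for v in x] + [(v, "y") for v in y])
    labels = [label for _, label in pooled]
    # A new run starts at the first label and at every change of label.
    return sum(
        1 for i, label in enumerate(labels) if i == 0 or label != labels[i - 1]
    )

interleaved = count_runs([1, 3, 5, 7], [2, 4, 6, 8])
separated = count_runs([1, 2, 3, 4], [5, 6, 7, 8])
print("runs when the samples interleave:", interleaved)
print("runs when the samples separate:  ", separated)
```

Many runs mean the samples are well mixed, as expected under a common distribution. Very few runs suggest the samples come from different distributions.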

Nature of hypothesis: Multiple samples' location

Non-parametric test: Kruskal-Wallis H

Parametric counterpart: ANOVA

When to use: Whether more than two samples have been drawn from the same distribution

# Kruskal-Wallis H test for multiple samples
# (data1, data2 and data3 are assumed to be independent samples)
from scipy.stats import kruskal

stat, p = kruskal(data1, data2, data3)
print("multiple sample Kruskal Wallis H test p-value", p)
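For intuition, the H statistic can be computed by hand from the rank sums of each group. The sketch below uses the textbook formula without a tie correction, so it assumes all observations are distinct. The helper name `kruskal_h` and the data are hypothetical.

```python
from math import exp

def kruskal_h(*groups):
    """Kruskal-Wallis H statistic, assuming all observations are distinct."""
    pooled = sorted(v for g in groups for v in g)
    rank = {v: i + 1 for i, v in enumerate(pooled)}  # rank 1 = smallest value
    n_total = len(pooled)
    # Sum over groups of (rank sum)^2 / group size.
    term = sum(sum(rank[v] for v in g) ** 2 / len(g) for g in groups)
    return 12.0 / (n_total * (n_total + 1)) * term - 3 * (n_total + 1)

# Hypothetical, fully separated groups.
data1, data2, data3 = [1, 2, 3], [4, 5, 6], [7, 8, 9]
h = kruskal_h(data1, data2, data3)

# With 3 groups, H is compared to a chi-square with 2 degrees of freedom,
# whose upper tail probability is exp(-h / 2).
p_approx = exp(-h / 2)
print("H =", h, "approximate p-value =", p_approx)
```

The chi-square comparison is a large-sample approximation, so with groups this small the p-value is only indicative.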

*In most cases, the Python code runs a two-tailed test by default.

Conclusion:

Statistical tests are powerful tools for studying and comparing samples. This article introduced the concept of non-parametric tests, when to use them, their advantages and disadvantages, and several non-parametric tests along with Python code. Wise use of these concepts will help in analyzing a wide range of samples with minimal assumptions.