*Guest blog post by Dmitry Petrov. Originally posted here. *

There is a feature I really like in Apache Spark. **Spark can process data out of memory in my local machine even without a cluster.** Good news for those who process data sets bigger than the memory size that currently have. From time to time, I have this issue when I work with hypothesis testing.

**For hypothesis testing I usually use statistical bootstrapping techniques. This method does not require any statistical knowledge and is very easy to understand.** Also, this method is very simple to implement. There are no normal distributions and student distributions from your statistical courses, only some basic coding skills. Good news for those who doesn’t like statistics. Spark and bootstrapping is a very powerful combination which can help you check hypotheses in a large scale.

The most common application with bootstrapping is calculating confidence intervals and you can use these confidence intervals as a part of the hypotheses checking process. There is a very simple idea behind bootstrapping – sample your data set size N for hundreds or even thousands times with the replacement (this is important) and calculate the estimated metrics for each of the hundreds\thousands subset. This process gives you a histogram which is an **actual distribution for your data**. Then, you can use this actual distribution for hypothesis testing.

The beauty of this method is the actual distribution histogram. In a classical statistical approach, you need to approximate a distribution of your data by normal distribution and calculate z-scores or student-scores based on theoretical distributions. With the actual distribution from the first step it is easy to calculate 2.5% percentile and 97.5% percentiles and this would be your actual confidence interval. That’s it! **Confident interval with almost no math.**

Choosing right hypotheses is only the tricky part in this analytical process. This is a question you ask the data and you cannot automate that. Hypotheses testing is a part of the analytical process and isn’t usual for machine learning experts. **In machine learning you ask an algorithm to build a model\structure which is sometimes called hypothesis and you are looking for the best hypotheses which correlates your data and labels.**

**In the analytics process, knowing the correlation is not enough**, you should know the hypothesis from the get-go and the question is – if the hypothesis is correct and what is your level of confidence.

If you have a correct hypotheses it is easy to check the hypotheses based on the bootstrapping approach. For example let’s try to check the hypothesis in which we take an average for some feature in your dataset that is equal to 30.0. We should start with a null hypothesis H0 which we try to reject and an alternative hypothesis H1:

H0: mean(A) == 30.0

H1: meanA() != 30.0

If we fail to reject H0 we will take this hypothesis as ground truth. That’s what we need. If we don’t – then we should come up with a better hypothesis (mean(A) == 40).

For the hypotheses checking we can simply calculate the confidence interval for dataset A by sampling and calculating 95% confidence interval. If the interval does not contain 30.0 then your hypotheses H0 was rejected.

Obviously, this confident interval starts with 2.5% and ends 97.5% which gives us 95% of the items between this interval. In the sorted array of our observations we should find 2.5% and 97.5% percentiles: p1 and p2. If p1 <= 30.0 <= p2, then we weren’t able to reject H0. So, we can suppose that H0 is the truth.

Implementation of bootstrapping in this particular case is straight forward.

import scala.util.Sorting.quickSort

def getConfInterval(input: org.apache.spark.rdd.RDD[Double], N: Int, left: Double, right:Double)

: (Double, Double) = {

// Simulate by sampling and calculating averages for each of subsamples

val hist = Array.fill(N){0.0}

for (i <- 0 to N-1) {

hist(i) = input.sample(withReplacement = true, fraction = 1.0).mean

}

// Sort the averages and calculate quantiles

quickSort(hist)

val left_quantile = hist((N*left).toInt)

val right_quantile = hist((N*right).toInt)

return (left_quantile, right_quantile)

}

Because I did not find any good open datasets for the large scale hypotheses testing problem, let’s use skewdata.csv dataset from the book “Statistics: An Introduction Using R”. You can find this dataset in this archive. It is not perfect but will work in a pinch.

val dataWithHeader = sc.textFile("zipped/skewdata.csv")

val header = dataWithHeader.first

val data = dataWithHeader.filter( _ != header ).map( _.toDouble )

val (left_qt, right_qt) = getConfInterval(data, 1000, 0.025, 0.975)

val H0_mean = 30

if (left_qt < H0_mean && H0_mean < right_qt) {

println("We failed to reject H0. It seems like H0 is correct.")

} else {

println("We rejected H0")

}

**We have to understand the difference between “filed to reject H0” and “proof H0”.** A failing to reject a hypothesis gives you a pretty strong level of evidence that the hypothesis is correct and you can use this information in your decision making process but this is not an actual proof.

Another type of hypotheses – check if the means of the two datasets are different. This leads us to the usual design of experiment questions – if you apply some change in your web system (user interface change for example) would your click rate change in a positive direction?

Let’s create a hypothesis:

H0: mean(A) == mean(B)

H1: mean(A) > mean(B)

It is not easy to find H1 for this hypothesis which we can prove. Let’s change this hypothesis around a little bit:

Ho’: mean(A-B) == 0

H1: mean(A-B) > 0

Now we can try to reject H0′.

def getConfIntervalTwoMeans(input1: org.apache.spark.rdd.RDD[Double],

input2: org.apache.spark.rdd.RDD[Double],

N: Int, left: Double, right:Double)

: (Double, Double) = {

// Simulate average of differences

val hist = Array.fill(N){0.0}

for (i <- 0 to N-1) {

val mean1 = input1.sample(withReplacement = true, fraction = 1.0).mean

val mean2 = input2.sample(withReplacement = true, fraction = 1.0).mean

hist(i) = mean2 - mean1

}

// Sort the averages and calculate quantiles

quickSort(hist)

val left_quantile = hist((N*left).toInt)

val right_quantile = hist((N*right).toInt)

return (left_quantile, right_quantile)

}

We should change 2.5% and 97.5% percentiles in the interval to 5% percentile in the left side only because of one-side (one-tailed) hypothesis testing. And an actual code as an example:

// Let's try to check the same dataset with itself. Ha-ha.

val (left_qt, right_qt) = getConfIntervalTwoMeans(data, data, 1000, 0.05, 0.95)

// A condition was changed because of one-tailed test.

if (left_qt > 0) {

println("We failed to reject H0. It seems like H0 is correct."

} else {

println("We rejected H0")

}

Bootstrapping methods are very simple for understanding and implementation. They are intuitively simple and you don’t need any deep knowledge of statistics. Apache Spark can help you implement these methods in a large scale.

As I mentioned previously it is not easy to find a good open large dataset for hypotheses testing. **Please share with our community if you have one or come across one.**

My code is shared in this Scala file.

**About the Author**:

*Dmitry Petrov is a Data Scientist at Microsoft (Bellevue, Washington), working with the BingAds, Relevance and Revenue team. He earned his PhD in computer science from Saint Petersburg State Electrotechnical University. He is also a scientific author, who received awards and developed a patent.*

© 2020 Data Science Central ® Powered by

Badges | Report an Issue | Privacy Policy | Terms of Service

**New Books and Resources for DSC Members** - [See Full List]

**Most Popular Content on DSC**

To not miss this type of content in the future, subscribe to our newsletter.

- Book: Statistics -- New Foundations, Toolbox, and Machine Learning Recipes
- Book: Classification and Regression In a Weekend - With Python
- Book: Applied Stochastic Processes
- Long-range Correlations in Time Series: Modeling, Testing, Case Study
- How to Automatically Determine the Number of Clusters in your Data
- New Machine Learning Cheat Sheet | Old one
- Confidence Intervals Without Pain - With Resampling
- Advanced Machine Learning with Basic Excel
- New Perspectives on Statistical Distributions and Deep Learning
- Fascinating New Results in the Theory of Randomness
- Fast Combinatorial Feature Selection

**Other popular resources**

- Comprehensive Repository of Data Science and ML Resources
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- 100 Data Science Interview Questions and Answers
- Cheat Sheets | Curated Articles | Search | Jobs | Courses
- Post a Blog | Forum Questions | Books | Salaries | News

**Archives:** 2008-2014 |
2015-2016 |
2017-2019 |
Book 1 |
Book 2 |
More

**Most popular articles**

- Free Book and Resources for DSC Members
- New Perspectives on Statistical Distributions and Deep Learning
- Time series, Growth Modeling and Data Science Wizardy
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- Comprehensive Repository of Data Science and ML Resources
- Advanced Machine Learning with Basic Excel
- Difference between ML, Data Science, AI, Deep Learning, and Statistics
- Selected Business Analytics, Data Science and ML articles
- How to Automatically Determine the Number of Clusters in your Data
- Fascinating New Results in the Theory of Randomness
- Hire a Data Scientist | Search DSC | Find a Job
- Post a Blog | Forum Questions

## You need to be a member of Data Science Central to add comments!

Join Data Science Central