Hi, all.

In many studies, we as data scientists will not be able to work with all the data in the domain of our study, also called the population. Among the many reasons for this, ethical, economic, and geographic constraints stand out. To bridge this gap, we select a subset of the population through a well-defined process, an activity known as sampling.

Sampling can be defined as a technique, or a set of procedures, used to describe and select samples, whether randomly or not. When the process is done correctly, it determines the representativeness of the sample.

When we speak of representativeness, we mean that the sample used in our experiment should have the same characteristics as the observed population. Two aspects matter here: quantity and quality. According to the law of large numbers, the larger the sample size, the better. However, if the population has heterogeneous characteristics, the sample should be heterogeneous as well.

Here is an example. If you have a population of 1,000 people, of which 500 are female and 500 are male, and your sample is composed of 500 males, the female half (quite representative of the population) will never be observed. Your conclusions therefore cannot be generalized to the population, a process called inference. This example is a case of "garbage in, garbage out": nonsense input data produce nonsense output.

Understand: although we use samples in our experiments, we want to provide solutions for the complete population, not just for the sampled group. I use the example of blood analysis. When we perform a blood glucose test, only a sample of our blood is drawn, not all of it. Yet the result of the test applies to us (all our blood) and not just to that specific sample.

There are basically two types of sampling: probabilistic and non-probabilistic. In probabilistic sampling, each element of the population has a known, non-zero chance of being selected for the sample. In non-probabilistic sampling, the selection of population elements often depends on the researcher's judgment. We will discuss the probabilistic type here because it is the predominant approach in most data science courses and examples on the web.

Among the most common plans applied to the sampling process, three stand out: simple, systematic, and stratified.

In simple random sampling, a sample of size **n** is drawn at random from the **N** elements of the population, so that every individual **i** has the same probability of being selected.
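As a sketch, simple random sampling can be done with Python's standard library alone; the population, sample size, and seed below are illustrative:

```python
import random

population = list(range(1, 1001))  # N = 1000 elements
random.seed(42)                    # fix the seed for reproducibility

# draw n = 100 elements without replacement; each element has the
# same n/N = 0.1 probability of being selected
sample = random.sample(population, 100)
```

The same idea scales to rows of a dataset: shuffle indices and keep the first n.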

Systematic sampling is a probabilistic process in which only the first sample unit is chosen at random; the remaining units are then selected according to a rigid, pre-established systematization scheme. The purpose is to cover the entire population and obtain a simple, uniform systematic pattern.
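A minimal sketch of this scheme, assuming the common interval k = N // n and a random first unit (the function name and numbers are illustrative):

```python
import random

def systematic_sample(population, n):
    """Random start, then every k-th element, with k = N // n."""
    k = len(population) // n       # sampling interval
    start = random.randrange(k)    # random choice of the first unit
    return population[start::k][:n]

random.seed(0)
pop = list(range(1000))
s = systematic_sample(pop, 50)    # 50 units, evenly spaced across pop
```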

Finally, in stratified sampling, existing information about the population is used to make the sampling process more efficient. For example, if the population we wish to study consists of 800 women and 200 men, we want our sample to preserve that proportion between the classes.
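The proportional allocation behind this idea fits in a few lines; the stratum counts come from the example above, and the total sample size of 100 is illustrative:

```python
# proportional allocation: each stratum's share of the sample
# matches its share of the population
population = {"women": 800, "men": 200}
n = 100                           # desired total sample size
N = sum(population.values())      # population size
allocation = {g: round(n * c / N) for g, c in population.items()}
# allocation keeps the 80/20 ratio: 80 women, 20 men
```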

Even with the use of several techniques to obtain a sample that is representative of the desired population (besides those presented here), no sample represents the population perfectly. Using a sample implies accepting a margin of error, called the sampling error: the difference between a sample result and the true population result.

Most of the examples we find on the Internet teach us how to make a simple random selection, even when using very convenient tools for this process, such as the **train_test_split** function from **sklearn.model_selection**. For our experiment, we will use the Iris dataset.

The Iris dataset is one of the best known in the world of machine learning. It consists of three classes, four descriptor attributes, and 150 elements, 50 for each class. Note that Iris is a naturally balanced dataset. However, if we make a simple random selection, the samples used for training the model and for testing will not be balanced. Let's get to practice.

Figure 1 shows the libraries I used in the example. In addition to **sklearn**, I also imported numpy and the **itemfreq** function from **scipy.stats**. The **itemfreq** will be used to observe the distribution of sample elements.

**Figure 1: Loading libraries.**
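Since the figure itself is not reproduced here, the following is a likely equivalent of those imports. One caveat: `scipy.stats.itemfreq` was removed in SciPy 1.3, so `np.unique` with `return_counts=True`, which produces the same class/frequency table, is the modern drop-in:

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

# scipy.stats.itemfreq (used in the original figures) was removed in
# SciPy 1.3; np.unique(..., return_counts=True) yields the same
# class/frequency table on current installations.
```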

In Figure 2, I created a variable named **data** and assigned the Iris dataset to it. In addition, I separated the descriptors (the data that will be used by the machine learning model) from the labels (classes).

**Figure 2: Loading dataset.**
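A sketch of that loading step, with the variable names assumed from the text:

```python
from sklearn import datasets

data = datasets.load_iris()
X = data.data      # descriptors: 4 numeric attributes per flower
y = data.target    # labels: 3 classes, encoded as integers
print(X.shape, y.shape)  # (150, 4) (150,)
```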

The data selection process for training the model and then testing it is shown in Figure 3. If you have already taken a web course on data science or machine learning, this method will not be strange to you. The **train_test_split** function receives the descriptors and labels and separates them into training and test sets according to the parameter **test_size** (defined here as 30% for the test) and the parameter **random_state**, which is used as a seed for the random data separation.

**Figure 3: Splitting dataset.**
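A sketch of that split; the particular `random_state` value is illustrative, since any fixed integer reproduces the same split:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

data = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target,
    test_size=0.3,    # 30% of the 150 rows go to the test set
    random_state=11,  # illustrative seed for reproducibility
)
```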

The next step will be to see how the **train_test_split** performed the split of the samples for training and testing. Figure 4 shows the command executed for this operation.

**Figure 4: Showing frequency.**
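The frequency check from Figure 4 can be sketched as follows, using `np.unique` in place of the original `itemfreq`:

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

data = datasets.load_iris()
_, _, y_train, _ = train_test_split(
    data.data, data.target, test_size=0.3, random_state=11
)

# one row per class: class label in the first column,
# frequency in the training sample in the second
classes, counts = np.unique(y_train, return_counts=True)
print(np.column_stack((classes, counts)))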

The result is shown in Figure 5. The first column lists the classes, represented by the numbers 0, 1 and 2. The second column shows the frequency of the items in each one.

**Figure 5: Showing result.**

Here is the point I want to make. Do you remember that the Iris dataset has a balanced distribution (50 elements for each class)? With the parameters set in the **train_test_split** function, the training and test samples are not fully balanced, although approximately so.

To separate the training and test samples in a balanced way, we only need to pass the **stratify** parameter, set to the variable holding the label values, to the **train_test_split** function (Figure 6).

**Figure 6: Adding new parameter.**
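A sketch of that one-parameter change, with **stratify** pointing at the labels:

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

data = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target,
    test_size=0.3,
    random_state=11,         # illustrative seed
    stratify=data.target,    # keep each class's proportion in both splits
)
print(np.unique(y_train, return_counts=True)[1])  # -> [35 35 35]
print(np.unique(y_test, return_counts=True)[1])   # -> [15 15 15]
```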

Now execute the script again and notice what happened to your sample (Figure 7). The data were separated in a balanced way (35 elements for each class). You can also inspect the data that will be used in the test step; you will see that they are balanced too, naturally, right?

**Figure 7: Showing new result.**

But what happens if the dataset you will be using in your experiment is not balanced like the Iris dataset? Do not worry: stratification will preserve the dataset's class proportions in both the training and test samples, so that each sample represents the analyzed population well.

**Important**: It is very important for your experiment that the selected sample has the same characteristics as the population. This way, you will reduce the sampling error and be able to perform inference on the population.

Best regards.

**Reading suggestions**

Choosing a sampling method: http://changingminds.org/explanations/research/sampling/choosing_sa...

Law of large numbers: https://www.dartmouth.edu/~chance/teaching_aids/books_articles/prob...

Population Sampling Techniques: https://explorable.com/population-sampling

6 Sampling Techniques: How to Choose a Representative Subset of the Population: https://blog.socialcops.com/academy/resources/6-sampling-techniques...

© 2019 Data Science Central