Hi, all.

In many studies, we as data scientists will not be able to work with all the data in the domain of our study - also called a population. Many factors are responsible for this, notably ethical, economic, and geographic ones. To bridge this gap, we select a subset of the population through a well-defined selection process, an activity known as sampling.

Sampling can be defined as a technique, or a set of procedures and activities, used to describe and select samples, whether randomly or not. When the process is done correctly, it determines the representativeness of the sample.

When we speak of representativeness, we mean that the sample used in our experiment should have the same characteristics as the observed population. Two characteristics matter here: quantity and quality. According to the law of large numbers, the larger the sample, the better. However, if the population has heterogeneous characteristics, the sample should be heterogeneous as well.

Of course! If you have a population of 1,000 people, of which 500 are female and 500 are male, and your sample is composed of 500 males, the female half (quite representative) will not be observed at all. Your conclusions therefore cannot be generalized to the population - a process called inference. This example is a case of "garbage in, garbage out": nonsense input data produce nonsense output.

Understand: although we use samples in our experiments, we want to provide solutions for the complete population, not just for the sampled group. I like the example of blood analysis. When we perform a blood glucose test, only one sample of our blood is taken, not all of it. However, the result of the test applies to us (all our blood), not just to that specific sample.

There are basically two types of sampling: probabilistic and non-probabilistic. In probabilistic sampling, each element of the population has a known, non-zero chance of being selected to compose the sample. In non-probabilistic sampling, the selection of population elements often depends on the researcher's judgment. We will discuss the probabilistic type here, because it is the predominant approach in most of the data science courses and examples we find on the web.

Among the most common plans applied to the sampling process, three stand out: simple, systematic, and stratified.

In simple random sampling, a sample of size **n** is randomly selected from among the **N** elements of the population, such that every individual **i** has the same probability of being selected as any other individual in the population.
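As a minimal sketch of this idea (the population of identifiers and the seed are illustrative assumptions, not from the article), simple random sampling can be done with NumPy alone:

```python
import numpy as np

rng = np.random.default_rng(42)  # seed only for reproducibility

N = 1000                   # population size
n = 30                     # sample size
population = np.arange(N)  # the N elements of the population

# Each element has the same probability (n/N) of being chosen;
# replace=False ensures no element is selected twice.
sample = rng.choice(population, size=n, replace=False)
```

Every element of `population` had the same chance of ending up in `sample`, which is exactly the definition above.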

Systematic sampling is a probabilistic process in which only the first sample unit is chosen at random. The remaining units are then selected according to a rigid, pre-established systematization scheme, with the purpose of covering the entire population and obtaining a simple, uniform systematic sample.
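A sketch of that scheme (population size, sample size, and seed are illustrative assumptions): the only random step is the starting point; every subsequent unit follows the fixed interval.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 1000
n = 50
k = N // n                       # sampling interval: every k-th element
start = rng.integers(0, k)       # the single random choice in the process
sample = np.arange(start, N, k)  # then take every k-th unit after it
```

Because the interval `k` sweeps from `start` to the end of the population, the sample covers the whole population uniformly.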

Finally, in stratified sampling, existing information about the population is used to make the sampling process more efficient. Returning to the earlier example: if the population we wish to study consists of 800 women and 200 men, we want our sample to preserve the proportion between the classes.
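A minimal sketch of that 800/200 scenario (the identifiers, sample size, and seed are assumptions for illustration): each stratum is sampled separately, in proportion to its share of the population.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical population from the text: 800 women and 200 men,
# encoded here as identifier ranges for simplicity.
women = np.arange(800)
men = np.arange(800, 1000)

n = 100  # desired total sample size

# Sample each stratum in proportion to its population share (80% / 20%)
sample_women = rng.choice(women, size=int(n * 0.8), replace=False)
sample_men = rng.choice(men, size=int(n * 0.2), replace=False)
sample = np.concatenate([sample_women, sample_men])
```

The resulting sample keeps the 4:1 ratio of the population, which a simple random draw would only approximate.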

Even with the use of several techniques (besides those presented here) to obtain a sample that is representative of the desired population, no sample represents the population perfectly. Using a sample implies accepting a margin of error, called the sampling error: the difference between a sample result and the true population result.

Most of the examples we find on the Internet teach us how to make a simple sample selection, even when using very convenient tools for this process, such as the **train_test_split** function from **sklearn.model_selection**. For our experiment, we will use the Iris dataset.

The Iris dataset is one of the most well-known in the world of machine learning. It is composed of three classes and four descriptor attributes, and has 150 elements, 50 for each class. Note that the Iris dataset is naturally balanced. However, if we make a simple sample selection, the samples used for training the model and for testing will not be balanced. Let's go to practice.

Figure 1 shows the libraries I used in the example. In addition to **sklearn**, I also imported **numpy** and the **itemfreq** function from **scipy.stats**. The **itemfreq** function will be used to observe the distribution of sample elements.

**Figure 1: Loading libraries.**
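Since the figure was a screenshot, here is a sketch of what it likely contained. One caveat: `scipy.stats.itemfreq` has since been removed from SciPy, so the equivalent `np.unique(..., return_counts=True)` is noted as a substitute.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
# The original article used scipy.stats.itemfreq to count class
# frequencies; it was removed from SciPy, and
# np.unique(..., return_counts=True) gives the same information.
```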

In Figure 2, I created a variable named **data** and assigned it the Iris dataset. In addition, I separated the descriptors (the data that will be used for machine learning) from the labels (classes).

**Figure 2: Loading dataset.**
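A sketch of that loading step, assuming the dataset is taken from scikit-learn's built-in copy (the variable names are illustrative):

```python
from sklearn import datasets

# Load the Iris dataset: 150 samples, 3 classes, 4 descriptor attributes
data = datasets.load_iris()
X = data.data    # descriptors (sepal/petal measurements)
y = data.target  # labels: the three classes, encoded 0, 1, 2
```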

The data selection process for training the model and then testing it is shown in Figure 3. If you have already taken a web course on data science or machine learning, this method will not be strange to you. **train_test_split** receives the descriptors and labels and generates (separates) the data for training and for testing according to the parameter **test_size** (set to 30% for the test) and the parameter **random_state**, which is used as a seed for the random data separation process.

**Figure 3: Splitting dataset.**
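A sketch of the split in Figure 3 (the `random_state` value is an assumption; the article does not say which seed was used):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# Hold out 30% of the data for testing; random_state seeds the
# shuffling so the split is reproducible across runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=50)
```

With 150 elements and `test_size=0.3`, this yields 105 training and 45 test samples.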

The next step will be to see how the **train_test_split** performed the split of the samples for training and testing. Figure 4 shows the command executed for this operation.

**Figure 4: Showing frequency.**
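A sketch of the frequency check in Figure 4, using `np.unique(..., return_counts=True)` in place of the now-removed `scipy.stats.itemfreq` (the seed is an assumption, and `load_iris` encodes the classes as 0, 1, 2):

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=50)

# Frequency of each class in the training sample: one row per class,
# first column the class, second column its count
classes, counts = np.unique(y_train, return_counts=True)
print(np.column_stack([classes, counts]))
```

The three counts sum to 105, but without stratification they will generally differ from one another.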

The result is shown in Figure 5. The first column holds the classes, represented by the numbers 1, 2 and 3; the second column, the frequency of the items in each one.

**Figure 5: Showing result.**

What I would like to show you is just that. Do you remember that the Iris dataset has a balanced distribution (50 elements for each class)? With the parameters set in the **train_test_split** function, the training and test samples are not fully balanced - although they are approximately so.

To separate the samples for training and testing in a balanced way, we only need to pass the **stratify** parameter, set to the variable holding the label values, to the **train_test_split** function (Figure 6).

**Figure 6: Adding new parameter.**
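A sketch of the call with the new parameter (again, the seed is an assumption): the only change from the earlier split is `stratify=y`.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# stratify=y forces the split to preserve the class proportions
# of the labels in both the training and the test partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=50, stratify=y)

classes, counts = np.unique(y_train, return_counts=True)
print(counts)  # 35 elements per class in the 105-element training sample
```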

Now execute the script again and notice what happened to your sample (Figure 7). The data were separated in a balanced way (35 elements for each class). You can also inspect the data that will be used in the test step; you will see they are balanced too - naturally, since the same proportion is preserved.

**Figure 7: Showing new result.**

But what happens if the dataset you will be using in your experiment is not balanced like the Iris dataset? Do not worry. Stratification will preserve the class proportions in the samples, so that they represent the analyzed population well.

**Important**: it is essential for your experiment that the selected sample has the same characteristics as the population. This way, you will reduce the sampling error and be able to perform inference about the population.

Best regards.

**Reading suggestions**

Choosing a sampling method: http://changingminds.org/explanations/research/sampling/choosing_sa...

Law of large numbers: https://www.dartmouth.edu/~chance/teaching_aids/books_articles/prob...

Population Sampling Techniques: https://explorable.com/population-sampling

6 Sampling Techniques: How to Choose a Representative Subset of the Population: https://blog.socialcops.com/academy/resources/6-sampling-techniques...

© 2020 Data Science Central ® Powered by
