
Basic Statistics Concepts Every Data Scientist Should Know

Introduction

Data science is a multidisciplinary blend of data inference, algorithm development, and technology used to solve analytically complex problems. At its core is data: troves of raw information streaming in and stored in enterprise data warehouses. There is much to learn by mining it, and advanced capabilities we can build with it. Data science is ultimately about using this data in creative ways to generate business value.

Understanding data science draws on the broader fields of mathematics, statistics, computer science, and information science. For a career as a data scientist, you need a strong background in statistics and mathematics, and big companies tend to prefer candidates with strong analytical and statistical skills.

In this blog, we will look at the basic statistical concepts that every data scientist must know. Let's understand them one by one in the sections that follow.


Role of Statistics in Data Science

Before diving into the five most important statistical concepts, let us first understand the importance of statistics in data science!

The role of statistics in data science is as important as that of computer science. This holds in particular for data acquisition and enrichment, as well as for the advanced modelling needed for prediction.

Only by complementing and combining mathematical methods and computational algorithms with statistical reasoning, particularly for Big Data, will we arrive at scientifically sound results. Ultimately, only a balanced interplay of all the disciplines involved leads to successful solutions in data science.


Important Concepts in Data Science


1. Probability Distributions

A probability distribution is a function that describes the likelihood of obtaining the possible values that a random variable can assume. In other words, the values of the variable vary based on the underlying probability distribution.

Suppose you draw a random sample and measure the heights of the subjects. As you measure heights, you can create a distribution of heights. This type of distribution is useful when you need to know which outcomes are most likely, the spread of potential values, and the likelihood of different results.
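To see what this looks like in practice, here is a minimal sketch in Python; the normal shape of the distribution and the mean and standard deviation of the heights are assumptions made purely for illustration.

```python
import numpy as np

# Minimal sketch: simulate a sample of heights (in cm) from an assumed
# normal distribution; the mean and standard deviation are illustrative.
rng = np.random.default_rng(seed=42)
heights = rng.normal(loc=170, scale=8, size=1000)

# Summarise the empirical distribution of the sample.
print(f"mean: {heights.mean():.1f} cm")
print(f"std:  {heights.std(ddof=1):.1f} cm")
print(f"min/max: {heights.min():.1f} / {heights.max():.1f} cm")

# Estimate how likely different outcomes are, e.g. the share of
# subjects taller than 185 cm.
print(f"P(height > 185 cm) ~ {(heights > 185).mean():.3f}")
```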


2. Dimensionality Reduction

In machine learning classification problems, there are often too many factors on the basis of which the final classification is made. These factors are essentially variables or features. The higher the number of features, the harder it becomes to visualize the training set and work with it. Moreover, many of these features are often correlated with one another, and hence redundant. This is where dimensionality reduction algorithms come into play. Dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It consists of feature selection and feature extraction.

An intuitive example of dimensionality reduction is a simple e-mail classification problem, where we need to classify whether an e-mail is spam or not. This can involve a large number of features, such as whether the e-mail has a generic title, the content of the e-mail, whether the e-mail uses a template, and so on; however, some of these features may overlap. Similarly, in a classification problem that relies on both humidity and rainfall, we can merge the two into a single underlying feature, since they are highly correlated. In this way we can reduce the number of features in such problems. A 3-D classification problem can be hard to visualize, whereas a 2-D problem can be mapped to a simple two-dimensional plane and a 1-D problem to a simple line.
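To make this concrete, here is a minimal sketch using scikit-learn's PCA on two synthetically generated, highly correlated features standing in for humidity and rainfall; all numbers are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Minimal sketch: two highly correlated features (stand-ins for humidity
# and rainfall, generated synthetically) reduced to one principal component.
rng = np.random.default_rng(seed=0)
humidity = rng.normal(60, 10, size=200)
rainfall = 0.8 * humidity + rng.normal(0, 2, size=200)  # strongly correlated
X = np.column_stack([humidity, rainfall])

pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)                 # (200, 2) -> (200, 1)
print("explained variance ratio:", pca.explained_variance_ratio_)
```

Because the two features move together, the single principal component retains almost all of the variance, which is exactly why they can be collapsed into one.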


3. Over and Under-Sampling

Oversampling and undersampling are techniques in data mining and data analytics for adjusting unequal class distributions to create more balanced data sets. Collectively, oversampling and undersampling are also known as resampling.

When one class is under-represented in the data sample (the minority class), oversampling techniques may be useful to duplicate these observations and obtain a more balanced number of positive examples for training. Oversampling is important when the data at hand is insufficient. A popular oversampling technique is SMOTE (Synthetic Minority Over-sampling Technique), which creates synthetic samples by randomly sampling the characteristics of instances in the minority class.

Conversely, if a class is over-represented (the majority class), undersampling may be useful to balance it with the minority class. Undersampling is appropriate when the data at hand is sufficient. Common undersampling methods include cluster centroids and Tomek links, both of which target potentially overlapping characteristics within the collected data to reduce the amount of majority-class data.

In both oversampling and undersampling, simple data duplication is rarely useful on its own. Generally, oversampling is preferable, as undersampling can result in the loss of important data. Undersampling is suggested when the amount of data collected is larger than ideal, and it can help data mining tools stay within the limits of what they can effectively process.
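Both strategies are available in the imbalanced-learn library; the sketch below assumes a synthetic data set with roughly a 90/10 class split and simply prints the class counts before and after resampling.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Synthetic, imbalanced data set: roughly 90% majority / 10% minority class.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("original:           ", Counter(y))

# Oversampling: SMOTE synthesises new minority-class samples.
X_over, y_over = SMOTE(random_state=0).fit_resample(X, y)
print("after SMOTE:        ", Counter(y_over))

# Undersampling: randomly drop majority-class samples instead.
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("after undersampling:", Counter(y_under))
```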


4. Bayesian Statistics

Bayesian statistics is a particular approach to applying probability to statistical problems. It provides us with mathematical tools to update our beliefs about random events in light of new data or evidence about those events.

In particular, Bayesian inference interprets probability as a measure of the believability or confidence that an individual may have in the occurrence of a particular event.

We may have a prior belief about an event, but our beliefs are likely to change when evidence is brought to light. Bayesian statistics gives us a mathematical means of incorporating our prior beliefs, and evidence, to produce new posterior beliefs.
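Concretely, this update is carried out with Bayes' theorem, which combines the prior belief with the likelihood of the observed evidence to produce the posterior belief:

```latex
P(\text{belief} \mid \text{evidence}) =
  \frac{P(\text{evidence} \mid \text{belief})\, P(\text{belief})}{P(\text{evidence})}
```

Here P(belief) is the prior, P(evidence | belief) is the likelihood, and P(belief | evidence) is the posterior.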

Bayesian statistics provides us with mathematical tools to rationally update our beliefs in light of new data or evidence.

This is in contrast to another form of statistical inference, known as classical or frequentist statistics, which assumes that probabilities are the frequencies of particular random events occurring in the long run of repeated trials.

For example, as we roll a fair (i.e. unweighted) six-sided die repeatedly, we would see that each number on the die tends to come up 1/6 of the time.
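A tiny simulation makes this long-run frequency idea concrete; the seed and the numbers of rolls below are arbitrary.

```python
import numpy as np

# Sketch of the frequentist view: roll a fair six-sided die many times and
# watch the relative frequency of one face approach 1/6.
rng = np.random.default_rng(seed=1)
for n_rolls in (100, 10_000, 1_000_000):
    rolls = rng.integers(1, 7, size=n_rolls)   # faces 1..6, equally likely
    freq = (rolls == 6).mean()
    print(f"{n_rolls:>9} rolls: P(six) ~ {freq:.4f} (theory: {1/6:.4f})")
```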

Frequentist statistics assumes that probabilities are the long-run frequency of random events in repeated trials.

When carrying out statistical inference, that is, inferring statistical information from probabilistic systems, the two approaches — frequentist and Bayesian — have very different philosophies.

Frequentist statistics tries to eliminate uncertainty by providing estimates. Bayesian statistics tries to preserve and refine uncertainty by adjusting individual beliefs in light of new evidence.
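For a deliberately simple contrast, the sketch below performs a Bayesian update of the probability that a coin lands heads, assuming a Beta prior and made-up flip counts, and prints the frequentist point estimate alongside the posterior.

```python
from scipy import stats

# Minimal sketch: Beta-Binomial update for the probability p that a coin
# lands heads. The prior and the observed counts are illustrative assumptions.
prior_alpha, prior_beta = 2, 2      # weak prior belief that p is near 0.5
heads, tails = 7, 3                 # observed evidence: 7 heads in 10 flips

# The Beta prior is conjugate to the binomial likelihood, so the posterior
# is again a Beta distribution with the observed counts added in.
posterior = stats.beta(prior_alpha + heads, prior_beta + tails)

print(f"posterior mean of p:   {posterior.mean():.3f}")
print(f"95% credible interval: {posterior.interval(0.95)}")

# Frequentist point estimate for comparison: the observed relative frequency.
print(f"frequentist estimate:  {heads / (heads + tails):.3f}")
```

Note how the posterior keeps the uncertainty about p (a full distribution and an interval), whereas the frequentist summary here is a single point estimate.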


5. Descriptive Statistics

Descriptive statistics is the most common form of all. In business, it provides the analyst with a view of key metrics and measures within the business. Descriptive statistics includes exploratory data analysis, unsupervised learning, clustering, and basic data summaries. It has many uses, most notably helping us get familiar with a data set, and it is the starting point for any analysis. Often, descriptive statistics helps us arrive at hypotheses to be tested later with more formal inference.

Descriptive statistics are very important because if we simply presented our raw data, it would be hard to see what the data was showing, especially if there was a lot of it. Descriptive statistics therefore enable us to present the data in a more meaningful way, which allows simpler interpretation. For example, if we had the marks of 1,000 students on a particular exam such as the SAT, we might be interested in the overall performance of those students as well as in the distribution or spread of the marks. Descriptive statistics allow us to do this.
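In Python, much of this is a single call to pandas' describe(); since we do not have the real marks, the sketch below uses 1,000 synthetic SAT-style scores.

```python
import numpy as np
import pandas as pd

# Minimal sketch: descriptive statistics for 1000 simulated exam marks
# (scores are synthetic and clipped to the SAT's 400-1600 range).
rng = np.random.default_rng(seed=7)
marks = np.clip(rng.normal(1050, 150, size=1000), 400, 1600).round()

scores = pd.Series(marks, name="sat_score")
print(scores.describe())          # count, mean, std, min, quartiles, max
print("median:", scores.median())
```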

Consider another example: a data analyst may have data on a large population of customers. Understanding demographic information about those customers (e.g. "20% of our customers are self-employed") would be categorized as descriptive analytics, and using effective visualization tools enhances its message.

Summary

We have looked at some important statistical concepts in data science. Statistics is one of the key components of data science. There is a great deal of overlap between the fields of statistics and data science, to the point where many definitions of one discipline could just as easily describe the other. However, in practice, the fields differ in a number of key ways. Statistics is a mathematically based field which seeks to collect and interpret quantitative data. In contrast, data science is a multidisciplinary field which uses scientific methods, processes, and systems to extract knowledge from data in a range of forms. Data scientists use methods from many disciplines, including statistics, but the fields differ in their processes, the types of problems studied, and several other factors.

If you want to read more about data science, check out our Data Science Blogs.