**Exploratory Data Analysis (EDA)** is an approach to data analysis that uses a variety of graphical and quantitative techniques to better understand data. It is easy to get lost in the visualizations and lose track of the purpose of EDA, which is to make the downstream analysis easier. To put EDA in context, the data science steps are: obtain data; clean and load data; exploratory data analysis; model building; model evaluation; data visualization and presentation.

The objectives of EDA are to discover underlying patterns, spot anomalies, frame hypotheses and check assumptions, with the aim of finding a well-fitting model (if one exists). At a more granular level, EDA involves understanding the relationships between variables: determining relationships among the explanatory variables; assessing the relationships between explanatory and outcome variables (direction and rough size estimates); detecting outliers; ranking the important explanatory variables; and drawing conclusions as to whether individual explanatory variables are statistically significant.

In this post, we present a systematic approach to EDA (based on the source listed below) that covers EDA techniques concisely.

EDA techniques are either **graphical or quantitative**. Each of these is, in turn, either univariate or multivariate. Quantitative (non-graphical) methods normally involve calculating summary statistics. Graphical methods summarize the data in a diagrammatic or visual way. Univariate methods look at one variable (data column) at a time, while multivariate methods look at two or more variables at a time to explore relationships. In practice, multivariate EDA is usually bivariate (looking at exactly two variables). Thus, the four types of EDA techniques are: univariate non-graphical; univariate graphical; multivariate non-graphical; multivariate graphical. Non-graphical and graphical methods complement each other. Graphical methods are more qualitative and involve a degree of subjective judgment, whereas quantitative methods are more objective.

If we are focusing on data from observations of a single variable on n subjects, i.e. a sample of size n, we also need to look graphically at the **distribution of the sample**. It is common to assume that the data are approximately normally distributed. With a large enough sample, the central limit theorem also makes the distribution of the sample mean approximately normal, even when the data themselves are not. There are exceptions: for example, the distribution could evolve over time or be unknown, so the normality assumption should be checked against the data rather than taken for granted.
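One way to read the large-sample remark above is through the central limit theorem. The following is a minimal sketch, assuming a hypothetical right-skewed population (exponential with mean 1) and an arbitrary sample size of 50; the function names are ours. It compares the skewness of individual observations with the skewness of sample means.

```python
import random
import statistics

def draw_exponential(rng):
    """One observation from a right-skewed population (exponential, mean 1)."""
    return rng.expovariate(1.0)

def sample_means(n, repeats, seed=0):
    """Means of `repeats` independent samples, each of size `n`."""
    rng = random.Random(seed)
    return [
        statistics.fmean([draw_exponential(rng) for _ in range(n)])
        for _ in range(repeats)
    ]

def skewness(values):
    """Moment-based skewness; close to 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])

raw = sample_means(n=1, repeats=5000)     # samples of size 1 = raw observations
means = sample_means(n=50, repeats=5000)  # means of samples of size 50

print(f"skewness of raw observations: {skewness(raw):.2f}")
print(f"skewness of sample means:     {skewness(means):.2f}")
```

The raw observations stay strongly skewed, while the sample means are much closer to symmetric. The data do not become normal as the sample grows; only the distribution of the mean does.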

Univariate non-graphical EDA techniques are concerned with understanding the underlying sample distribution and making observations about the population. This also involves **outlier detection**. For **univariate categorical data**, we are interested in the range of values (the levels) and the frequency of each. **Univariate EDA for quantitative data** involves making preliminary assessments about the population distribution of the variable using the data from the observed sample. The characteristics of the population distribution inferred include **center, spread, modality, shape and outliers**. Measures of **central tendency** include the mean, median and mode. The most common measure of central tendency is the mean. For skewed distributions, or when there is concern about outliers, the median may be preferred. **Measures of spread** include variance, standard deviation and interquartile range. Spread is an indicator of how far away from the center we are still likely to find data values. Univariate EDA also involves finding the skewness (a measure of asymmetry) and kurtosis (a measure of peakedness relative to a Gaussian shape).
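The summary statistics above can be computed with Python's standard `statistics` module. This is a minimal sketch: the `describe` helper and the sample values are hypothetical, and skewness and kurtosis use the simple moment-based definitions (other conventions exist).

```python
import statistics

def describe(values):
    """Univariate non-graphical summary of a quantitative sample."""
    mean = statistics.fmean(values)
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
    # Standardize with the population standard deviation for the moments.
    pop_sd = statistics.pstdev(values)
    z = [(v - mean) / pop_sd for v in values]
    return {
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "std": statistics.stdev(values),
        "iqr": q3 - q1,
        "skewness": statistics.fmean([x**3 for x in z]),
        # Excess kurtosis: 0 for a Gaussian shape.
        "excess_kurtosis": statistics.fmean([x**4 for x in z]) - 3,
    }

# Hypothetical sample with one large value.
sample = [2, 3, 3, 4, 5, 5, 5, 6, 7, 30]
summary = describe(sample)
for name, value in summary.items():
    print(f"{name:>16}: {value:.2f}")
```

In this sample the single large value pulls the mean (7.0) above the median (5.0) and gives a positive skewness, which is the situation where the median may be the preferred measure of center.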

For graphical analysis of univariate quantitative data, **histograms** are typically used. The histogram represents the frequency (count) or proportion (count/total count) of cases for each range of values (bin). Typically, between about 5 and 30 bins are chosen. Histograms are one of the best ways to quickly learn a lot about your data, including central tendency, spread, modality, shape and outliers. **Stem-and-leaf** plots can be used for the same purpose. **Boxplots** present information about central tendency, symmetry and skew, as well as outliers. Quantile-normal (QQ) plots and other techniques can also be used here.
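To make the numbers behind these plots concrete, here is a minimal sketch that computes histogram bin counts and flags the points a boxplot would draw as outliers. It uses only the standard library and prints a text histogram instead of drawing a figure. The helper names, the sample and the choice of 7 equal-width bins are assumptions for illustration, and the 1.5 × IQR fence is the common boxplot convention rather than something the post specifies.

```python
import statistics

def histogram_counts(values, bins):
    """Count values in `bins` equal-width intervals spanning [min, max]."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("all values are identical; cannot form bins")
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum sits on the right edge, so clamp it into the last bin.
        counts[min(int((v - lo) / width), bins - 1)] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts

def boxplot_outliers(values):
    """Values more than 1.5 * IQR beyond the first or third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    fence = 1.5 * (q3 - q1)
    return [v for v in values if v < q1 - fence or v > q3 + fence]

# Hypothetical sample with one large value.
sample = [2, 3, 3, 4, 5, 5, 5, 6, 7, 30]
edges, counts = histogram_counts(sample, bins=7)
for left, right, count in zip(edges, edges[1:], counts):
    print(f"[{left:5.1f}, {right:5.1f}) {'#' * count}")

outliers = boxplot_outliers(sample)
print("boxplot outliers:", outliers)
```

Most of the mass falls in the first two bins, followed by a run of empty bins and one isolated value, which is the same point the boxplot rule flags.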

**Multivariate non-graphical EDA** techniques generally show the relationship between two or more variables in the form of either cross-tabulation or statistics. For each combination of one categorical variable (usually explanatory) and one quantitative variable (usually outcome), we can compute a statistic for the quantitative variable separately for each level of the categorical variable, and then compare the statistics across levels. Comparing the means is an informal version of one-way ANOVA. Comparing the medians is a robust informal version of one-way ANOVA (adapted from the source listed below). For two quantitative variables, we can calculate the covariance and/or correlation. When we have many quantitative variables, we typically calculate the pairwise covariances and/or correlations and assemble them into a matrix.
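The two kinds of statistics described above can be sketched in a few lines of standard-library Python. The helper names and the toy data are hypothetical; the covariance uses the sample (n - 1) definition and the correlation is Pearson's.

```python
import statistics
from collections import defaultdict

def stat_by_level(levels, values, stat=statistics.fmean):
    """Apply `stat` to the quantitative values within each categorical level."""
    groups = defaultdict(list)
    for level, value in zip(levels, values):
        groups[level].append(value)
    return {level: stat(vals) for level, vals in groups.items()}

def covariance(x, y):
    """Sample covariance of two equally long quantitative variables."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(x, y) / (statistics.stdev(x) * statistics.stdev(y))

def correlation_matrix(columns):
    """Pairwise correlations for a dict of named quantitative columns."""
    names = list(columns)
    return {a: {b: correlation(columns[a], columns[b]) for b in names}
            for a in names}

# Hypothetical data: one categorical and three quantitative variables.
group = ["a", "a", "b", "b", "b"]
outcome = [1.0, 3.0, 10.0, 12.0, 14.0]
means = stat_by_level(group, outcome)                       # informal ANOVA
medians = stat_by_level(group, outcome, statistics.median)  # robust version
print("means:", means, "medians:", medians)

columns = {"x": [1, 2, 3, 4, 5], "y": [2, 4, 6, 8, 10], "z": [5, 4, 3, 2, 1]}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

Here `y` is an exact multiple of `x` and `z` decreases as `x` increases, so the matrix contains correlations of 1 and -1. Real data will rarely be this clean.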

**For multivariate categorical data**, the most commonly used graphical technique is the **barplot**, with each group representing one level of one of the variables and each bar within a group representing a level of the other variable. For one categorical and one quantitative variable, side-by-side (parallel) boxplots show the distribution of the quantitative variable at each level of the categorical one. For two **quantitative variables**, the basic graphical EDA technique is **the scatterplot**, which has one variable on the x-axis, one on the y-axis and a point for each case in your dataset. Typically, the explanatory variable goes on the x-axis. Additional categorical variables can be accommodated by the use of colour or symbols.
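The bar heights of a grouped barplot are simply the cells of a cross-tabulation of the two categorical variables. The sketch below computes that table with the standard library and prints it as text bars rather than drawing a figure. The `crosstab` helper and the two toy variables are hypothetical.

```python
from collections import Counter

def crosstab(group_var, bar_var):
    """Counts for every combination of levels of two categorical variables."""
    counts = Counter(zip(group_var, bar_var))
    group_levels = sorted(set(group_var))
    bar_levels = sorted(set(bar_var))
    # Counter returns 0 for combinations that never occur.
    return {g: {b: counts[(g, b)] for b in bar_levels} for g in group_levels}

# Hypothetical data: two categorical variables observed on six cases.
sex = ["f", "f", "m", "m", "m", "f"]
smoker = ["yes", "no", "no", "yes", "no", "no"]

table = crosstab(sex, smoker)
for group, bars in table.items():   # one group of bars per level of `sex`
    print(f"sex = {group}")
    for level, count in bars.items():  # one bar per level of `smoker`
        print(f"  smoker = {level:<3} {'#' * count} ({count})")
```

Passing the same table to a plotting library would give the grouped barplot described above, with one group per level of the first variable and one bar per level of the second.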

EDA is a complex and subjective approach. In this post, we have outlined a set of steps for applying EDA techniques so that they provide inputs to the subsequent stages.

Source: Chapter 4, "Exploratory Data Analysis", by Howard Seltman
