Exploratory Data Analysis (EDA) is an approach, or philosophy, of data analysis that employs a variety of techniques (graphical and quantitative) to better understand data. It is easy to get lost in the visualizations and lose track of the purpose of EDA, which is to make the downstream analysis easier. To put EDA in context, the data science steps are: obtain data; clean and load data; exploratory data analysis; model building; model evaluation; data visualization and presentation.
The objectives of EDA are to discover underlying patterns, spot anomalies, frame hypotheses and check assumptions, with the aim of finding a good-fitting model (if one exists). At a more granular level, EDA involves understanding the relationships between variables: determining relationships among the explanatory variables; assessing the relationships between explanatory and outcome variables (direction and rough size estimates); detecting outliers; ranking the explanatory variables by importance; and drawing preliminary conclusions as to whether individual explanatory variables are statistically significant.
In this post, we present a systematic approach to EDA (based on the sources listed below) that covers the main EDA techniques concisely.
EDA techniques are either graphical or quantitative (non-graphical). Each of these is, in turn, either univariate or multivariate. Quantitative methods normally involve the calculation of summary statistics, while graphical methods summarize the data in a diagrammatic or visual way. Univariate methods look at one variable (data column) at a time, while multivariate methods look at two or more variables at a time to explore relationships. In practice, multivariate EDA is usually bivariate (looking at exactly two variables). Thus, the four types of EDA techniques are: univariate non-graphical; univariate graphical; multivariate non-graphical; multivariate graphical. Non-graphical and graphical methods complement each other: graphical methods are more qualitative and involve a degree of subjective judgment, whereas quantitative methods are more objective.
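If we are focusing on data from observation of a single variable on n subjects, i.e. a sample of size n, we also need to look graphically at the distribution of the sample. Many downstream methods assume that the data are approximately normally distributed. Note that a large sample does not by itself make the data normal; rather, by the central limit theorem, the sampling distribution of the mean approaches a normal distribution as n grows. A more detailed explanation is HERE. There are exceptions to this idea (for example, distributions could evolve over time, or the distribution could be unknown), but in many practical cases the normality assumption is a reasonable starting point that EDA should check.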
Univariate non-graphical EDA techniques are concerned with understanding the distribution of the sample and using it to make observations about the population. This also involves outlier detection. For univariate categorical data, we are interested in the range of values and the frequency (or proportion) of each value. Univariate EDA for quantitative data involves making preliminary assessments about the population distribution of the variable using the data from the observed sample. The characteristics of the population distribution inferred include center, spread, modality, shape and outliers. Measures of central tendency include the mean, median and mode. The most common measure of central tendency is the mean; for skewed distributions, or when there is concern about outliers, the median may be preferred. Measures of spread include the variance, standard deviation and interquartile range. Spread is an indicator of how far away from the center we are still likely to find data values. Univariate EDA also involves finding the skewness (a measure of asymmetry) and the kurtosis (a measure of peakedness, and of tail weight, relative to a Gaussian shape).
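As a rough illustration of the univariate non-graphical summaries above, here is a minimal sketch using only the Python standard library. The function names (describe, frequency_table) and the sample values are made up for illustration and are not from the sources; skewness and kurtosis are computed with the simple moment-based formulas, which is one of several conventions.

```python
import statistics
from collections import Counter


def describe(sample):
    """Return basic summary statistics for a list of numbers (illustrative helper)."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # Quartiles using the statistics module's default ("exclusive") method
    q1, _, q3 = statistics.quantiles(sample, n=4)
    # Central moments (divided by n) for moment-based skewness and kurtosis
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    return {
        "mean": mean,
        "median": statistics.median(sample),
        "mode": statistics.mode(sample),
        "variance": statistics.variance(sample),  # sample variance (n - 1 denominator)
        "std_dev": statistics.stdev(sample),
        "iqr": q3 - q1,
        "skewness": m3 / m2 ** 1.5,             # > 0 means a longer right tail
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,  # 0 for a Gaussian shape
    }


def frequency_table(values):
    """For categorical data: map each level to (count, proportion)."""
    counts = Counter(values)
    total = len(values)
    return {level: (count, count / total) for level, count in counts.items()}


# Hypothetical sample with one large value, to show mean vs. median
sample = [2, 3, 3, 4, 5, 5, 5, 6, 7, 30]
summary = describe(sample)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

In this made-up sample the single large value pulls the mean (7.0) above the median (5), which is the situation in which the median may be the preferred measure of center.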
For graphical analysis of univariate quantitative data, histograms are typically used. The histogram represents the frequency (count) or proportion (count/total count) of cases for each of a set of value ranges (bins). Typically, between about 5 and 30 bins are chosen. Histograms are one of the best ways to quickly learn a lot about your data, including central tendency, spread, modality, shape and outliers. Stem-and-leaf plots can be used for the same purpose. Boxplots also present information about central tendency, symmetry and skew, as well as outliers. Quantile-normal plots (QQ plots) and other techniques can be used here as well, in particular to check how close the data are to a normal distribution.
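To make the mechanics behind these plots concrete, the sketch below computes what a histogram and a boxplot actually display: counts per equal-width bin, and the points a boxplot would flag as outliers. It uses only the standard library; the helper names and the sample are hypothetical, and the 1.5 x IQR fence is the common boxplot convention rather than something prescribed by the sources. A plotting library would normally do this work and draw the result.

```python
import statistics


def histogram_counts(sample, bins):
    """Count observations in `bins` equal-width intervals spanning the sample range.

    Assumes the sample contains at least two distinct values.
    """
    lo, hi = min(sample), max(sample)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in sample:
        # The maximum value is placed in the last bin instead of a new one
        index = min(int((x - lo) / width), bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts


def boxplot_outliers(sample, k=1.5):
    """Return the values beyond k * IQR from the quartiles (the usual boxplot fences)."""
    q1, _, q3 = statistics.quantiles(sample, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in sample if x < lower or x > upper]


# Hypothetical sample with one unusually large value
sample = [2, 3, 3, 4, 5, 5, 5, 6, 7, 30]
edges, counts = histogram_counts(sample, bins=7)

# Crude text rendering of the histogram: one '#' per observation
for left, right, count in zip(edges, edges[1:], counts):
    print(f"[{left:5.1f}, {right:5.1f}) {'#' * count}")

print("flagged by the boxplot rule:", boxplot_outliers(sample))  # [30]
```

The empty bins between the main cluster and the last bin are the visual signature of an outlier in a histogram, and the same point is the one flagged by the boxplot rule.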
Multivariate non-graphical EDA techniques generally show the relationship between two or more variables in the form of either cross-tabulation or statistics. For each combination of one categorical variable (usually explanatory) and one quantitative variable (usually outcome), we can compute a statistic of the quantitative variable separately for each level of the categorical variable, and then compare the statistics across levels. Comparing the means is an informal version of one-way ANOVA. Comparing medians is a robust informal version of one-way ANOVA (adapted from source). For two quantitative variables, we can calculate the covariance and/or correlation. When we have many quantitative variables, we typically calculate the pairwise covariances and/or correlations and assemble them into a matrix.
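The following standard-library sketch illustrates the three multivariate non-graphical summaries just described: per-level statistics, a cross-tabulation, and a pairwise correlation matrix. All function names, variable names and data values are invented for illustration; in practice a data-frame library would usually provide these operations directly.

```python
import statistics
from collections import Counter, defaultdict


def group_summary(levels, values):
    """Mean and median of a quantitative variable for each level of a categorical one."""
    groups = defaultdict(list)
    for level, value in zip(levels, values):
        groups[level].append(value)
    return {
        level: {"mean": statistics.fmean(v), "median": statistics.median(v)}
        for level, v in groups.items()
    }


def cross_tab(a, b):
    """Cross-tabulation of two categorical variables: count of each (a, b) pair."""
    return Counter(zip(a, b))


def covariance(x, y):
    """Sample covariance (n - 1 denominator)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (len(x) - 1)


def correlation(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(x, y) / (statistics.stdev(x) * statistics.stdev(y))


def correlation_matrix(columns):
    """Pairwise correlations for a dict of column name -> list of numbers."""
    names = list(columns)
    return {a: {b: correlation(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical data: a two-level explanatory variable and a numeric outcome
treatment = ["a", "a", "b", "b", "b"]
outcome = [1.0, 3.0, 2.0, 4.0, 9.0]
by_level = group_summary(treatment, outcome)
print(by_level)

# Hypothetical quantitative columns
columns = {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [8, 6, 4, 2]}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(r, 3) for other, r in row.items()})
```

Comparing the per-level means (or medians) in by_level is the informal counterpart of one-way ANOVA mentioned above; no significance test is performed here.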
For two categorical variables, the most commonly used graphical technique is the grouped barplot, with each group representing one level of one of the variables and each bar within a group representing a level of the other variable. For one categorical and one quantitative variable, we can draw side-by-side (parallel) boxplots, one for each category. For two quantitative variables, the basic graphical EDA technique is the scatterplot, which has one variable on the x-axis, one on the y-axis and a point for each case in your dataset. Typically, the explanatory variable goes on the x-axis. Additional categorical variables can be accommodated by the use of colour or symbols.
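As one possible way to draw two of these plots, the sketch below uses matplotlib (an assumption; the sources do not name a plotting library). The helper names, axis labels and data are hypothetical. It draws a scatterplot with one colour per level of an extra categorical variable, and side-by-side boxplots of a quantitative variable for each category.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt


def scatter_by_group(ax, x, y, groups):
    """Scatter y against x, one colour (one scatter call) per level of `groups`."""
    for level in sorted(set(groups)):
        xs = [xi for xi, g in zip(x, groups) if g == level]
        ys = [yi for yi, g in zip(y, groups) if g == level]
        ax.scatter(xs, ys, label=str(level))
    ax.set_xlabel("explanatory variable")  # explanatory variable on the x-axis
    ax.set_ylabel("outcome variable")
    ax.legend(title="group")
    return ax


def side_by_side_boxplots(ax, values, groups):
    """One boxplot of `values` for each level of the categorical variable `groups`."""
    levels = sorted(set(groups))
    data = [[v for v, g in zip(values, groups) if g == level] for level in levels]
    ax.boxplot(data)  # boxes are drawn at positions 1..len(levels)
    ax.set_xticks(range(1, len(levels) + 1))
    ax.set_xticklabels(levels)
    return ax


# Hypothetical data: two quantitative variables and one categorical variable
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.4, 3.9, 5.3, 5.8]
groups = ["a", "b", "a", "b", "a", "b"]

fig, (ax_scatter, ax_box) = plt.subplots(1, 2, figsize=(9, 4))
scatter_by_group(ax_scatter, x, y, groups)
side_by_side_boxplots(ax_box, y, groups)
```

To view the result, call fig.savefig with a file name of your choice, or switch to an interactive backend and call plt.show().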
EDA is a complex and subjective approach. In this post, we have outlined a systematic set of EDA techniques, organized so that their results provide inputs to the subsequent stages of analysis.
Image source: HDIUK-Handheld-Magnifier-Spyglass-Magnifying