Recently, I came across with an interesting book on the statistics which has a narration of Ugly Duckling story and correlation of this story with today's DATA or rather BIG DATA ANALYTICS world. This story originally from famous storyteller Hans Christian Andersen

Story goes like this...

The duckling was a big ugly grey bird, so ugly that even a dog would not bite him. The poor duckling was ridiculed, ostracized and pecked by the other ducks. Eventually, it became too much for him and he flew to the swans, the royal birds, hoping that they would end his misery by killing him because he was so ugly. As he stared into the water, though, he saw not an ugly grey bird but a beautiful swan.

Data are much the same. Sometimes they’re just big, grey and ugly and don’t do any of the things that they’re supposed to do. When we get data like these, we swear at them, curse them, peck them and hope that they’ll fly away and be killed by the swans.

Alternatively, we can try to force our data into becoming beautiful swans.

Let me correlate the above narration with the data analysis solution, in 2 ways:

1. Build the process to expose the potential of the data to become beautiful swan

2. Every data need set of assumptions and hypotheses to be tested before it dies as a ugly duckling.

The * process of exposing the potential* of the data is vast from data sourcing, wrangling, cleansing to Exploratory Data Analysis (EDA) and further detailed analysis. These steps should be an integral part of any data product. Though these processes have been for years with most of the data analysis systems and projects, but in recent years it is fairly extended and integrated to external datasets. This external data build an eco-system (support system) around your data to prove the value. e.g. If you want to expose your customer data to a level where it not only show 360 degree view but it also start revealing customer pattern, response with external system. Location play an important role (one of the important part) in this whole process. The spatial mapping, where the customers can be joined with their surrounding. There are various tools which can help you to achieve this spatial mapping from Java GIS libraries to R-Spatial Libraries. read this Spatial Analysis in R at original on DominoData Lab blog

Once you set the mapping right with external datasets, then there are various tools available for wrangling. Eventually, you cleanse the data and do the EDA with this broader dataset, then it becomes customer view with much broader spectrum of external datasets of Geo Location, Economy, GPS-sensor etc. With this, You can start analyzing customers by different segments which you have never captured within your systems. in short, something like this..

Not limited to spatial mapping and analysis but there are many more external data elements which can help your data building process to extend it to much broader range of variables for analysis. With an effective (rather smart) use of these data linkages you can start converting any ugly duckling into meaningful swan.

Let us look at the second part of the solution to * build assumptions and hypotheses*. Given any Data duckling you should start assessing how much of an ugly duckling of a data set you have, and discovering how to turn it into a swan. This is more a statistical solution of conversion (proving and probing) for duckling than a previously explained engineering solution. When assumptions are broken we stop being able to draw accurate conclusions about reality. Different statistical models assume different things, and if these models are going to reflect reality accurately then these assumptions need to be true. This is a step by step process and developed from parametric test i.e. a test that requires data from one of the large catalogue of distributions that statisticians have described. The assumptions that can be tested are:

1 *Normally distributed data*: The rationale behind hypothesis testing relies on having something that is normally distributed (in some cases it’s the sampling distribution, in others the errors in the model).

2 *Homogeneity of variance*: This assumption means that the variances should be the same throughout the data. In designs in which you test several groups of participants this assumption means that each of these samples comes from populations with the same variance. In correlational designs, this assumption means that the variance of one variable should be stable at all levels of the other variable.

3 *Interval data*: Data should be measured at least at the interval level. This assumption is tested by common sense.

4 *Independence*: This assumption, like that of normality, is different depending on the test you’re using. In some cases it means that data from different participants are independent, which means that the behavior of one participant does not influence the behavior of another.

As there is vast support of tools in data collection there are various tools which can also help you to test hypotheses not only by number but visually too e.g. ggplot2, pastecs and psych

So, jump straight into the data with either of these approaches (or both) and forsure you can take any duckling to a journey of becoming a beautiful swan. That's actually start of science, eventually developing a process of learning. And, build a process to learn by itself, whenever a new bird comes it would predict whether it will become a swan or remain to be duckling forever :)

Read the original post on Datum Engineering here

© 2019 Data Science Central ® Powered by

Badges | Report an Issue | Privacy Policy | Terms of Service

**Most Popular Content on DSC**

To not miss this type of content in the future, subscribe to our newsletter.

- Book: Statistics -- New Foundations, Toolbox, and Machine Learning Recipes
- Book: Classification and Regression In a Weekend - With Python
- Book: Applied Stochastic Processes
- Long-range Correlations in Time Series: Modeling, Testing, Case Study
- How to Automatically Determine the Number of Clusters in your Data
- New Machine Learning Cheat Sheet | Old one
- Confidence Intervals Without Pain - With Resampling
- Advanced Machine Learning with Basic Excel
- New Perspectives on Statistical Distributions and Deep Learning
- Fascinating New Results in the Theory of Randomness
- Fast Combinatorial Feature Selection

**Other popular resources**

- Comprehensive Repository of Data Science and ML Resources
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- 100 Data Science Interview Questions and Answers
- Cheat Sheets | Curated Articles | Search | Jobs | Courses
- Post a Blog | Forum Questions | Books | Salaries | News

**Archives:** 2008-2014 |
2015-2016 |
2017-2019 |
Book 1 |
Book 2 |
More

**Most popular articles**

- Free Book and Resources for DSC Members
- New Perspectives on Statistical Distributions and Deep Learning
- Time series, Growth Modeling and Data Science Wizardy
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- Comprehensive Repository of Data Science and ML Resources
- Advanced Machine Learning with Basic Excel
- Difference between ML, Data Science, AI, Deep Learning, and Statistics
- Selected Business Analytics, Data Science and ML articles
- How to Automatically Determine the Number of Clusters in your Data
- Fascinating New Results in the Theory of Randomness
- Hire a Data Scientist | Search DSC | Find a Job
- Post a Blog | Forum Questions

## You need to be a member of Data Science Central to add comments!

Join Data Science Central