Meeting the Data: What to Do Once We Have Data in Our Hands? (a pathway for those starting with data science)
Beginners in data science often don't know what to do with a dataset once they have it at hand. Where to start, which analysis to run, what to consider in the analysis, and which tool to use are common questions, and not only among beginners. This article gives the author's perspective (learned from studying the work of some experts) to help the data science beginner know what to do with data. Let's go then!
First of all, it is important to have a tidy dataset: one where each variable has its own column (if we deal with tabular data) and where each column has a single data type and a descriptive name (one that says something about the values it holds). There are great articles and tutorials about data tidying, referenced at the end of this article, since tidying is not our primary focus here. In summary, the data has to be well formatted and follow a consistent logic.
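As a minimal sketch of the tidy-data idea, here is how a "messy" table with one column per year can be reshaped so that each variable (country, year, cases) gets its own column. The example uses pandas, and the table and its values are invented for illustration:

```python
import pandas as pd

# A hypothetical "messy" table: one column per year instead of a
# single 'year' variable (the values are made up for illustration).
messy = pd.DataFrame({
    "country": ["Brazil", "Chile"],
    "2019": [10, 20],
    "2020": [15, 25],
})

# Tidy it: melt the year columns into a single 'year' variable,
# so each variable has its own column and each row is one observation.
tidy = messy.melt(id_vars="country", var_name="year", value_name="cases")
print(tidy)
```

After melting, the table has three columns (country, year, cases) and one row per country-year observation, which is the shape most analysis tools expect.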
Once we have a clean dataset, it is important to know what the data is about. We may have a medical dataset, but does it contain data about diseases, patient records, or pharmaceutical data about medicines? It is important to know a good deal about the data at hand: the more we know about the variables, the better.
After learning as much as we can about the data, we need a goal in mind. "What do I want to pull out of this dataset?" is a good question to ask ourselves. If we have a clear goal in mind, great! Predicting one of the variables, finding a correlation between two variables, or simply summarizing the data for a report are examples of goals one can have in mind. If we don't have a clear goal, it is a good idea to start with descriptive statistics: the mean, mode, median, standard deviation, and frequency of the data. It also helps to plot the variables we find most interesting to see whether any trend is visible. Just play around with the data! Doing so makes it easier to formulate our "goal" for the analysis.
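The descriptive statistics listed above can be computed in a few lines. Here is a sketch using pandas with a hypothetical column of patient ages (the numbers are invented):

```python
import pandas as pd

# Hypothetical patient ages; the numbers are made up for illustration.
ages = pd.Series([23, 35, 35, 41, 52, 35, 60, 41])

print("mean:", ages.mean())                  # arithmetic average
print("median:", ages.median())              # middle value
print("mode:", ages.mode().tolist())         # most frequent value(s)
print("std:", ages.std())                    # spread around the mean
print("frequency:")
print(ages.value_counts())                   # frequency of each value
```

These five numbers already say a lot: the typical value, the most common value, and how spread out the data is, which is often enough to start forming a goal for the analysis.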
Maybe this is enough and we have already found what we were looking for. If reporting summary numbers is all we need, then we have already done a descriptive data analysis! But we can take a step further into "exploratory analysis" and try to squeeze some more juice out of the dataset.
An exploratory data analysis builds upon and complements a descriptive analysis. In this type of analysis we look for discoveries: patterns, trends, and correlations among the variables. It is a good idea to plot some variables of interest, run a correlation function on some variables, or classify the data into clusters. It is interesting to note that at this point we get a little more "acquainted" with the dataset and can start to have new insights. In a nutshell, this is what exploratory data analysis consists of.
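Running a correlation function, as suggested above, can be sketched like this with pandas. The two variables (weekly exercise hours and resting heart rate) and their values are invented for illustration:

```python
import pandas as pd

# Hypothetical data: hours of exercise per week vs. resting heart rate.
df = pd.DataFrame({
    "exercise_hours": [0, 2, 4, 6, 8, 10],
    "resting_hr":     [80, 76, 72, 68, 64, 60],
})

# A correlation matrix is a quick way to spot linear relationships
# among the variables; values near -1 or +1 suggest a strong trend.
corr = df.corr()
print(corr)
```

In this invented example the correlation is strongly negative, which is exactly the kind of pattern an exploratory analysis is meant to surface, prompting a follow-up question or a plot.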
With our findings we can make inferences or predictions using the data. We do inferential data analysis when we quantify whether our findings are likely to hold beyond the dataset at hand. This is a very common statistical analysis in the formal scientific literature. A classic example is the discovery that smoking is related to lung cancer. It all started with findings showing that people who smoked heavily also tended to have lung cancer; researchers then made inferences about the association, kick-starting deeper studies.
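One common way to quantify whether an association is likely to hold beyond the sample is a chi-squared test on a contingency table. The sketch below uses SciPy; the counts are entirely invented and this is not a reconstruction of the historical smoking studies, just an illustration of the technique:

```python
from scipy.stats import chi2_contingency

# Hypothetical contingency table (counts invented for illustration):
# rows = smoker / non-smoker, columns = disease / no disease.
table = [[30, 70],
         [10, 90]]

# The test compares observed counts against the counts we would
# expect if the two variables were independent.
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.4f}")
```

A small p-value means the observed association would be very unlikely if the variables were independent, i.e. the finding may well hold beyond this particular sample.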
A predictive data analysis uses a subset of variables (independent variables) to predict another variable (the dependent variable). The most common techniques used in predictive analytics are linear regression and logistic regression. Examples of this type of analysis include organizations trying to predict the total number of phone calls their call center will receive on a given day, or banks trying to predict whether a customer will default. A more detailed example can be found in the book "Moneyball: The Art of Winning an Unfair Game" by Michael Lewis, where the author shows how predictive analytics helped a baseball team win more games using, among other things, linear regression. The idea was such a success that it changed many coaches' approach to the sport.
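The call-center example above can be sketched as a simple linear regression with NumPy. The data (staff on duty as the independent variable, calls handled as the dependent variable) is invented for illustration:

```python
import numpy as np

# Hypothetical call-center data: staff on duty (independent variable)
# vs. calls handled that day (dependent variable). Numbers are invented.
staff = np.array([5, 10, 15, 20, 25], dtype=float)
calls = np.array([110, 205, 310, 395, 500], dtype=float)

# Fit a simple linear regression (least squares) of calls on staff.
slope, intercept = np.polyfit(staff, calls, 1)
print(f"calls = {slope:.1f} * staff + {intercept:.1f}")

# Use the fitted line to predict calls for a day with 30 staff on duty.
prediction = slope * 30 + intercept
print(f"predicted calls with 30 staff: {prediction:.0f}")
```

Logistic regression, the other technique mentioned, works the same way conceptually but predicts a probability of a yes/no outcome, such as whether a customer will default.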
So, basically, the steps to take in order to start a sound data analysis are: first, have a tidy dataset; second, have a goal in mind (formulate what we want to extract from the data); third, do some descriptive analysis; fourth, do some exploratory analysis, plotting variables to find correlations and trends; and fifth, make inferences or predictions from the discoveries.
Regarding the tools to use: if we need to prepare the dataset by pulling variables from multiple tables or databases, SQL is very handy, since the vast majority of relational databases use it. After the dataset is ready, descriptive analysis can be done in spreadsheet software like MS Excel using its Data Analysis ToolPak. For more advanced analysis we will need a more powerful tool like R, Python, Matlab, SAS, or WEKA. I prefer to use R or SAS even for descriptive analysis and use spreadsheets only for reporting. SAS is a great tool (the best, in my humble opinion) but is not free; it does have a limited free edition called SAS University Edition that is worth checking out to learn the tool. R, on the other hand, is free and has many packages that make data analysis a lot easier; for most cases I think R can answer any analyst's demands. Python is a great programming language whose popularity is growing very fast among data scientists; with Python one can do data analysis and also traditional programming (a web page, for instance). Matlab is another powerful tool, very popular in machine learning and complex mathematical computing; Octave is a free alternative that is also worth checking out. WEKA is free data analysis software developed in Java at the University of Waikato, and it has been integrated into the Pentaho Business Intelligence suite.
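The kind of SQL preparation step mentioned above, pulling variables from multiple tables into one analysis-ready dataset, can be sketched with Python's built-in sqlite3 module. The table and column names are invented for illustration:

```python
import sqlite3

# An in-memory database with two hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patients (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE visits (patient_id INTEGER, cost REAL);
    INSERT INTO patients VALUES (1, 'Ana'), (2, 'Bruno');
    INSERT INTO visits VALUES (1, 100.0), (1, 50.0), (2, 80.0);
""")

# Join the tables and aggregate into one analysis-ready dataset:
# one row per patient with the total cost of their visits.
rows = conn.execute("""
    SELECT p.name, SUM(v.cost) AS total_cost
    FROM patients p
    JOIN visits v ON v.patient_id = p.id
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
print(rows)
conn.close()
```

The same JOIN/GROUP BY pattern works on any relational database; only the connection step changes.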
There are many other tools besides these, but these are the most widely used in the field. For the novice, I think SQL, R, or Python are the ones to go for. After knowing a good deal about them, one can move on to SAS, Matlab, WEKA, and so on.
I hope this helps, especially the novices, to outline a path toward a more organized data analysis and to know what to do once we have data in our hands.
For more on clean datasets, read the article "Tidy Data" by Hadley Wickham in the Journal of Statistical Software at http://www.jstatsoft.org and the ebook "Practical Data Cleaning" by Lee Baker at https://leanpub.com/practicaldatacleaning
For those more interested in the subject of this article, check out the ebook "The Elements of Data Analytic Style" by Jeff Leek at https://leanpub.com/datastyle
Flavio Bossolan is a data analyst and Machine Learning enthusiast. He has experience in data analysis using SQL, SAS, R, Octave and spreadsheets. He has worked with default financial analysis for Basel II requirements and financial loss predictive analysis for one of Brazil's major private banks. He currently works with data analysis related to clients' experience at Telefónica. He can be reached at [email protected] or at https://br.linkedin.com/in/flaviobossolan