With loads of content and hype about Data Science, Analytics and Big Data coming up every day, I felt compelled to share my journey and experiences in exploring this space. Here, I intend to begin a series of posts on the subject, stripping it down to its core (as I understand it) and then building upon it.
While there are many directions from which these subjects could be approached, I present here the path that I found most convenient to tread. I would like to call it the path of the 'Data Shinobi' (another word for 'Ninja'), where you understand and gain expertise in the techniques just enough for, and relevant to, the task at hand, unlike a seasoned 'Data Scientist' who has mastered most of these techniques over many years. It should also be noted that when I mention the 'Shinobi', it is in the context of the training that shinobi underwent rather than their expertise and mastery over skills.
To begin with, here's a list of 5 fundamental facts about Data Science which I discovered early on in my journey:
#1: Data Cleaning is almost always a pain - You often have an amazing approach and models ready to get your solution up and running, yet you have to wait for what feels like ages to get your data ready. You can easily spend 70-90% of your time reshaping your data into a form that can be processed.
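To make that concrete, here is a minimal sketch of what the reshaping often looks like in practice, using Python's pandas on a tiny made-up dataset (the column names and values are purely illustrative, not from any real project):

```python
import pandas as pd

# Hypothetical raw data: inconsistent casing, stray whitespace,
# non-numeric entries and missing values - the usual suspects.
df = pd.DataFrame({
    "customer": ["Alice", "alice ", "Bob", None],
    "spend": ["100", "100", "not available", "250"],
})

df["customer"] = df["customer"].str.strip().str.title()    # normalize text
df["spend"] = pd.to_numeric(df["spend"], errors="coerce")  # bad values -> NaN
df = df.dropna(subset=["customer"])                        # drop missing keys
df = df.drop_duplicates()                                  # remove exact dupes
```

Four short lines here, but on a real dataset each of these steps tends to multiply into dozens of special cases.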
#2: Great visualizations sometimes overshadow amazing models - 'A picture is worth a thousand words' still holds here. A great model, or a rebuilt data architecture that speeds up analytics, has to be a thousand times more impressive to get noticed if it isn't backed by a stunning visualization. I believe visualizations demand more effort from the other half of the brain, and sometimes people with no idea about quantitative modeling can suggest or build really beautiful visualizations that make you think, "It makes perfect sense when you look at it like that..."
#3: There are two fundamentally different approaches in Data Science - One, and probably the older, is Statistical Learning, which has evolved from traditional statistics and works on error minimization through generalization techniques. The other is Machine Learning, which is probably the way a computer scientist would approach a problem. Machine learning techniques minimize error by iteratively training the 'machine' to make better predictions. The line between the two is getting thinner, as they seem to borrow concepts from each other from time to time.
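As a toy illustration of the contrast (my own sketch, on made-up one-variable data): the statistical route fits a line in a single closed-form step, while the machine-learning route reaches roughly the same line by iteratively nudging parameters to reduce the prediction error.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 3.0 * x + 1.0 + rng.normal(0, 0.1, 100)  # true line plus noise

# Statistical route: closed-form least-squares fit, one step.
slope_cf, intercept_cf = np.polyfit(x, y, 1)

# Machine-learning route: iteratively move against the error gradient
# (constant factors folded into the learning rate lr).
w, b = 0.0, 0.0
lr = 0.5
for _ in range(2000):
    err = w * x + b - y
    w -= lr * (err * x).mean()
    b -= lr * err.mean()
```

Both routes converge to essentially the same line here; the interesting differences show up on problems where no closed form exists.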
#4: Not all problems can be boiled down to the textbook techniques anymore - The traditional buckets under which techniques or approaches were classified are Regression, Classification, Clustering and so on (this is almost always the first chapter in most books on Data Science). However, some techniques, such as LDA (Latent Dirichlet Allocation) for text analytics, lie outside these buckets. Similarly, SVD (Singular Value Decomposition) and ALS (Alternating Least Squares), used for recommender systems, are quite different techniques.
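To hint at why a recommender doesn't fit neatly into those buckets, here is a tiny hypothetical SVD sketch (the ratings matrix is invented for illustration): a user-by-item ratings matrix is factorized, and a low-rank reconstruction smooths the ratings so that each user's taste group shows through.

```python
import numpy as np

# Toy user x item ratings matrix (rows: users, columns: items).
# Users 0-1 like items 0-1; users 2-3 like items 2-3.
R = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, 1.0, 5.0, 4.0],
    [1.0, 2.0, 4.0, 5.0],
])

# Factorize and keep only the top-k singular components.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-2 reconstruction
```

`R_hat` is neither a regression fit nor a classifier: the predicted scores come from the latent factors, which is the idea that ALS also exploits at scale.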
#5: There are always unknowns from new fields - Even once you have mastered some techniques and have a good grip on R or SAS, there are always new challenges, which is what keeps Data Science interesting. Additionally, there may be entire fields where a domain expert knows far more than you do. For example, Supply Chain experts understand a lot about optimization and forecasting techniques, and it is hard for a Data Scientist without that domain knowledge to apply statistical learning techniques to problems in the field. In the end, continuous improvement (or 'Kaizen') is what works well here as well.