Thinking of data science as merely a technical profession, like programming, may take you away from your goals. Focusing on the usability of mathematics for data science before jumping into full-fledged math courses will save you a lot of time.
I wrote this blog post because I made a few mistakes while starting out as a data scientist. I think a lot of people are making the same transition, and you may be one of them or you know one (or a few) of them. I don’t want you to make the same mistakes, hence this blog post. Even if you are not an industrial software developer like me, more than half of this stuff is still going to be very useful to you. Here we go:
Recently, I read a very short history of data science by Gil Press and history, as usual, is always interesting:
While reading it, I remembered the time when I started learning programming back in 2005. I was so into the history of computing, the history of software, the history of hardware, and origins of Open-Source software, History of Hackers, The GNU Project, 20 years of Berkeley UNIX, How to Become a Hacker, etc. I remember coming across one hexadecimal instruction in a system program being used by one of the BSDs and that hexadecimal instruction was the programmer’s birthday. Back in the time when high-level languages meant C, he needed a unique and memorable instruction inside the program and he used his birthday. I can’t recall what exactly that was but something like this: 0x20191211 (look closely, 0xYEAR then MONTH and then DATE in that order). What days those were =:-). It was an amazing journey of being a programmer. I wrote code every day. I wrote code with every breath I took. Nothing could replace the joy of programming. When I was searching for the history of data science, I came across this very well written account by Gil Press. It reminded me of old times. I think it is true that you can’t get passionate about a profession just with logic and highly trained skills, the heart has to be involved too. And in fact, the heart is the first requirement if you want to attain a high amount of skill.
Mistake #1: Not Understanding The Two Cultures
Statistics has been around for a long time and it has worked well enough to handle any and every kind of data before Big Data arrived. Traditionally, Statistics has dealt with small data for a long time. You can find a lot of articles on small vs big data, but this post is not about the comparison. This post is about one author I noticed in Gil’s article, Leo Breiman, who wrote Statistical Modeling: The Two Cultures back in 2001. Here is the abstract:
There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.
That is a very, very important point neglected by many people starting in data science. When I shifted my career from a software developer to a data scientist, one thing I was hit hard by was the Mathematics involved, especially Statistics, Probability, Linear Algebra, and Calculus, almost in that order of importance. So, I spent a few months learning all four. It was good, except that it was not. While all of the mathematics I learned was quite interesting, the bigger question was: did I need it for my transition? The answer is No. I learned all that and I noticed it was not of much use when I dealt with real-life Big Data. When you are working on solving problems using data science in a business corporation, that much math, that many exercises from a book, theoretical problems, and deep study or research won’t all become useful. You must not do it. I felt like I lived in a cave and learned all the Mathematics I needed and when I came out of the cave into the real-world of a software corporation, software as a business, all of my dreams were shattered. So I can say now with experience that Leo Breiman was correct. It was not the best use of time for an industrial software developer. It was not the best use of time for a software developer looking to shift his career within the software industry. I should have known better. I realized it rather late. I can’t get those months of my life back. What I can do is to use this mistake to make better decisions this time.
Mistake #2: Not Understanding The Shift In Industrial Focus
Times change. In the last 30 years, an immense amount of software has been produced. Marc Andreessen even said that software is eating the world. It was true I think:
In the last few years, the focus has changed within the software industry from creating a lot of software to using the software. With all the advancement in technology and hardware, the software is still being created but that is not where all the buzz is, now the buzz is around usability. It has shifted from C to Python. C model says hardware’s time is more important than developer’s because hardware was very expensive back in those days. Python model says developer’s time is more important than the hardware because the hardware is now cheap and developers putting time where it is not needed counts as a loss for the business. With this came the shift from creation to usability. And this change in focus is expanding at an alarming rate and according to Jeetu Patel, software is still eating the world, but in a different way and on a different path:
Mistake #3: Ignorance Of The Obvious: Rise of Social Media & Expansion of Internet
And according to Jeff Desjardins, by 2025, 463 exabytes of data will be generated each day.
Businesses are asking this important question: What we are doing with all this data?
What Can We Learn From Those Three Mistakes Above?
First, buying a book on Statistics, Probability or Linear Algebra that is being used in academia will be a complete waste of your time. The book itself may be very useful and has not much of a place in the software industry. When you are going to work in Big Data in the software industry, you need to know what tools this industry uses and where you can learn them. This is major. The academic books are a minor. Don’t major in minor stuff.
So, the minor here is what Leo Breiman called data modeling, the traditional statistics. Major here is the algorithmic modeling, the techniques, and methods of working with the complex world of Big Data. you are looking at An Introduction to Statistical Learning (ISL):
This book is where you need to spend major of your time. This book feels bit mathematical of course, but you gotta get used to doing that. At least it contains less Mathematics than their The Elements of Statistical Learning (ESL) which hurt my head like The Art of Computer Programming did. ESL is more of a research-oriented book while ISL is more inclined towards real-life data analysis. Both books you can download for free from home pages I linked above. I advise you to buy the hard copies because it ain’t fun reading an 800-page book on a computer. From my experience, one absorbs more content and remembers better when learning from a hard-copy.
Now, it is not to say that traditional Statistics is of not much use. Many traditional concepts are still used in Big Data and they are fundamental to understanding anything related to working with data. So, you still need to spend minor time on traditional statistics and probability. Yes, it is minor time but still, it is the time you need to invest.
Along with those MOOCs, one needs to have a broader understanding of the practical usage of Statistics and Probability in real-life. I recommend this excellent book for anyone to read, whether they want to become a data scientist or not:
Even though it was written back in 1988, concepts mentioned in this book are evergreen. This book will impact deeply the way you think about Mathematics (well, mostly about Statistics and Probability to be accurate).
And this is the place I am standing right now. I have already done those MOOCs mentioned above and I have read Innumeracy, have ordered and got ISL too. I will be working through ISL and will share my experience a couple of weeks later. I am also working on the obstacles I am facing while searching for, and getting, a data science job. I will write about this too once I am successful in breaking down all the obstacles.