I have been following the news on data science and wanted to share some of the titles here: "The Sexiest Job of the 21st Century," "Ten Trends in Data Science," "How to Become a Data Scientist," "What Is Data Science," and so on.
I wonder how and when data and science became sexy and everybody started to have a stake in data science. Is it because the young, energetic technology companies of Silicon Valley have named it that way? Do we need more data, or do we simply have more data nowadays? Does the data make science sexy, or is it the other way around?
With all of these questions in my mind, I was sitting in my library looking at my books, which never looked so sexy before. There was the second edition of Schaum's Statistics, published in 1988 and purchased in 1993. I remember using it over and over to develop correlation algorithms to identify types of bread mold. I liked what I did in the late '90s as a researcher (back then, people developing algorithms were called researchers), but I don't recall anyone envying my job or calling it sexy. My attention moved to another forgotten book, Oppenheim's Signals and Systems. That book always meant long days and nights spent understanding the nature of the Fourier transform, developing algorithms, and taking it a step further into the cepstrum domain (a logarithmic approach to separating the frequency and phase components of a signal).
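For readers who have not met the cepstrum, the standard real cepstrum is just the inverse Fourier transform of the log magnitude spectrum. A minimal sketch (my own illustration, not code from the book) using NumPy:

```python
import numpy as np

def real_cepstrum(x):
    """Real cepstrum: inverse FFT of the log magnitude spectrum."""
    spectrum = np.fft.fft(x)
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # epsilon avoids log(0)
    return np.fft.ifft(log_mag).real

# Toy signal: two sinusoids sampled at 512 points over one second.
t = np.linspace(0, 1, 512, endpoint=False)
x = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)
c = real_cepstrum(x)
```

The logarithm is what does the separating: multiplicative structure in the spectrum becomes additive in the cepstrum, which is why the technique was worth the long nights.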
I remember the days of Bezdek and fuzzy c-means clustering. My humble team developed algorithms to classify landmines in Angola. We spent a lot of time looking at the data, the matrices, and the vectors before selecting a random sample group. Principal component analysis was another popular method for compressing the data to decrease the cost of the algorithms. It was not so long ago, in 2010, that I wrote my dissertation on it.
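The compression idea is simple to sketch: center the data, take the eigenvectors of its covariance matrix, and keep only the top few. This is a generic PCA illustration (not my dissertation code), written with NumPy:

```python
import numpy as np

def pca_compress(X, k):
    """Project centered data onto the top-k principal components."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    # eigh returns eigenvalues in ascending order; take the largest k
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:k]
    components = eigvecs[:, order]       # (features, k) basis
    return Xc @ components, components   # compressed scores and basis

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 samples, 5 features
scores, basis = pca_compress(X, 2)       # keep 2 components
```

Feeding the 2-column `scores` to a downstream algorithm instead of the 5-column original is exactly the cost saving we were after.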
Among all those algorithms and applications, my favorite is a very simple method called clipping. When I realized that outliers might carry information useful for developing forecasting algorithms, I was so impressed with the power of clipping. It is basically a fuzzy thresholding. You identify a threshold (there are many ways to identify one: averaging the data, averaging chunks of the data, and so on) and set a value to zero if it is smaller than the threshold; otherwise, you keep it as it is. It was so sexy to me that I got higher resolution in my data and could recover more features. It made the algorithms slower and costlier, but who cares in this age of the cloud and powerful computers?
Those were the days when MATLAB crashed over and over and had problems with averaging and filtering. We all needed to validate what we were doing. I wonder: do we still need to validate what we are doing with data and try to learn from the nature of the data? Or have we moved a step further, to where all datasets are the same? Can we trust commercial products and press a button to puke out graphs and histograms? Is that why data science became so sexy?
All in all, the message I am trying to give is that data science is becoming a cluster of many things and nothing. We forget about the data itself and focus on how many times we can click per second using powerful packaged algorithms.