
While there are now many data science programs worth attending (see for instance programs from top universities), there are still programs advertising themselves as data science, but that are actually snake oil at worst, misleading at best.

(Caption: Many fake data scientists work on global warming, on both sides.)

This one, posted on DataCamp (see details in the next paragraph), is essentially old statistics with R, and won't help you get a data science job or work on interesting projects. Having said that, you should learn R anyway; it's part of data science. But please learn it and apply it to projects that will help you in your career, not to old-fashioned examples like the famous iris data set, with its 150 observations and 3 clusters. And don't waste money and tons of hours on such a program when the material is available for free on the Internet and can be assimilated in a few hours.
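
To illustrate how quickly this material can be absorbed, here is a minimal sketch in free, built-in R that covers the iris example end to end (nothing beyond base R assumed):

    # iris ships with base R: 150 observations, 4 measurements, 3 species
    data(iris)
    set.seed(42)                                  # reproducible result
    fit <- kmeans(iris[, 1:4], centers = 3, nstart = 25)  # recover the 3 clusters
    table(fit$cluster, iris$Species)              # compare clusters to species labels

That is essentially the whole exercise; no 26-hour course required.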

Example of fake data science program

Build a solid foundation in data science, and strengthen your R programming skills. Learn how to do data manipulation, visualization and more with our R tutorials.

  • Write your first R code, and discover vectors, matrices, data frames and lists.
  • Seven Courses on student t-tests, ANOVA, correlation, regression, and more. (26 hours)

How to detect fake data science

Anytime you see a program dominated by ANOVA, t-tests, linear regression, and, generally speaking, material published in any statistics 101 textbook dating back to 1930 (when computers did not exist), you are not dealing with actual data science. While it is true that data science has many flavors and does involve a bit of old-fashioned statistical science, most of the statistical theory behind data science has been entirely rewritten in the last 10 years, and on many occasions invented from scratch to solve big data problems. You can find the real stuff, for instance, in Dr. Granville's Wiley book and his upcoming Data Science 2.0 book (free), as well as in DSC's data science research lab. The material can be understood by any engineer with limited or no statistical background. It is indeed designed for them, and for automation / black-box usage - something classical statistics has been unable to achieve so far.
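
For reference, each of the topics in question boils down to a one-line call in R. A minimal sketch using R's built-in data sets:

    # Statistics 101 in three lines of free, built-in R
    t.test(extra ~ group, data = sleep)              # Student t-test
    summary(aov(mpg ~ factor(cyl), data = mtcars))   # one-way ANOVA
    summary(lm(mpg ~ wt + hp, data = mtcars))        # linear regression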

Also, you don't need to know matrix algebra to practice modern data science. When you see 'matrices' in a data science program, it's a red flag, in my opinion.

More warnings about traditional statistical science

Some statisticians claim that what data scientists do is statistics, and that we are ignorant of this fact. There are many ways to solve a problem: sometimes the data science solution is identical to the statistical solution. But the data science solution is typically far easier to understand and to scale. An engineer, someone familiar with algorithms or databases, or a business manager will understand it easily. Data science, unlike statistics, has no mystery, and does not force you to choose among hundreds of procedures to solve a problem.

In some ways, data science is more simple, unified, and powerful than statistics, and in some ways more complicated as it requires strong domain expertise for successful implementation, as well as expertise in defining the right metrics, and chasing or designing the right data (along with data harvesting schedules). 

It is interesting that Granville's article on predicting records was criticized by statisticians as being statistical science, when in fact you can understand its details without ever having attended a basic statistics class. Likewise, engineers extensively use concepts such as copulas - something that looks very much like statistical science - that have never been adopted by classical statisticians (you won't find them in their textbooks).
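
For the curious, a Gaussian copula can be simulated in a few lines of base R. This is a minimal sketch; dedicated packages such as copula go much further:

    # Simulate a 2-D Gaussian copula with correlation rho, base R only
    set.seed(1)
    n   <- 1000
    rho <- 0.7
    L   <- chol(matrix(c(1, rho, rho, 1), 2, 2))  # Cholesky factor of the correlation matrix
    z   <- matrix(rnorm(2 * n), ncol = 2) %*% L   # correlated standard normals
    u   <- pnorm(z)                               # uniform margins: this is the copula sample
    cor(u[, 1], u[, 2], method = "spearman")      # the dependence survives the transform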

In short, some statisticians are isolating themselves more and more from the business world, while at the same time claiming that we - data scientists - are just statisticians with a new name. Nothing could be more wrong. 


Replies to This Discussion

Quote "There are many ways to reduce dimensionality."

I never said that there are only a few, but matrix factorization is state of the art today. That's undeniable. I can cite many papers on this, but I suspect that you don't read much from research journals, because of your disdain for some statisticians in the field.

The article you cited above on "Combinatorial Feature Selection" is about feature selection, which is different from feature extraction. Feature extraction is what matrix factorization is about.
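
To make the distinction concrete, here is a minimal sketch of feature extraction via matrix factorization (truncated SVD, base R only):

    # Extract 2 new features from 4 correlated ones via truncated SVD
    X <- scale(as.matrix(iris[, 1:4]))          # center and scale the raw features
    s <- svd(X)                                 # factorize: X = U D V'
    k <- 2
    scores <- s$u[, 1:k] %*% diag(s$d[1:k])     # the extracted features (scores)
    dim(scores)                                 # 150 observations, 2 extracted features

Feature selection would instead keep a subset of the original 4 columns; extraction builds new variables from all of them.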

Quote :  "Claiming that the only way to do it involves matrix operations"

Again, I never said that. Go back & read my previous messages.

Quote : "is tunnel vision or lack of exposure to modern techniques"

Again, go back up and read my messages. I'm far more widely exposed to modern techniques than most pretentious data scientists on this forum, because I scour the research journals for interesting material every day. Did you read the Wavelet-Kernel-ANOVA paper I linked to in my previous message, on its use in feature selection?

Quote : "Most step-wise, iterative processes that I know (for reduction of dimensionality) do not involve matrix operations."

That's a good thing, as it gives users and researchers an array of tools to work with. But everywhere you look in the literature, matrix factorization pops up, from image analysis in image processing to noise cancellation in signal processing (ICA and NMF, for example).

Quote : "PCA (principal component analysis) does require matrix operations, but it leads to meaningless variables"

There are many advanced variants of PCA: non-linear PCA, Bayesian PCA, kernel PCA, tensor PCA / multilinear PCA, robust PCA, dual PCA, fuzzy PCA, supervised PCA, and more. Pick one that suits your task.
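
For instance, here is a minimal kernel PCA sketch, assuming the kernlab package is installed:

    # Kernel PCA with a Gaussian (RBF) kernel, via the kernlab package
    library(kernlab)
    kpc <- kpca(~ ., data = iris[, 1:4],
                kernel = "rbfdot", kpar = list(sigma = 0.1),
                features = 2)
    head(rotated(kpc))   # the data projected onto the first 2 kernel components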

Quote : "Generally speaking, any solution involving matrices will be unstable when the determinant is close to zero, which happens frequently with variables that are correlated (a common occurrence in practice)."

Researchers have been aware of this problem for decades, but it hasn't stopped them from using matrices and publishing new, state-of-the-art work in matrix algebra. It's everywhere; one just has to look at today's literature across domains. There has been an explosion of state-of-the-art published material in matrix algebra over the last decade or so, and that's a fact. That says it all.
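
For illustration, a minimal base-R sketch of the instability in question, together with the standard ridge fix:

    # Near-singular design: two almost perfectly correlated predictors
    set.seed(7)
    x1 <- rnorm(100)
    x2 <- x1 + rnorm(100, sd = 1e-6)        # nearly collinear copy of x1
    X  <- cbind(1, x1, x2)
    kappa(t(X) %*% X)                       # huge condition number: X'X is near-singular
    lambda <- 0.01
    solve(t(X) %*% X + lambda * diag(3))    # ridge term lambda*I makes the inverse stable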

Finally, the Google PageRank algorithm is solved via the power iteration method, which is a matrix algebra problem:

http://en.wikipedia.org/wiki/Power_iteration
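
A minimal sketch of power iteration on a toy 4-page link matrix, in base R:

    # PageRank by power iteration on a tiny made-up web graph
    links <- matrix(c(0, 0, 1, 0,
                      1, 0, 0, 0,
                      1, 1, 0, 1,
                      0, 0, 1, 0), nrow = 4, byrow = TRUE)
    G <- t(links / rowSums(links))     # column-stochastic transition matrix
    d <- 0.85                          # damping factor
    M <- d * G + (1 - d) / 4           # the "Google matrix"
    r <- rep(1 / 4, 4)                 # start from the uniform distribution
    for (i in 1:100) r <- M %*% r      # repeated multiplication: power iteration
    round(r / sum(r), 3)               # stationary distribution = PageRank scores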

Sione, do you think the average (non-academic) data scientist has the time to dig into the literature to read, understand, adapt, and scale dozens of methods published in obscure outlets? Re-inventing the wheel would take less time than finding these papers.

There's a lack of unity in statistical methods: too many disparate solutions for the same problem. I try to create a unified approach based on simple principles, one that an automated black-box system (as opposed to a human) can manage successfully. So far I haven't felt the need to include advanced matrix algebra in it, be it for regression, data reduction, or many other problems.

I do not understand why there is so much bashing of data scientists. I am new to this form of data analysis, and I think we need an open approach to understanding all aspects of this new field, in which old ideas are being rewritten and applied to data science. We all have views about what is best, but criticizing each other is not the answer. The only way to look at this is to take a big-picture view and accept that change is inevitable, and that some change builds on what happened in the past to enhance the future.

Carl, this goes both ways. Too much time is spent in many academic programs teaching useless material. This is due to a large proportion of old tenured professors, averse to change, teaching the same things over and over for decades. In some ways, this is an issue related to the tenure system in academia.

Check out my data science apprenticeship; I believe it strikes a nice balance between getting rid of dead wood and keeping fundamental principles that are time-proof.

While I agree that traditional statistics does not equate to data science, I have trouble understanding the definition of data science here. We all know this field is very diverse, and it is really hard for anyone to master the entire array of disciplines. I worked with several software developers on machine learning projects using random forests and logistic-regression-like algorithms. We couldn't reach very high accuracy - only ~80% - and we couldn't figure out how to improve it either, as we are all new to the topic. I found the developers' approach was trial and error, without actually understanding the data. I feel it's better to know the data first so we can develop a better strategy, and I think statistics is a good way to do that step. Though traditional statistics has its limitations, it still helps us in certain ways, just as when we learn calculus we need to know a lot of basic math first. I agree that not every data scientist needs to master it, but it is still a very important tool, and fundamental if one wants to pursue this career.
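
To make that concrete, here is a minimal sketch of the "know the data first" step in R, using iris as a stand-in for real project data and assuming the randomForest package is installed:

    # Step 1: understand the data before modeling
    summary(iris)                 # distributions, ranges, missing values
    cor(iris[, 1:4])              # which predictors are correlated?

    # Step 2: only then fit the model
    library(randomForest)
    set.seed(1)
    fit <- randomForest(Species ~ ., data = iris, ntree = 500)
    fit$confusion                 # per-class errors: where does the model fail?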

Vincent, you are basically suggesting people do garbage in, garbage out; the skills you suggest are no more than those of a customer service operator - a high school diploma plus some computer operation skills would be sufficient.

Vincent Granville said:

Carl, this goes both ways. Too much time is spent in many academic programs teaching useless material. This is due to a large proportion of old tenured professors, averse to change, teaching the same things over and over for decades. In some ways, this is an issue related to the tenure system in academia.

Check out my data science apprenticeship; I believe it strikes a nice balance between getting rid of dead wood and keeping fundamental principles that are time-proof.

I think this post has some good points that will help us avoid scams, and everyone should know them before enrolling with a data science course provider. That said, some good providers, like ExcelR Solutions, have succeeded in training skilled data science professionals.

We call people who apply ML algorithms without first understanding the data "data monkeys". They play with data, mimic what real data scientists do, and call themselves data scientists. To make good use of data, statistics is the foundation, then linear algebra, then programming.

For example, take the really basic multivariate regression. First, we need to understand it from the statistics side. How do we actually calculate the beta? With linear algebra. Then how do we transform that into code? With computer science programming knowledge.
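
A minimal sketch of exactly that pipeline in R, using the built-in mtcars data:

    # The beta of multivariate regression, computed with plain linear algebra
    X <- cbind(1, mtcars$wt, mtcars$hp)                   # design matrix with intercept
    y <- mtcars$mpg
    beta <- solve(t(X) %*% X) %*% t(X) %*% y              # beta = (X'X)^(-1) X'y
    cbind(beta, coef(lm(mpg ~ wt + hp, data = mtcars)))   # matches lm() exactly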

You are not a data scientist if you think statistics and linear algebra are not important in this field and only CS is. Look up computer vision: it is part of data science and a great example of how statistics, linear algebra, and computer science are used together.
