
While there are now many data science programs worth attending (see for instance programs from top universities), there are still programs advertising themselves as data science, but that are actually snake oil at worst, misleading at best.

Many fake data scientists work on global warming, on both sides

This one, posted on DataCamp (see details in the next paragraph), is essentially old statistics with R, and won't help you get a data science job or work on interesting projects. Having said that, you should learn R anyway; it's part of data science. But learn it and apply it to projects that will help your career, not old-fashioned examples like the famous iris data set, made up of 3 clusters and 150 observations. And don't waste money and countless hours on such a program when the material is available for free on the Internet and can be assimilated in a few hours.

Example of fake data science program

Build a solid foundation in data science, and strengthen your R programming skills. Learn how to do data manipulation, visualization and more with our R tutorials.

  • Write your first R code, and discover vectors, matrices, data frames and lists.
  • Seven Courses on student t-tests, ANOVA, correlation, regression, and more. (26 hours)

How to detect fake data science

Anytime you see a program dominated by ANOVA, t-tests, linear regression, and, generally speaking, material published in any statistics 101 textbook dating back to 1930 (before computers existed), you are not dealing with actual data science. While it is true that data science has many flavors and does involve a bit of old-fashioned statistical science, most of the statistical theory behind data science has been entirely rewritten in the last 10 years, and on many occasions invented from scratch to solve big data problems. You can find the real stuff, for instance, in Dr. Granville's Wiley book and his upcoming Data Science 2.0 book (free), as well as in DSC's data science research lab. The material can be understood by any engineer with limited or no statistical background. It is indeed designed for them, and for automation / black-box usage - something classical statistics has been unable to achieve so far.

Also, you don't need to know matrix algebra to practice modern data science. When you see 'matrices' in a data science program, it's a red flag, in my opinion.

More warnings about traditional statistical science

Some statisticians claim that what data scientists do is statistics, and that we are ignorant of this fact. There are many ways to solve a problem: sometimes the data science solution is identical to the statistical solution. But typically, the data science solution is far easier to understand and to scale. An engineer, someone familiar with algorithms or databases, or a business manager will understand it easily. Data science, unlike statistics, has no mystery, and does not force you to choose among hundreds of procedures to solve a problem.

In some ways, data science is simpler, more unified, and more powerful than statistics; in other ways it is more complicated, as it requires strong domain expertise for successful implementation, as well as expertise in defining the right metrics and chasing or designing the right data (along with data harvesting schedules).

It is interesting that Granville's article on predicting records was criticized by statisticians as being statistical science, while in fact you can understand the details without having ever attended any basic statistics classes. Likewise, engineers extensively use concepts such as copulas - something that looks very much like statistical science - yet one that rarely appears in classical statistics textbooks.

In short, some statisticians are isolating themselves more and more from the business world, while at the same time claiming that we - data scientists - are just statisticians with a new name. Nothing could be more wrong. 


Replies to This Discussion

The future of data science is to have algorithm parameters fine-tuned automatically, just as a Google car (with no human driver) has its parameters automatically fine-tuned for better performance and to avoid crashes. The results are superior to human piloting. The same applies to piloting planes, boats, or trains, and even to soldiers replaced by robots. But it requires a different type of algorithm: stable, and suitable for industrial black-box implementation. This is a billion miles away from statistical science, and is true data science.

I mostly agree with this, but have two points.

First, a knowledge of traditional statistics is a useful place to start for some people. Intro classes often include a strong dose of descriptive statistics and visualization (at least the one I taught did) that is necessary for understanding the data in the first place, and can help clarify the implications of missing data (specifically the difference between missing at random and missing not at random). Additionally, a well-taught stats class teaches about assumptions, something that some techniques (naive Bayes comes immediately to mind) have as well. Even if all the class does is impress upon students the importance of understanding their data before they start, it is still valuable.

Second, while I agree that a data science program should not consist mostly of those techniques, I believe we do ourselves a disservice if we eliminate regression and logistic regression from our toolbox. Both techniques can be useful, and are sometimes the best solution available (assuming the data meets their assumptions). Depending on the field, execs may be more comfortable with statistics than with some of the more black-box techniques in data science. (They may not want to calculate a logistic regression, but they may feel more comfortable with it than with the hidden nodes of a neural network, and the results may be very similar.)

However it is a very big toolbox, and no one type of tool should be it. Data science is far more than just statistics, but we shouldn't kick the statisticians out of the clubhouse completely.

I am currently developing a graduate certificate for a particular industry, and out of 5 courses I expect to spend no more than 1/4 of a single class talking about regression and logistic regression, and none at all on t-tests or ANOVA. Still, my students wouldn't have a full toolbox without those regression techniques.

However, I expect about half the class I am developing on data understanding and preparation will look to traditional statistics, albeit presented somewhat differently. Everything from descriptive statistics (mean, median, mode, correlation) to data types and basic exploratory graphing techniques is taught in normal intro stats classes. The difference will be how I contextualize those techniques into a world of data science, big data, and the tools we will use.

Am I correct, therefore, in seeing you as advocating a move toward the software engineering end of the spectrum? While I think those skills are useful, I worry that most business users still think these things are more Terminator than anything else. I believe we all have at least a decade of explaining this stuff ahead of us, even if behind the scenes we are automating it, before we will reach that point. And we will push a lot of really smart people out of the field if we emphasize that skill set over others.



Vincent Granville said:

The future of data science is to have algorithm parameters fine-tuned automatically, just as a Google car (with no human driver) has its parameters automatically fine-tuned for better performance and to avoid crashes. The results are superior to human piloting. The same applies to piloting planes, boats, or trains, and even to soldiers replaced by robots. But it requires a different type of algorithm: stable, and suitable for industrial black-box implementation. This is a billion miles away from statistical science, and is true data science.

Rebecca, these regression-like techniques and p-values, invented before the era of computers, are unusually difficult to understand and obscure to the non-initiated, while lacking stability. Little time should be spent on them in any data science curriculum; the focus should instead be on their robust versions. I would not dig into the theory of random variables and maximum likelihood estimation. Confidence intervals (and other statistical techniques) can be taught without invoking the theory of probability: see my paper. Very good, robust approximations to linear regression can be achieved with rudimentary techniques, without any coding; see another of my papers. Indeed, the most complicated part is getting the data, designing the data collection process, extracting online data, and identifying the right metrics even before data collection starts.
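One computational route to confidence intervals that sidesteps distribution theory is the percentile bootstrap: resample, recompute, read off quantiles. The sketch below is purely illustrative (it is not the specific method of the papers mentioned, and the data is invented):

```python
import random

def bootstrap_ci(sample, stat=lambda xs: sum(xs) / len(xs),
                 n_boot=2000, alpha=0.05, seed=42):
    """Percentile-bootstrap confidence interval for any statistic:
    resample with replacement, recompute the statistic, and read off
    the empirical quantiles -- no distributional assumptions needed."""
    rng = random.Random(seed)
    n = len(sample)
    stats = sorted(stat([rng.choice(sample) for _ in range(n)])
                   for _ in range(n_boot))
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

data = [4.1, 5.0, 3.8, 6.2, 5.5, 4.7, 5.9, 4.4, 5.1, 4.9]
low, high = bootstrap_ci(data)
print(low, high)  # a 95% interval around the sample mean (4.96)
```

The same recipe works for medians, correlations, or any other statistic, which is what makes it attractive as a first, theory-light treatment of interval estimation.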

What I've seen over and over is books where the first 150 pages are about basic probability, random variables, expectation, variance, distributions, and conditional distributions; then another 150 pages about regression, confidence intervals, ANOVA, and statistical tests; then 20 pages about an "advanced" topic such as Markov chains, time series, model fitting, limit theorems, or non-parametric statistics. Applications usually involve the same kind of artificial data sets, like Fisher's iris data. This is not data science; these topics should cover well below 10% of any data science curriculum, not 100%.

I think we are basically in agreement. Even though I intend to touch on those techniques, I don't intend to spend a whole lot of time on the math. I intend to talk about when to consider them and how to use them. However I am doing that with nearly all the techniques I am teaching.

I find that an awful lot of the programs I see are very heavy on theory and math. The business analytics program in my university's b-school is like that. Last January I got a grad assistant from that program. He told me he learned more from me in the first 3 weeks than he learned in all of his classes.

I actually worry about his peers. Throughout an entire master's program they never had to clean their own data (let alone get it from the myriad places it usually lives). A LOT of (particularly) business schools are creating programs in order to grab market share, with no real thought about what the real job is. Most are run by researchers who only know statistics, so that is what they teach. Few, if any, have applied experience. Unfortunately that is the real root of the problem.

Perhaps a better screening rule is this: if the instructors/main faculty are full-time faculty with little industry experience, that might not be the best program. Ask about the consulting projects the faculty has done - that will tell you whether you will be able to apply what you learn.

Vincent Granville said:

Rebecca, these regression-like techniques and p-values, invented before the era of computers, are unusually difficult to understand and obscure to the non-initiated, while lacking stability. Little time should be spent on them in any data science curriculum; the focus should instead be on their robust versions. I would not dig into the theory of random variables and maximum likelihood estimation. Confidence intervals (and other statistical techniques) can be taught without invoking the theory of probability: see my paper. Very good, robust approximations to linear regression can be achieved with rudimentary techniques, without any coding; see another of my papers. Indeed, the most complicated part is getting the data, designing the data collection process, extracting online data, and identifying the right metrics even before data collection starts.

What I've seen over and over is books where the first 150 pages are about basic probability, random variables, expectation, variance, distributions, and conditional distributions; then another 150 pages about regression, confidence intervals, ANOVA, and statistical tests; then 20 pages about an "advanced" topic such as Markov chains, time series, model fitting, limit theorems, or non-parametric statistics. Applications usually involve the same kind of artificial data sets, like Fisher's iris data. This is not data science; these topics should cover well below 10% of any data science curriculum, not 100%.

This is some of the worst advice you can ever give to budding data scientists and young learners. According to you, one doesn't need to know much statistics, and it should be just <10% of the whole curriculum. Are you guys seriously data scientists?

To Vincent:

Who said you don't need to understand the theory of probability, expectation, and estimation? Do you think that just running a random forest or SVM and getting a good result makes one a data scientist? What made you say maximum likelihood estimation and the rest are not at all useful? If you don't understand MLE, would you be able to understand the Expectation-Maximization algorithm? Oh, I forgot: per you, one doesn't even need to understand expectation. Then how on earth would you fully grasp the EM algorithm? Now you would say EM is not required. Really? You don't need EM to do density estimation? You don't use it in discriminant analysis, for Gaussian mixture models, or for infinite mixture models like Latent Dirichlet Allocation? How on earth are you supposed to do Bayesian analysis when you have not even tried learning maximum likelihood estimation? Would you understand what a prior is, what a posterior is, and what a likelihood function is? Then how do you expect to understand naive Bayes? Oh, I understand: you would just memorize Bayes' theorem for naive Bayes, without understanding what it actually does in terms of posterior and prior probabilities. And if you don't understand basic statistical distribution theory, how do you expect to understand simulation techniques like MCMC and Bayesian hierarchical models? Isn't MCMC used in deep learning too? Aren't restricted Boltzmann machines a form of hierarchical model?
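The prior/posterior mechanics being argued about fit in a few lines. A toy Bayes'-theorem calculation (the spam scenario and all numbers are invented for illustration):

```python
# Bayes' theorem: posterior = likelihood * prior / evidence.
# Toy spam-filter example: 20% of mail is spam, and the word "free"
# appears in 60% of spam and 5% of non-spam messages.
prior_spam = 0.2
p_word_given_spam = 0.60
p_word_given_ham = 0.05

# Evidence: total probability of seeing the word "free" at all.
evidence = (p_word_given_spam * prior_spam
            + p_word_given_ham * (1 - prior_spam))

# Posterior: probability the message is spam, given the word was seen.
posterior_spam = p_word_given_spam * prior_spam / evidence
print(round(posterior_spam, 3))  # 0.75
```

Seeing the word lifts the spam probability from the 20% prior to a 75% posterior; naive Bayes is essentially this computation repeated over many words under an independence assumption.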

I am shocked to see such a response and such advice from you. Just because you have been able to run some off-the-shelf algorithm, without ever feeling a need to understand it in detail because it gives you good accuracy, you think statistics is not even useful? Are you kidding me?

Let me ask one simple question. Random forests are known to perform better than bagged trees, which perform better than single trees. One reason, you might say, is that in random forests we only consider a random subset of the variables at each node rather than all of them. OK, makes sense. Now if I ask why that helps, you would tell me, if you have done more reading, that it de-correlates the trees. OK, makes sense again. How does that help? If you have done still more reading, you would say: oh, decorrelating the trees helps reduce the variance of the final estimator. OK. But how? If I ask for the mathematical basis of that last statement, how are you going to prove it to me unless you can tell me the variance formula for a combination of random variables? For two random variables X and Y, Var(X+Y) = Var(X) + Var(Y) + 2Cov(X,Y). This is a crude version - the actual random forests variance formula for correlated trees is a bit different - but it gives the idea: with no correlation, the covariance term is zero and the variance is smaller. But to explain anything at that level, wouldn't you need to understand what random variables and distribution theory are, and how random variables relate in terms of expectations and variances? And you say these are not important. Don't you realize that talking about reducing the variance of an estimator is inherently statistics? Why do you have to reduce the variance of an estimator; why not take any estimator? How does your bias-variance curve work? Don't you think the whole decomposition of the mean squared error of a regression into squared bias + variance is fundamentally statistical theory about estimators? And how do you think you would grasp theory about estimators if you don't even want to study what estimation is in the first place?
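The variance argument above can be checked numerically. For B identically distributed estimators with common variance sigma^2 and pairwise correlation rho, the variance of their average is rho*sigma^2 + (1 - rho)*sigma^2/B, so decorrelating the trees (driving rho toward 0) is exactly what shrinks the variance. A small simulation sketch (parameters invented for illustration):

```python
import numpy as np

def var_of_average(rho, sigma2=1.0, B=50, n_sims=200_000, seed=0):
    """Empirical variance of the mean of B equicorrelated estimators,
    simulated as draws from an equicorrelated multivariate normal."""
    rng = np.random.default_rng(seed)
    cov = np.full((B, B), rho * sigma2)   # pairwise covariances
    np.fill_diagonal(cov, sigma2)         # individual variances
    draws = rng.multivariate_normal(np.zeros(B), cov, size=n_sims)
    return draws.mean(axis=1).var()

# Theory: rho*sigma2 + (1 - rho)*sigma2/B
print(var_of_average(rho=0.0))   # ~ 1/50  = 0.02 (decorrelated "trees")
print(var_of_average(rho=0.5))   # ~ 0.51        (correlated "trees")
```

With full averaging but high correlation, almost none of the variance reduction survives, which is precisely the random-forest motivation for subsampling variables at each split.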

Or what about your favorite cross-validation or LOOCV techniques for model selection and assessment? May I ask why cross-validation works, and why we don't just use a single training error as an estimate of the test error? Don't you think that someone who understands the Law of Large Numbers would be able to understand why resampling methods like cross-validation and LOOCV are close estimates of the generalization error, while a single training error is not? Can you explain why cross-validation works, beyond saying it is used to tune hyperparameters, without knowing the Law of Large Numbers?
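For what it's worth, the k-fold mechanics under discussion fit in a dozen lines. A bare-bones sketch, where the "model" is just the training mean and the error is squared error, purely for illustration:

```python
def k_fold_cv(data, k, fit, error):
    """Average held-out error over k folds: every point is tested exactly
    once, so the estimate uses all the data instead of a single split."""
    folds = [data[i::k] for i in range(k)]   # simple interleaved folds
    total = 0.0
    for i in range(k):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        model = fit(train)
        total += sum(error(model, x) for x in test) / len(test)
    return total / k

# Toy example: the "model" is the training mean, scored by squared error.
fit_mean = lambda xs: sum(xs) / len(xs)
sq_err = lambda m, x: (x - m) ** 2

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(k_fold_cv(data, k=4, fit=fit_mean, error=sq_err))
```

Because the final number is an average of k held-out errors rather than one lucky (or unlucky) split, it is a more stable estimate of out-of-sample error, which is the law-of-large-numbers intuition being invoked above.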

Thirdly, I understand p-values are not quite right, and that is why you have Bayesian measures. But a data scientist's job is not just machine learning. Many times you have to do inference-based work like A/B testing. Ask companies like Google, Facebook, and Microsoft, which run tons of A/B tests every day. How are you supposed to A/B test if you don't understand hypothesis testing, experimental design, and so on?
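The A/B-testing point is concrete: the usual workhorse there is a two-proportion z-test on conversion rates. A minimal stdlib-only sketch (the conversion numbers are invented for illustration):

```python
from math import sqrt, erf

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)     # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF, via erf.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Variant B converts at 5.8% vs 5.0% for A, with 10,000 users per arm.
z, p = two_proportion_z(conv_a=500, n_a=10_000, conv_b=580, n_b=10_000)
print(z, p)  # call the lift significant at the 5% level if p < 0.05
```

Knowing when this test's assumptions hold (independent users, fixed sample sizes decided in advance, no peeking) is exactly the experimental-design knowledge the comment is arguing for.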

I am seriously appalled by the advice you give. Just because you can end up using any ML algorithm without having to learn much stats doesn't mean that is the right way.

I don't understand how baking powder makes baked goods rise, yet I have learned how to bake, how to troubleshoot a cookie that is flat, and what types of recipes work best with baking soda as opposed to baking powder. For that matter, I don't understand how electricity and some metal work together to produce heat in my oven, yet I can bake my cookies, fine-tune the temperature I need, and identify when the oven isn't working and I need the help of an expert.

We go through every day of our lives not understanding the why's of much of what we do, yet we can become quite expert at many of them.

Data science is a HUGE undertaking, and if you buy Vincent's perspective, one that is rapidly becoming more about systems engineering and programming than anything else. Statistics is one part of it. And while, as I said, I think you need some familiarity with it, it is unreasonable to expect every data scientist to be an expert in stats, software development, and machine learning, not to mention a functional area or two. We need people who understand all that stuff to the level you are suggesting, but it cannot be the main or only focus of the training. New data scientists need a wide range of skills, of which statistics is one. But much like an excellent cook or pastry chef, they need some recipe templates (most cooking schools teach proportions) and a good grounding in what to consider if they aren't happy with the results.

Depth and the ability to answer strings of 'why' questions like you mention come with experience and, quite frankly, with coming up against problems that require that level of knowledge. At the start they need more breadth of knowledge about the process. (I have a grad assistant who can explain in detail how maximum likelihood works and all the stats that go into a Monte Carlo model, but who had never had to figure out how to bring together multiple data sources at different levels of granularity before he started with me. Would you really want to hire him? He didn't even realize that it mattered, which is just asking for a mess of a model.)

Would you like to hire someone who doesn't even know how random forests improve on other trees, or how to actually tune the parameters? Should I be using more trees or fewer, and if more, why? Do you think it all comes down to simply trying every possible combination and seeing which one gives the lowest error? If that is how you want a data scientist to work, then I would never hire one of your students. Anybody can take a Coursera course and start running ML using scikit-learn. But how do people know what tuning needs to be done? Let me give you one example. You want to run a clustering algorithm. You have learned the k-means algorithm, so you run it without looking at the data, not even knowing that k-means is very sensitive to the initial seed points. Let's forget about seed points for a moment. You run k-means and you get some clusters. Now may I ask you one thing? If your data is not spherically distributed but, say, ellipsoidally distributed, do you think k-means was the right technique? Shouldn't you be working with a GMM for such data? But how will you know? I forgot: you have only learned how to run the algorithm; you have no clue whether the clusters it returns are right or wrong, because you never bothered to understand the right way to run it.
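The seed-sensitivity complaint is easy to demonstrate. A tiny Lloyd's-algorithm sketch on four invented points, where one initialization finds the two elongated clusters and another gets stuck splitting them the wrong way:

```python
def lloyd(points, centroids, iters=20):
    """Plain Lloyd's (k-means) iteration; returns final centroids and the
    total within-cluster squared error (inertia)."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:                      # assignment step
            d = [(p[0]-c[0])**2 + (p[1]-c[1])**2 for c in centroids]
            clusters[d.index(min(d))].append(p)
        centroids = [                         # update step
            (sum(p[0] for p in cl)/len(cl), sum(p[1] for p in cl)/len(cl))
            if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    inertia = sum(min((p[0]-c[0])**2 + (p[1]-c[1])**2 for c in centroids)
                  for p in points)
    return centroids, inertia

# Two elongated clusters: {(0,0),(0,1)} and {(4,0),(4,1)}.
pts = [(0, 0), (0, 1), (4, 0), (4, 1)]
_, good = lloyd(pts, [(0.0, 0.5), (4.0, 0.5)])  # splits left vs right
_, bad = lloyd(pts, [(0.0, 0.0), (0.0, 1.0)])   # converges to a local optimum
print(good, bad)  # the bad seeding yields a much larger inertia
```

Both runs converge, so nothing in the tool's output flags the bad result; only inspecting the inertia (or the data) reveals it, which is the commenter's point about understanding the algorithm rather than just running it.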

It's a two-step process. You need thorough statistics to do data science the right way, and you also need good computing skills to put that statistics to use. It can't be a one-way street; you have to learn both. But bashing statistics just because you never felt the need to know whether you were using an algorithm the right way is totally wrong. You can't ignore statistics, just as you can't ignore coding and computing skills.

You can understand how to tune an algorithm, which models work best for which types of data distributions, and all sorts of other things without digging into statistics to the level you are suggesting.

I completely agree that a top, experienced data scientist does need to understand all the whys and wherefores in order to handle those curve balls. But someone just out of school will have that level of expertise only at the expense of other topics they didn't learn. The point made above was about training and degree programs, NOT about the top practitioners.

So tell me this - if you want me to add a ton of stats to my program, what do you suggest I drop? Data prep/management (which I teach, including descriptive stats)? Data visualization? Text analytics? In the end it's a time issue. Programs can only cover so much, and if TOO much of that is stats, then other things will suffer.

Aha, I see, this article is back to bashing statisticians.

What's wrong with ANOVA?  And what makes you think that knowledge of ANOVA is redundant & not useful?

So, according to you, the following is something to be avoided by those who want to get into data-science?

"Multivariate wavelet kernel (ANOVA) regression method"

https://hal.archives-ouvertes.fr/hal-00616280/document

The WK-ANOVA paper above is relatively recent & it has been shown to outperform COSSO (COmponent Selection & Smoothing Operator).

"Component Selection and Smoothing in Multivariate Nonparametric Regression"

http://www4.stat.ncsu.edu/~hzhang/paper/thecosso.pdf

Any data scientist must start from the basics, & that means learning what ANOVA is & learning descriptive statistics first. That's the key to becoming a data scientist: the learner must start from basic Stats 101. And for God's sake, can we stop bashing statistics as if we are superior to statisticians?

Quote :  "Also you don't need to know matrix algebra"

I think that the author is clueless to what he's talking about.

Most of the state-of-the-art methods in data science are matrix factorizations, be it QR, eigendecomposition, SVD, NMF, ICA, random projection, Cholesky, LU factorization, Isomap, Locally Linear Embedding, and so forth. To be a data scientist, one must master matrix algebra to understand when each method is advantageous & when its pitfalls make it one to avoid. Otherwise, one can just learn a tool & be a tool user in, say, SAS - if that's all one wants to learn to be a data scientist. At least, that is what the author is advocating: just be an expert tool user who doesn't need to fully understand the underlying methods/algorithms.
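Several of the factorizations listed reduce to a one-line NumPy call; the understanding being argued for is in knowing what the pieces mean. A minimal SVD-based low-rank approximation on synthetic data (the rank-2 matrix is invented for illustration):

```python
import numpy as np

# Build a noisy rank-2 matrix, then recover the best rank-2 approximation.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 5))  # true rank 2
A += 0.01 * rng.standard_normal(A.shape)                       # small noise

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
# Eckart-Young: truncating the SVD gives the best rank-k approximation.
A_k = U[:, :k] * s[:k] @ Vt[:k, :]

print(np.linalg.norm(A - A_k))  # residual is only the added noise
```

This same truncation underlies PCA, LSA, and many recommender baselines; knowing that the residual equals the discarded singular values is the kind of matrix-algebra fact the commenter has in mind.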

This is not about bashing statisticians - and anyway, you can't put all statisticians in the same bag. It's about explaining the difference between data science practice and statistical practice. Saying that data scientists don't need to know astronomy does not mean that astronomy is bad; it means that, unless you analyze space data, you can do without it. Same with statistics.


© 2017   Data Science Central
