
While there are now many data science programs worth attending (see for instance programs from top universities), there are still programs advertising themselves as data science that are snake oil at worst and misleading at best.

[Figure caption: Many fake data scientists work on global warming, on both sides.]

This one, posted on DataCamp (see details in the next paragraph), is essentially old statistics with R, and won't help you get a data science job or work on interesting projects. Having said that, you should learn R anyway; it's part of data science. But please learn it and apply it to projects that will help your career, not old-fashioned examples like the famous iris data set, made up of 3 clusters and 150 observations. And don't waste money and time (tons of hours) on such a program when the material is available for free on the Internet and can be assimilated in a few hours.
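
For perspective, here is a minimal sketch (Python, using scikit-learn's built-in copy of the iris data) showing that clustering this famous data set is a few-minutes exercise, not a multi-week course:

    # Minimal sketch: clustering the iris data set (150 observations,
    # 4 variables) is a few lines of code, not a 26-hour program.
    from sklearn.cluster import KMeans
    from sklearn.datasets import load_iris

    X = load_iris().data
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print(labels[:10])   # cluster assignment for the first 10 flowers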

Example of fake data science program

Build a solid foundation in data science, and strengthen your R programming skills. Learn how to do data manipulation, visualization and more with our R tutorials.

  • Write your first R code, and discover vectors, matrices, data frames and lists.
  • Seven courses on Student's t-tests, ANOVA, correlation, regression, and more (26 hours).

How to detect fake data science

Anytime you see a program dominated by ANOVA, t-tests, linear regression, and, generally speaking, stuff published in any statistics 101 textbook dating back to 1930 (before computers existed), you are not dealing with actual data science. While it is true that data science has many flavors and does involve a bit of old-fashioned statistical science, most of the statistical theory behind data science has been entirely rewritten in the last 10 years, and on many occasions invented from scratch to solve big data problems. You can find the real stuff, for instance, in Dr. Granville's Wiley book and his upcoming Data Science 2.0 book (free), as well as in DSC's data science research lab. The material can be understood by any engineer with limited or no statistical background. It is indeed designed for them, and for automation / black-box usage - something classical statistics has been unable to achieve so far.

Also, you don't need to know matrix algebra to practice modern data science. When you see 'matrices' in a data science program, it's a red flag, in my opinion.

More warnings about traditional statistical science

Some statisticians claim that what data scientists do is statistics, and that we are ignorant of this fact. There are many ways to solve a problem: sometimes the data science solution is identical to the statistical solution. But typically, the data science solution is far easier to understand and to scale. An engineer, someone familiar with algorithms or databases, or a business manager will easily understand it. Data science, unlike statistics, has no mystery, and does not force you to choose among hundreds of procedures to solve a problem.

In some ways, data science is simpler, more unified, and more powerful than statistics; in other ways it is more complicated, as it requires strong domain expertise for successful implementation, as well as expertise in defining the right metrics and chasing or designing the right data (along with data harvesting schedules).

It is interesting that Granville's article on predicting records was criticized by statisticians as being statistical science, when in fact you can understand the details without ever having attended a basic statistics class. Likewise, engineers extensively use concepts such as copulas - something that looks very much like statistical science - yet one that has never been used by classical statisticians (it's not in any of their textbooks).

In short, some statisticians are isolating themselves more and more from the business world, while at the same time claiming that we - data scientists - are just statisticians with a new name. Nothing could be more wrong. 


Replies to This Discussion

I've never used matrix theory since completing my PhD, not once. Never used standard linear regression either, except on rare occasions, and it was just a couple of lines of code (calling a function) and a 5-minute job. I did invent Jackknife regression, though, which is far more robust, easier to interpret, faster to compute, and almost as accurate. That is what I consider to be data science.
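
For illustration only, here is a sketch of the classical leave-one-out (jackknife) resampling of regression coefficients - the textbook jackknife, not necessarily the "Jackknife regression" variant referenced above, which is described elsewhere:

    # Sketch: classical leave-one-out (jackknife) resampling of regression
    # coefficients. Illustrative only -- not necessarily the "Jackknife
    # regression" method mentioned in this comment.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

    A = np.column_stack([np.ones(len(y)), X])       # design matrix with intercept
    coefs = []
    for i in range(len(y)):                         # leave observation i out
        mask = np.arange(len(y)) != i
        beta, *_ = np.linalg.lstsq(A[mask], y[mask], rcond=None)
        coefs.append(beta)
    print(np.mean(coefs, axis=0))                   # jackknifed coefficient estimates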

If your knowledge consists of stuff found in books published 80 years ago (linear regression, t-tests, R-squared), widely available worldwide for free, how do you expect to get a good job? You'll be under constant threat of having your job outsourced to statisticians in India, Hungary, Egypt, or China. Besides, data scientist is a career, not a skill set. It is ridiculous to

  1. spend so many hours on these linear regression or R classes (you can and should learn R in 3 hours if you know C or some other programming language),
  2. and spend no time on actual data science.

At least, that is, if your purpose is to become a data scientist, command the salary associated with this (senior) job title, and work on projects that require data science (as opposed to projects that do not). My 2 cents.

You can understand how to tune an algorithm without knowing that level of statistics? That's one of the most ludicrous arguments I have heard. You know what? The problem is that we have tools and libraries like scikit-learn where people keep fiddling with parameters until they get good accuracy, and then they believe they have become data scientists.

Let me ask you something, then. I give you a dataset and ask you to do clustering. You have learnt a favorite clustering algorithm called k-means, so you simply run it, keep fiddling with parameters, and get some clusters. Is that correct? Did you even bother visualising the data before you used k-means? Do you know that if the data is ellipsoidally distributed, k-means will give very wrong answers, because it assumes the data is spherically distributed? But no, you were busy fiddling with parameters. Why would you bother knowing all that, right?
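
The point is easy to demonstrate; a quick sketch in Python/scikit-learn (the linear transform below is just an arbitrary stretch chosen for illustration):

    # Sketch: k-means assumes roughly spherical, equal-variance clusters.
    # Stretch the same blobs with a linear transform and the adjusted Rand
    # score (agreement with the true labels) typically drops.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import adjusted_rand_score

    X, y = make_blobs(n_samples=600, centers=3, random_state=170)
    stretch = np.array([[0.6, -0.6], [-0.4, 0.8]])  # makes the clusters ellipsoidal
    for name, data in [("spherical", X), ("ellipsoidal", X @ stretch)]:
        pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(data)
        print(name, round(adjusted_rand_score(y, pred), 3))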

And I seriously don't understand why it is difficult to have a curriculum which includes almost everything. You just need two courses in statistics to understand statistical distribution theory, estimation, and even hypothesis testing. Then you need two courses in machine learning: one basic machine learning course in the first semester and one advanced machine learning course in the second semester. Keep Stats 101 in the first semester and Theory of Statistics (estimation, law of large numbers, hypothesis testing) in the second semester, as it requires knowledge from Stats 101. That's just 4 courses. Then you can put an applied machine learning or data science course in the first semester, where you basically introduce a programming language like Python or R and ask students to apply machine learning to real datasets. That makes it 5 courses. You can then introduce a course on distributed computing and another on data structures and algorithms, as those are very important once you start working with big datasets. That makes it seven. For the last course you can have either visualization or one more stats course, such as Bayesian statistics, covering simulation techniques like MCMC and hierarchical models. Knowing hierarchical models would help someone better understand all these deep learning algorithms, rather than just being told to use deep learning.

This makes it 8 courses, or a two-semester program. If you have a three-semester program, you can introduce more. How on earth are you telling me it's not possible? Because you never bothered trying to learn statistics, and you feel it's a whole huge field in itself. If you know which statistics is important to know, it's not more than three academic courses.

And please, anyone who says matrices are not important and Stats 101 is not important: you need to check your credentials as a data scientist. As Sione mentioned above, if you do not even understand what span, basis, eigendecomposition, and SVD are, you are basically doing nothing in data science. From computer vision to basic dimensionality reduction techniques, SVD is one of the most important tools in use. You can mug up SVD, but that's not the right way to learn.
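
For the record, the SVD machinery in question is a few lines of NumPy; a small sketch of decomposition and low-rank reconstruction:

    # Sketch: decompose a matrix with SVD and rebuild its best rank-2
    # approximation (Eckart-Young) from the top singular vectors.
    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(8, 5))
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    k = 2
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # rank-2 approximation
    print(np.linalg.norm(A - A_k))                  # Frobenius-norm error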

And just because Andrew Ng mentioned in his Coursera machine learning class that you need not worry about matrices and probability doesn't mean you don't need all that. He said that because he was teaching an abridged version of machine learning. Check the actual course he taught at Stanford; the videos are on YouTube. The first prerequisite is Stats 101, which covers probability and distribution theory. He clearly talks about maximum likelihood estimation for GMMs and discriminant analysis in those videos, which is theory of statistics. And you guys say stats is not important.

If you ever checked the prerequisites of any academic machine learning course, from any top or even low-ranked university, you would see that the first prerequisite is always probability theory. The whole of statistical learning theory, or VC theory, in machine learning is itself built on the law of large numbers. But how would you know that? You don't even know the law of large numbers.
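
The law of large numbers invoked here is easy to see numerically; a tiny simulation:

    # Sketch: the sample mean of i.i.d. draws converges to the true mean
    # (here 2.0) as the sample size grows -- the law of large numbers.
    import numpy as np

    rng = np.random.default_rng(42)
    draws = rng.exponential(scale=2.0, size=1_000_000)
    for n in (10, 1_000, 1_000_000):
        print(n, draws[:n].mean())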

Rebecca Barber, PhD said:

You can understand how to tune an algorithm, which models work best for which types of data distributions, and all sorts of other things, without digging into statistics to the level you are suggesting.

I completely agree that a top, experienced data scientist does need to understand all the whys and wherefores in order to handle those curve balls. But someone just out of school will have that level of expertise only at the expense of other topics they didn't learn. The point made above was about training and degree programs, NOT about top practitioners.

So tell me this: if you want me to add a ton of stats to my program, what do you suggest I drop? Data prep/management (which I teach, including descriptive stats)? Data visualization? Text analytics? In the end it's a time issue. Programs can only cover so much, and if TOO much of that is stats, then other things will suffer.

Data science is not about eigenvalues and matrix decompositions, not even remotely. I have to disagree on that. The foundation of data science is delivering insights from data, not mathematical statistics. If you need matrix algebra to achieve this goal, that's fine, but the majority of us don't need that level of mathematics to successfully complete data science projects. Manipulating lists and hash tables, and identifying the correct variables and datasets before your project starts, can produce the same if not better results, for a fraction of the cost (no need to hire a statistician and a manager who understands the jargon). There are always 10 different ways to solve a problem. You need to balance simplicity and accuracy against implementation cost, the risk of misuse by non-experts, maintenance costs, and scalability when choosing an approach.

@Data Science Girl -

If I give you a dataset with 1000 variables and 100 observations, and ask you to find the most significant variables for a response, how do you think you would accomplish this task, given your experience and knowledge of data science?

Manish, here's a data science solution to your question. No matrix algebra involved. 
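
(The linked solution is not reproduced here. As one hypothetical illustration of a matrix-free approach to the 1000-variables / 100-observations question, a simple correlation screen works in a few lines:)

    # Hedged sketch: rank 1000 candidate variables by absolute correlation
    # with the response and keep the top few. The article linked above may
    # use a different method; this only illustrates the "no matrix algebra"
    # idea on synthetic data.
    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 1000))                # 100 observations, 1000 variables
    y = 2.0 * X[:, 7] - 1.5 * X[:, 42] + rng.normal(size=100)

    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    top = np.argsort(scores)[::-1][:5]
    print(top)                                      # columns 7 and 42 should rank high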

While it is admittedly a bit scary to weigh in amongst you various scientists of different stripes, I will nevertheless offer a different opinion, as a business person with a minor (at best) in the science of data analytics and a great deal of passion about job titles in this space. First of all, I'll start with something I learned years ago in org design work, i.e. the need to distinguish "roles" from "jobs". If I want a perspective on how best to stock my store shelves ahead of an upcoming hurricane, I hope the following roles are in place: a data "finder" who can source the right weather inputs; a data "normalizer" who can integrate all inputs to lay out the findings from past hurricanes; a data "analyzer" who can confidently tell me what they've learned from past patterns and predict which products are most likely to be purchased; and a data "visualizer" who can put this all in a format I can share with my fellow executives.

I don't care if these roles are filled by 1 person or 3, and I don't care whether it's ANOVA or jackknife regression, but I do care that the person running the analytics has the proper training to identify an appropriate sample size, and that the work they produce can stand up to the scrutiny of the type of folks commenting here. From a b-school perspective, I think I need to know what regression is, what correlation is, why k-means is as cool as it is, etc., so I can better envision the art of the possible, but I don't think I need to know how to do it. I think data science should be a two-term option in business school, but nothing more than that. I think data scientists should be as deep statistically as they can be, and I think the elements of data management are best left to IT groups.

Julia, there's the word "data" in "data scientist". That makes it different from "statistician", and I also believe that, as a result, a critical role of data scientists is data management. My 2 cents.

Most importantly, data scientists don't need to be pure technicians focused on theory; they can be MBAs, executives, or engineers, with a data vision rather than a bag of tactical tricks. Some practitioners like me are actually data producers rather than data consumers (I'm on both sides, actually). That does not mean I don't know anything about matrix algebra or eigenvalues - after all, I spent 2 years at Cambridge on my postdoc, focusing on Bayesian hierarchical models and MCMC - but it means that I use whatever strategy works best, given budget constraints, client expertise with stats, scalability, maintenance, replicability, robustness, accuracy, platform-compatibility issues, risks associated with black-box or real-time implementation, and more.

Pardon my ignorance, but this appears to be something that could be applied in certain esoteric domains like NASA, Boeing, Intel, etc. What about the future of data science at the enterprise level, for an ordinary organization?

Vincent Granville said:

The future of data science is to have algorithm parameters fine-tuned automatically, just as a Google car (without a human driver) has its parameters automatically fine-tuned for increased performance and to avoid crashes. The results are superior to human piloting. The same applies to piloting planes, boats, or trains, and even to soldiers replaced by robots. But it requires a different type of algorithm: stable, and suitable for industrial black-box implementation. This is a billion miles away from statistical science, and it is true data science.

Why, precisely, do you think it is appropriate to assert that a profile is fake? Honestly, I just wrote off half of what you've said based on your attacking posters rather than their arguments. After all, this appears to be the only thread you've ever responded to as well. What makes you any different from Data Science Girl?

BTW, you are aware that this is Vincent's site, right? You can agree or disagree with him, but he certainly has no need to create fake profiles just to make a point; he has always been very straightforward.

You, on the other hand, haven't told us anything about your bona fides. Clearly nobody explained to you what an ad hominem argument is, or why attacking the person rather than their argument is particularly ineffective.


Manish Tripathi said:

With all due respect, firstly I would appreciate it if you stopped fake profiles of women, like the above Data Science Girl, from posting anything here. Don't create fake profiles to bolster your post or point.

Secondly, again with all due respect, I have started doubting your own claims about academic degrees from Cambridge and work on MCMC, etc. Could you enlighten us about your credentials, please? I would appreciate it even more if you had no academic credentials, as that doesn't matter much, and came out clean as a person who learnt it all by himself, rather than dropping a line here and there about your PhD and postdoc. That would help us a lot, instead of talking to fake people and profiles here.

Data Science Girl said "Data science is not about eigenvalues and matrix decomposition, not even remotely".

Did I say that? Umm, I think you're putting words into my mouth. Besides, the matrix decompositions I listed above are not all eigenvalue-based: ICA is not, NMF is not, and so on. You seem to lump every matrix decomposition method in with eigenvalues. What I said in my previous messages, in reply to the author of this article when he dismissed or discouraged learning matrix theory in order to become a data scientist, is that most of today's sophisticated techniques in data science are matrix-based, and it seemed to me that the author has not experienced sophisticated matrix-based data analysis before making such a comment, considering that he presents himself as an authority in this domain of data science.

If you haven't worked with matrices, Data Science Girl, then I agree with what Manish said in his comment above, when he asked what you do with a billion variables in a dataset. Any idea? Let me give you a hint: you reduce the dataset to a lower rank and do your analysis in the low-rank space that your high-dimensional data has been projected down to. It can reveal hidden features and patterns that are not easily analysed in the original, contaminated high-dimensional data.

See how these folks at Microsoft reduce a huge dataset, a ~44 million by 769 million matrix, to low rank using Gaussian, Poisson, and Exponential non-negative matrix factorization implemented on Hadoop:

"Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce"

http://research.microsoft.com/pubs/119077/DNMF.pdf
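
At toy scale, the same factorization idea fits in a few lines with scikit-learn (the in-memory analogue, not the distributed MapReduce implementation from the paper):

    # Sketch: factor a nonnegative matrix V (here 100 x 50, not 44M x 769M)
    # into low-rank nonnegative factors W (100 x 5) and H (5 x 50).
    import numpy as np
    from sklearn.decomposition import NMF

    rng = np.random.default_rng(0)
    V = np.abs(rng.normal(size=(100, 50)))
    model = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
    W = model.fit_transform(V)
    H = model.components_
    print(np.linalg.norm(V - W @ H))                # reconstruction error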

Again, back to Manish's question: how would you deal with a huge dataset like that, Data Science Girl? Umm, I suppose you would just analyze the raw high-dimensional data without reducing it to low rank.

There are many ways to reduce dimensionality. Again, read this article; it's one modern way to do it, not involving matrices, but instead using algorithms like simulated annealing. Claiming that the only way to do it involves matrix operations is tunnel vision, or lack of exposure to modern techniques. Most step-wise, iterative processes that I know of (for dimensionality reduction) do not involve matrix operations. The worst one, PCA (principal component analysis), does require matrix operations, but it leads to meaningless variables, and using variance (an L2 metric) to measure noise is a bad choice: it is sensitive to outliers. Generally speaking, any solution involving matrices will be unstable when the determinant is close to zero, which happens frequently with correlated variables (a common occurrence in practice).
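
As a hedged sketch of what such a matrix-free, step-wise approach can look like (the linked article's exact algorithm may differ), here is simulated annealing over feature subsets, scoring each subset by correlation with the response minus a redundancy penalty:

    # Sketch: simulated annealing over feature subsets. The score rewards
    # correlation with y and penalizes redundant (inter-correlated) features
    # and subset size; no matrix decompositions involved. Illustrative only;
    # the article referenced above may use different scoring or scheduling.
    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(200, 30))
    y = X[:, 4] - 2.0 * X[:, 11] + rng.normal(scale=0.5, size=200)

    def score(subset):
        gain = sum(abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset)
        redundancy = sum(abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                         for a in subset for b in subset if a < b)
        return gain - 0.5 * redundancy - 0.05 * len(subset)

    current = set(rng.choice(30, size=3, replace=False).tolist())
    best, best_score, temp = set(current), score(current), 1.0
    for step in range(500):
        candidate = set(current)
        candidate.symmetric_difference_update({int(rng.integers(30))})  # flip one feature
        if not candidate:
            continue
        delta = score(candidate) - score(current)
        if delta > 0 or rng.random() < np.exp(delta / temp):  # sometimes accept worse moves
            current = candidate
        if score(current) > best_score:
            best, best_score = set(current), score(current)
        temp *= 0.99                                # cooling schedule
    print(sorted(best))                             # should include columns 4 and 11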

Also, the OP is not talking about eliminating stats entirely; he is pointing out that if all your focus is on Stats 101 (as taught in colleges since 1950), then what you've learned is not data science, and your job prospects are low. Many of the things that data scientists do are indeed real stats, but are not recognized as stats by some statisticians, just as copulas, used by engineers, are real stats but not recognized as stats by some statisticians. If statisticians don't want to acknowledge that some of what we do is real stats (because it's different from what they do), I'm inclined to agree with them and say: OK, it's not stats, it's data science.
