This was the subject of a popular discussion recently posted on Quora: 20 questions to detect a fake data scientist. We asked our own data scientist, and he came up with a very different set of questions: compare his answer (#1 below - 20 questions) with Quora replies (#2 and #3 below - 30 questions). Note that #2 focuses on statistics, and #3 on architecture. The link to the original Quora discussion is also provided in this article. Which questions would you add or remove?
Many other related interview questions and answers (data science, R, Python and so on) can be found here.
1. Answer from our data scientist (many of these questions are open questions):
2. This answer was posted on Quora by Jay Verkuilen:
3. Kavita Ganesan offered this answer:
You can find the original contribution here.
Comment
Question 14 : Is mean imputation of missing data acceptable practice? Why or why not?
Most imputation methods currently used should not be acceptable. Imputation can totally screw up modeling and analysis. In general, imputation should be done as part of a joint optimization with the model being fit and predictions being made. To do otherwise is generally sub-optimal, possibly in a huge way.
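One concrete way mean imputation can distort an analysis is by understating spread. The sketch below is only an illustration, not the commenter's method: the data are synthetic, the 40% missing-completely-at-random rate is an assumption, and `mean_impute` is a hypothetical helper written for this example.

```python
import random
import statistics


def mean_impute(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]


random.seed(0)

# Assumption: synthetic, normally distributed data.
full = [random.gauss(50.0, 10.0) for _ in range(2000)]

# Assumption: about 40% of values go missing completely at random.
with_gaps = [None if random.random() < 0.4 else v for v in full]

imputed = mean_impute(with_gaps)
observed = [v for v in with_gaps if v is not None]

sd_observed = statistics.stdev(observed)
sd_imputed = statistics.stdev(imputed)

# The mean is preserved, but the spread shrinks because every filled-in
# value sits exactly at the mean.
print(f"std of observed values: {sd_observed:.2f}")
print(f"std after imputation:   {sd_imputed:.2f}")
```

Any model fit to the imputed column sees less variability than the data really have, which is one reason to treat imputation and modeling together rather than as separate steps.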
If Kavita Ganesan weren't such a pretentious twit, before asking "What does NLP stand for? Some data scientists claim to also do NLP." maybe she would have found out that long before NLP stood for Natural Language Processing, it stood for, and still does stand for, Nonlinear Programming. Does she realize that Nonlinear Programming is another name for nonlinear optimization, which underlies so many data science and statistical algorithms, and is even used in Natural Language Processing? As an expert in NLP, maybe she should know there is more than one meaning of NLP that is applicable to data science. Isn't recognizing multiple possible meanings, in context, part of NLP? These know-it-alls who don't know it all really gall me.
Are we talking about unstructured (text) or structured data?
What is your definition of Big Data?
In my opinion, one cannot call population income a Big Data problem. You may have 24 million income records, but you really need only a small sample to estimate the income distribution and compute summary statistics. It may be a Big Data problem for IT, but it requires only Small Data analysis.
In some cases, you may need Big Data technology to build an ML model, but after the model is built and put into production, the need goes away.
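The point about income needing only a small sample can be illustrated with a rough simulation. Everything here is assumed for the sake of the sketch: the log-normal shape, its parameters, and a population scaled down from 24 million so the code runs quickly.

```python
import random
import statistics

random.seed(42)

# Assumption: a synthetic log-normal "population" of incomes, scaled down
# from the 24 million mentioned above.
population = [random.lognormvariate(10.5, 0.8) for _ in range(200_000)]

# A simple random sample of 1% of the population.
sample = random.sample(population, 2_000)

pop_median = statistics.median(population)
sample_median = statistics.median(sample)
relative_error = abs(sample_median - pop_median) / pop_median

print(f"population median: {pop_median:,.0f}")
print(f"sample median:     {sample_median:,.0f}")
print(f"relative error:    {relative_error:.2%}")
```

The precision of such an estimate depends mainly on the sample size, not on the population size, so the same 2,000 draws would serve about as well for 24 million records, provided the sample is truly random.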
For those who are interested in interview questions, there is a GitHub page [1] with lists of interview questions. I have added a new category for Data Science. You can look for other questions there if you are interested.
[1] https://github.com/MaximAbramchuck/awesome-interviews
I was referring to yours :) Stark difference between your questions and those on Quora. Interesting to see that the focus of the data science community is still very much on modelling and algorithm choice.
Question 14 : Is mean imputation of missing data acceptable practice? Why or why not?
For regression tasks on temporal data, imputation is necessary. My favorite imputation pre-processing step in time-series analysis is Kalman filter smoothing. Once that is done, other steps such as noise removal can be applied, followed by forecasting.
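A minimal sketch of what imputing with a Kalman smoother can look like follows. The comment does not say which state-space model is used, so the local-level (random walk plus noise) model, the noise variances `q` and `r`, and the function name `kalman_smooth_impute` are all assumptions made for illustration.

```python
def kalman_smooth_impute(y, q=1.0, r=0.1):
    """Fill None entries of y using a local-level Kalman filter and smoother.

    Assumed model: state x_t = x_{t-1} + w_t, observation y_t = x_t + v_t,
    with Var(w_t) = q and Var(v_t) = r.
    """
    n = len(y)
    first = next(v for v in y if v is not None)
    m_pred, p_pred = [0.0] * n, [0.0] * n
    m_filt, p_filt = [0.0] * n, [0.0] * n

    # Forward pass (filter). Start from a vague prior centred on the first
    # observed value.
    m, p = first, 1e6
    for t in range(n):
        if t > 0:
            p = p + q  # predict: the mean is unchanged for a random walk
        m_pred[t], p_pred[t] = m, p
        if y[t] is not None:  # update only when an observation exists
            k = p / (p + r)
            m = m + k * (y[t] - m)
            p = (1.0 - k) * p
        m_filt[t], p_filt[t] = m, p

    # Backward pass (Rauch-Tung-Striebel smoother) for the state means.
    m_smooth = m_filt[:]
    for t in range(n - 2, -1, -1):
        g = p_filt[t] / p_pred[t + 1]
        m_smooth[t] = m_filt[t] + g * (m_smooth[t + 1] - m_pred[t + 1])

    # Keep observed values; fill gaps with the smoothed state estimate.
    return [m_smooth[t] if v is None else v for t, v in enumerate(y)]


series = [0.0, 1.0, 2.0, None, 4.0, 5.0, 6.0]
filled = kalman_smooth_impute(series)
print(filled)
```

Because the smoother uses observations on both sides of a gap, the filled value reflects what comes after the gap as well as what came before it. In practice `q` and `r` would be estimated from the data rather than fixed by hand.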
I'd love to provide the answers, at least to my questions. Finding time is difficult.
Thank you for sharing this. Great Questions.
If you have time, Vincent, it would also be great to see what you think makes a good response!
© 2018 Data Science Central ®