
50 Questions to Test True Data Science Knowledge

This was the subject of a popular discussion recently posted on Quora: 20 questions to detect a fake data scientist. We asked our own data scientist, and he came up with a very different set of questions: compare his answer (#1 below - 21 questions) with the Quora replies (#2 and #3 below - 30 questions). Note that #2 focuses on statistics, and #3 on architecture. The link to the original Quora discussion is also provided in this article. Which questions would you add or remove?

Many other related interview questions (data science, R, Python and so on) can be found here.

1. Answer from our data scientist (many of these questions are open questions):

  1. What is the life cycle of a data science project?
  2. How do you measure yield (over baseline) resulting from a new or refined algorithm or architecture?
  3. What is cross-validation? How do you do it right? (See the sketch after this list.)
  4. Is it better to design robust or accurate algorithms?
  5. Have you written production code? Prototyped an algorithm? Created a proof of concept?
  6. What is the biggest data set you have worked with, in terms of training set size, and in terms of having your algorithm implemented in production mode to process billions of transactions per day / month / year?
  7. Name a few famous APIs (for instance Google search). How would you create one?
  8. How to efficiently scrape web data, or collect tons of tweets?
  9. How do you optimize algorithms (parallel processing and/or faster algorithms)? Provide examples of both.
  10. Examples of NoSQL architecture?
  11. How do you clean data?
  12. How do you define / select metrics? Have you designed and used compound metrics?
  13. Examples of bad and good visualizations?
  14. Have you been involved - as an adviser or architect - in the design of dashboards or alarm systems?
  15. How frequently must an algorithm be updated? What about lookup tables in real-time systems?
  16. Provide examples of machine-to-machine communication.
  17. Provide examples where you automated a repetitive analytical task.
  18. How do you assess the statistical significance of an insight?
  19. How to turn unstructured data into structured data?
  20. How to very efficiently cluster 100 billion web pages, for instance with a tagging or indexing algorithm? 
  21. If you were interviewing a data scientist, what questions would you ask her?
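
For question 3, here is a minimal k-fold cross-validation sketch in Python (scikit-learn; the synthetic dataset and the logistic-regression model are placeholders, not recommendations):

```python
# Minimal k-fold cross-validation sketch (question 3).
# The dataset and model below are placeholders.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# "Doing it right": shuffle, stratify on the label, and keep any
# preprocessing inside the folds to avoid leakage.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="roc_auc")
print(scores.mean(), scores.std())
```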

2. This answer was posted on Quora by Jay Verkuilen:

  1. Explain what regularization is and why it is useful. What are the benefits and drawbacks of specific methods, such as ridge regression and LASSO? (A sketch follows this list.)
  2. Explain what a local optimum is and why it is important in a specific context, such as k-means clustering. What are specific ways for determining if you have a local optimum problem? What can be done to avoid local optima?
  3. Assume you need to generate a predictive model of a quantitative outcome variable using multiple regression. Explain how you intend to validate this model.
  4. Explain what precision and recall are. How do they relate to the ROC curve?
  5. Explain what a long tailed distribution is and provide three examples of relevant phenomena that have long tails. Why are they important in classification and prediction problems?
  6. What is latent semantic indexing? What is it used for? What are the specific limitations of the method?
  7. What is the Central Limit Theorem? Explain it. Why is it important? When does it fail to hold?
  8. What is statistical power?
  9. Explain what resampling methods are and why they are useful. Also explain their limitations.
  10. Explain the differences between artificial neural networks with softmax activation, logistic regression, and the maximum entropy classifier.
  11. Explain selection bias (with regards to a dataset, not variable selection). Why is it important? How can data management procedures such as missing data handling make it worse?
  12. Provide a simple example of how an experimental design can help answer a question about behavior. For instance, explain how an experimental design can be used to optimize a web page. How does experimental data contrast with observational data?
  13. Explain the difference between "long" and "wide" format data. Why would you use one or the other?
  14. Is mean imputation of missing data acceptable practice? Why or why not?
  15. Explain Edward Tufte's concept of "chart junk." 
  16. What is an outlier? Explain how you might screen for outliers and what you would do if you found them in your dataset. Also, explain what an inlier is and how you might screen for them and what you would do if you found them in your dataset.
  17. What is principal components analysis (PCA)? Explain the sorts of problems you would use PCA for. Also explain its limitations as a method.
  18. You have data on the duration of calls to a call center. Generate a plan for how you would code and analyze these data. Explain a plausible scenario for what the distribution of these durations might look like. How could you test (even graphically) whether your expectations are borne out?
  19. Explain what a false positive and a false negative are. Why is it important to differentiate these from each other? Provide examples of situations where (1) false positives are more important than false negatives, (2) false negatives are more important than false positives, and (3) these two types of errors are about equally important.
  20. Explain likely differences encountered between administrative datasets and datasets gathered from experimental studies. What are likely problems encountered with administrative data? How do experimental methods help alleviate these problems? What problems do they bring?
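
For question 1, here is a minimal sketch contrasting ridge and LASSO in Python (scikit-learn; the synthetic data and penalty strengths are illustrative assumptions, not tuned values):

```python
# Ridge (L2) vs. LASSO (L1) regularization sketch (question 1).
# Synthetic data and alpha values are illustrative only.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)  # shrinks all coefficients toward zero
lasso = Lasso(alpha=1.0).fit(X, y)  # drives many coefficients exactly to zero

print("non-zero ridge coefficients:", np.sum(ridge.coef_ != 0))  # typically all 50
print("non-zero lasso coefficients:", np.sum(lasso.coef_ != 0))  # typically far fewer
```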

3. Kavita Ganesan offered this answer:

  1. What is a gold standard?
    Believe it or not, there are data scientists (even at very senior levels) who claim to know a whole lot about supervised machine learning, yet have no idea what a gold standard is!
  2. What is the difference between supervised learning and unsupervised learning? - Give concrete examples.
  3. What does NLP stand for?
    Some data scientists claim to also do NLP.  
  4. Write code to count the number of words in a document using any programming language. Now, extend this for bi-grams. (See the sketch after this list.)
    I have seen a senior-level data scientist who actually struggled to implement this.
  5. What are feature vectors?
  6. When would you use SVMs vs. Random Forests, and why?
  7. What is your definition of Big Data, and what is the largest size of data you have worked with? Did you parallelize your code?
    If their notion of Big Data is just volume, you may have a problem: Big Data is about more than volume alone. If the largest dataset they have worked with is 5 MB, again, you may have a problem.
  8. How do you work with large data sets?
    If the only answer that comes out is Hadoop, it clearly shows that their view of solving problems is extremely narrow. Large data problems can be solved with:
    1. efficient algorithms
    2. multi-threaded applications
    3. distributed programming
    4. more...
  9. Write a mapper function to count word frequencies (even if it's just pseudocode).
  10. Write a reducer function for counting word frequencies (even if it's just pseudocode). (A combined sketch for questions 4, 9, and 10 follows this list.)
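
For questions 4, 9, and 10, here is a minimal Python sketch. The tokenization is deliberately naive, and the mapper/reducer are plain-Python stand-ins for a real MapReduce framework, not a Hadoop implementation:

```python
# Word and bigram counts, plus a toy mapper/reducer (questions 4, 9, 10).
from collections import Counter
from itertools import groupby

def tokenize(text):
    return text.lower().split()  # naive whitespace tokenization

def word_counts(text):
    return Counter(tokenize(text))

def bigram_counts(text):
    tokens = tokenize(text)
    return Counter(zip(tokens, tokens[1:]))

# Toy MapReduce: the mapper emits (word, 1) pairs, the reducer sums per key.
def mapper(text):
    for token in tokenize(text):
        yield (token, 1)

def reducer(pairs):
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    doc = "the quick brown fox jumps over the lazy dog the fox"
    print(word_counts(doc).most_common(3))
    print(bigram_counts(doc).most_common(3))
    print(dict(reducer(mapper(doc))))
```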

You can find the original contribution here.

Comment by Dalila Benachenhou on July 7, 2016 at 7:26am
  • How do you clean data?

Are we talking about unstructured (text) data or structured data?

  • Explain what a local optimum is and why it is important in a specific context, such as k-means clustering. What can be done to avoid local optima? Change the starting conditions (change the choice of the initial centroids); a sketch follows.
  • What are specific ways for determining if you have a local optimum problem? Analyze the within-cluster distances between elements and their centroid, and the distances between centroids (don't choose your K a priori).
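
Below is a minimal sketch of the multiple-restart idea in Python (scikit-learn; the data and the choice of k are placeholders):

```python
# Mitigating k-means local optima with multiple random restarts.
# The data and the choice of k are placeholders.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# n_init=20 reruns k-means from 20 different initial centroid sets and
# keeps the run with the lowest within-cluster sum of squares (inertia).
km = KMeans(n_clusters=4, n_init=20, random_state=0).fit(X)
print("best within-cluster sum of squares:", km.inertia_)
```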

What is your definition of Big Data?

In my opinion, one cannot call population income a Big Data problem. You may have incomes for 24 million people, but you only need a small sample to estimate the income distribution and compute summary statistics. It may be an IT Big Data problem, but it only requires Small Data analysis.

In some cases, you may need to use Big Data technology to build an ML model, but after the model is built and put in production, that need goes away.

Comment by Dalila Benachenhou on July 7, 2016 at 6:51am
  1. Is it better to design robust or accurate algorithms? This is the same trade-off a computer scientist or engineer deals with. In both cases you strive to achieve both, but you can always design a robust algorithm, whereas you cannot always design an accurate one, or may not even need to. If the purpose is to understand the relationship between predictors and response variables (for instance, to profile potential clients), the accuracy of the algorithm is not important. On the other hand, for critical applications, such as a helicopter that lands itself or a rice cooker that cooks rice to perfection, accuracy is crucial.
Comment by Xavier Sumba on January 12, 2016 at 12:24pm

For those who are interested in interview questions, there is a GitHub page [1] with lists of interview questions. I have added a new category for Data Science. You can look at the other questions if you are interested.

[1] https://github.com/MaximAbramchuck/awesome-interviews

Comment by Matei Beremski on January 11, 2016 at 12:15pm

I was referring to yours :) Stark difference between your questions and those on Quora. Interesting to see that the focus of the data science community is still very much on modelling and algorithm choice.

Comment by Sione Palu on January 11, 2016 at 12:05pm

Question 14: Is mean imputation of missing data acceptable practice? Why or why not?

For regression tasks on temporal data, imputation is necessary. My favorite imputation pre-processing step in time-series analysis is to impute via Kalman-filter smoothing. Once that is done, other steps such as noise removal can be applied, and forecasting can follow. (A sketch follows.)
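
Here is a rough sketch of that idea in Python, using a local-level state-space model from statsmodels as the Kalman smoother (the toy series and the model choice are illustrative assumptions, not the commenter's exact setup):

```python
# Imputing missing time-series values via Kalman smoothing of a
# local-level model. The series and model choice are illustrative only.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=200))  # toy random-walk series
y[50:60] = np.nan                    # simulate a gap of missing values

model = sm.tsa.UnobservedComponents(y, level="local level")
result = model.fit(disp=False)

smoothed = result.smoothed_state[0]  # Kalman-smoothed level estimate
y_imputed = np.where(np.isnan(y), smoothed, y)
```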

Comment by Vincent Granville on January 11, 2016 at 9:29am

I'd love to provide the answers, at least to my questions. Finding time is difficult. 

Comment by Matei Beremski on January 11, 2016 at 2:56am

Thank you for sharing this. Great Questions. 

If you have time Vincent, it would also be great to see what you think make good responses!
