We just started in this article to provide answers to one of the largest collection of data science job interview questions ever published, and we will continue to add answers to most of these questions. Some answers link to solutions offered in my Wiley data science book: you can find this book here. The 91 job interview questions were originally published here with no answers, and we recently added 50 questions to identify a true data scientist, in this article. Over time, we will add more questions and more answers. So, bookmark this page!
Other lists of Q&A for data science job interviews (including about R) can be found here.
- What are lift, KPI, robustness, model fitting, design of experiments, and the 80/20 rule? Answer: KPI stands for Key Performance Indicator, or metric, sometimes called feature. A robust model is one that is not sensitive to changes in the data. Design of experiments or experimental design is the initial process used (before data is collected) to split your data, sample and set up a data set for statistical analysis, for instance in A/B testing frameworks or clinical trials. The 80/20 rules means that 80 percent of your income (or results) comes from 20 percent of your clients (or efforts).
- What are collaborative filtering, n-grams, Map Reduce, and cosine distance? Answers: Collaborative filtering/recommendation engines are a set of techniques where recommendations are based on what your friends like, used in social context networks. N-grams are token permutations associated with a keyword, for instance “car insurance California,” “California car insurance,” “insurance California car,” and so on. Map Reduce is a framework to process large data sets, splitting them into subsets, processing each subset on a different server, and then blending results obtained on each. Cosine distance measures how close two sentences are to each other, by counting the number of terms that their share. It does not take into account synonyms, so it is not a precise metric in NLP (natural language processing) contexts. All of these problems are typically solved with Hadoop/Map-Reduce, and the reader should check the index to find and read our various discussions on Hadoop and Map-Reduce in this book.
- What is probabilistic merging (aka fuzzy merging)? Is it easier to handle with SQL or other languages? Which languages would you choose for semi-structured text data reconciliation? Answers: You do a join on two tables A and B, but the keys (for joining) are not compatible. For instance, on A the key is first name/lastname in some char set; in B another char set is used. Data is sometimes missing in A or B. Usually, scripting languages (Python and Perl) are better than SQL to handle this.
- Toad or Brio or any other similar clients are quite inefficient to query Oracle databases. Why? What would you do to increase speed by a factor 10 and be able to handle far bigger outputs?
- What are hash table collisions? How are they avoided? How frequently do they happen?
- How can you make sure a Map Reduce application has good load balance? What is load balance?
- Is it better to have 100 small hash tables or one big hash table in memory, in terms of access speed (assuming both fit within RAM)? What do you think about in-database analytics?
- Why is Naive Bayes so bad? How would you improve a spam detection algorithm that uses Naïve Bayes? Answer: It assumes features are not correlated and thus penalize heavily observations that trigger many features in the context of fraud detection. Solution: use a solution such as hidden decision trees (described in this book) or decorrelate your features.
- What is star schema? What are lookup tables? Answer: The star schema is a traditional database schema with a central (fact) table (the “observations”, with database “keys” for joining with satellite tables, and with several fields encoded as ID’s). Satellite tables map ID’s to physical name or description and can be be “joined” to the central, fact table using the ID fields; these tables are known as lookup tables, and are particularly useful in real-time applications, as they save a lot of memory. Sometimes star schemas involve multiple layers of summarization (summary tables, from granular to less granular) to retrieve information faster.
- Can you perform logistic regression with Excel and if so, how, and would the result be good? Answers: Yes; by using linest on log-transformed data.
- Define quality assurance, six sigma, and design of experiments. Give examples of good and bad designs of experiments.
- What are the drawbacks of general linear model? Are you familiar with alternatives (Lasso, ridge regression, and boosted trees)?
- Do you think 50 small decision trees are better than 1 large one? Why? Answer: Yes, 50 create a more robust model (less subject to over-fitting) and easier to interpret.
- Give examples of data that does not have a Gaussian distribution or log-normal. Give examples of data that has a chaotic distribution. Answer: Salary distribution is a typical examples. Multimodal data (modeled by a mixture distribution) is another example.
- Why is mean square error a bad measure of model performance? What would you suggest instead? Answer: It puts too much emphasis on large deviations. Instead, use the new correlation coefficient described in chapter 4.
- What is a cron job? Answer: A scheduled task that runs in the background on a server.
- What is an efficiency curve? What are its drawbacks, and how can they be overcome?
- What is a recommendation engine? How does it work?
- What is an exact test? How and when can simulations help when you do not use an exact test?
- What is the computational complexity of a good, fast clustering algorithm? What is a good clustering algorithm? How do you determine the number of clusters? How would you perform clustering on 1 million unique keywords, assuming you have 10 million data points—each one consisting of two keywords, and a metric measuring how similar these two keywords are? How would you create this 10 million data points table? Answer: Found in Chapter 2, “Why Big Data Is Different.”
- Should removing stop words be Step 1 rather than Step 3, in the search engine algorithm described in the section “Big Data Problem That Epitomizes the Challenges of Data Science?” Answer: Have you thought about the fact that mine and yours could also be stop words? So in a bad implementation, data mining would become data mine after stemming and then data. In practice, you remove stop words before stemming. So Step 3 should indeed become step 1.
- Should click data be handled in real time? Why? In which contexts? Answer: In many cases, real-time data processing is not necessary and even undesirable as end-of-day algorithms are usually much more powerful. However, real-time processing is critical for fraud detection, as it can deplete advertiser daily budgets in a few minutes.
- What is better: good data or good models? And how do you define good? Is there a universal good model? Are there any models that are definitely not so good? Answer: There is no universal model that is perfect in all situations. Generally speaking, good data is better than good models, as model performance can easily been increased by model testing and cross-validation. Models that are “too good to be true” usually are - they might perform very well on test data (data used to build the model) because of over-fitting, and very badly outside the test data.
- How do you handle missing data? What imputation techniques do you recommend?
- Compare SAS, R, Python, and Perl. Answer: Python is the new scripting language on the block with many libraries constantly added (Pandas for data science), Perl is the old one (still great and the best - very flexible - if you don’t work in a team and don’t have to share your code). R has great visualization capabilities; also R libraries are available to handle data that is too big to fit in memory. SAS is used in a lot of traditional statistical applications; it has a lot of built-in functions, and can handle large data sets pretty past; the most recent versions of SAS have neat programming features and structures such as fast sort and hash tables. Sometimes you must use SAS because that’s what your client is using.
- What is the curse of big data? (See section “The curse of big data” in chapter 2).
- What are examples in which Map Reduce does not work? What are examples in which it works well? What is the security issues involved with the cloud? What do you think of EMC's solution offering a hybrid approach—to both an internal and external cloud—to mitigate the risks and offer other advantages (which ones)? The reader will find more information about key players in this market, in the section “The Big Data Ecosystem” in chapter 2, and especially in the online references (links) provided in this section.
- What is sensitivity analysis? Is it better to have low sensitivity (that is, great robustness) and low predictive power, or vice versa? How to perform good cross-validation? What do you think about the idea of injecting noise in your data set to test the sensitivity of your models?
- Compare logistic regression with decision trees and neural networks. How have these technologies been vastly improved over the last 15 years?
- Is actuarial science a branch of statistics (survival analysis)? If not, how so? You can find more information about actuarial sciences at http://en.wikipedia.org/wiki/Actuarial_science and also at http://www.dwsimpson.com/ (job ads, salary surveys).
- What is root cause analysis? How can you identify a cause versus a correlation? Give examples..Answer: in Chapter 5, “Data Science craftsmanship: Part II.”
- How would you define and measure the predictive power of a metric? Answer: in Chapter 6.
- Is it better to have too many false positives, or too many false negatives?
- Have you ever thought about creating a start-up? Around which idea / concept?
- Do you think that a typed login/password will disappear? How could they be replaced?
- What do you think makes a good data scientist?
- Do you think data science is an art or a science?
- Give a few examples of best practices in data science.
- What could make a chart misleading, or difficult to read or interpret? What features should a useful chart have?
- Do you know a few rules of thumb used in statistical or computer science? Or in business analytics?
- What are your top five predictions for the next 20 years?
- How do you immediately know when statistics published in an article (for example, newspaper) are either wrong or presented to support the author's point of view, rather than correct, comprehensive factual information on a specific subject? For instance, what do you think about the official monthly unemployment statistics regularly discussed in the press? What could make them more accurate?
- In your opinion, what is data science? Machine learning? Data mining?
Questions About Data Science Projects
- How can you detect individual paid accounts shared by multiple users? This is a big problem for publishers, digital newspapers, software developers offering API access, the music / movie industry (file sharing issues) and organizations offering monthly paid access (flat fee not based on the numbers of accesses or downloads) to single users, to view or download content, as it results in revenue losses.
- How can you optimize a web crawler to run much faster, extract better information, and better summarize data to produce cleaner databases? Answer: Putting a time-out threshold to less than 2 seconds, extracting no more than the first 20 KB of each pages, and not revisiting pages already crawled (that is, avoid recurrence in your algorithm) are good starting points.
- How would you come up with a solution to identify plagiarism?
- You are about to send 1 million e-mails (marketing campaign). How do you optimize delivery? How do you optimize response? Can you optimize both separately? Answer: Not really.
- How would you turn unstructured data into structured data? Is it really necessary? Is it OK to store data as flat text files rather than in a SQL-powered RDBMS?
- How would you build nonparametric confidence intervals, for scores?
- How can you detect the best rule set for a fraud detection scoring technology? How can you deal with rule redundancy, rule discovery, and the combinatorial nature of the problem (for finding optimum rule set—the one with best predictive power)? Can an approximate solution to the rule set problem be OK? How would you find an OK approximate solution? How would you decide it is good enough and stop looking for a better one?
- How can you create keyword taxonomies?
- What is a botnet? How can it be detected?
- How does Zillow's algorithm work to estimate the value of any home in the United States?
- How can you detect bogus reviews or bogus Facebook accounts used for bad purposes?
- How would you create a new anonymous digital currency? Please focus on the aspects of security - how do you protect this currency against Internet pirates? Also, how do you make it easy for stores to accept it? For instance, each transaction has a unique ID used only once and expiring after a few days; and the space for unique ID’s is very large to prevent hackers from creating valid ID’s just by chance.
- You design a robust nonparametric statistic (metric) to replace correlation or R square that is independent of sample size, always between –1 and +1 and based on rank statistics. How do you normalize for sample size? Write an algorithm that computes all permutations of n elements. How do you sample permutations (that is, generate tons of random permutations) when n is large, to estimate the asymptotic distribution for your newly created metric? You may use this asymptotic distribution for normalizing your metric. Do you think that an exact theoretical distribution might exist, and therefore, you should find it and use it rather than wasting your time trying to estimate the asymptotic distribution using simulations?
- Here’s a more difficult, technical question related to previous one. There is an obvious one-to-one correspondence between permutations of n elements and integers between 1 and factorial n. Design an algorithm that encodes an integer less than factorial n as a permutation of n elements. What would be the reverse algorithm used to decode a permutation and transform it back into a number? Hint: An intermediate step is to use the factorial number system representation of an integer. You can check this reference online to answer the question. Even better, browse the web to find the full answer to the question. (This will test the candidate's ability to quickly search online and find a solution to a problem without spending hours reinventing the wheel.)
- How many “useful” votes will a Yelp review receive? Answer: Eliminate bogus accounts or competitor reviews. (How to detect them: Use taxonomy to classify users, and location. Two Italian restaurants in same ZZIP code could badmouth each other and write great comments for themselves.) Detect fake likes: Some companies (for example, FanMeNow.com) will charge you to produce fake accounts and fake likes. Eliminate prolific users who like everything; those who hate everything. Have a blacklist of keywords to filter fake reviews. See if an IP address or IP block of a reviewer is in a blacklist such as Stop Forum Spam. Create a honeypot to catch fraudsters. Also watch out for disgruntled employees badmouthing their former employer. Watch out for two or three similar comments posted the same day by three users regarding a company that receives few reviews. Is it a new company? Add more weight to trusted users. (Create a category of trusted users.) Flag all reviews that are identical (or nearly identical) and come from the same IP address or the same user. Create a metric to measure the distance between two pieces of text (reviews). Create a review or reviewer taxonomy. Use hidden decision trees to rate or score review and reviewers.
- Can you estimate and forecast sales for any book, based on Amazon.com public data? Hint: See http://www.fonerbooks.com/surfing.htm.
- This is a question about experimental design (and a bit of computer science) with Legos. Say you purchase two sets of Legos (one set to build a car A and a second set to build another car B). Now assume that the overlap between the two sets is substantial. There are three different ways that you can build the two cars. The first step consists in sorting the pieces (Legos) by color and maybe also by size. The three ways to proceed are (1) Sequentially: Build one car at a time. This is the traditional approach. (2) Semi-parallel system: Sort all the pieces from both sets simultaneously so that the pieces will be blended. Some in the red pile will belong to car A; some will belong to car B. Then build the two cars sequentially, following the instructions in the accompanying leaflets. (3) In parallel: Sort all the pieces from both sets simultaneously, and build the two cars simultaneously, progressing simultaneously with the two sets of instructions. Which is the most efficient way to proceed? Answer: The least efficient is sequential. Why? If you are a good at multitasking, the full parallel approach is best. Why? Note that with the semi-parallel approach, the first car that you build will take less time than the second car (due to easier way to find the pieces that you need because of higher redundancy) and less time than needed if you used the sequential approach (for the same reason). To test these assumptions and help you become familiar with the concept of distributed architecture, you can have your kid build two Lego cars, A and B, in parallel and then two other cars, C and D, sequentially. If the overlap between A and B (the proportion of Lego pieces that are identical in both A and B) is small, then the sequential approach will work best. Another concept that can be introduced is that building an 80-piece car takes more than twice as much time as building a 40-piece car. Why? (The same also applies to puzzles).
- How can you design an algorithm to estimate the number of LinkedIn connections your colleagues have? The number is readily available on LinkedIn for your first-degree connections with up to 500 LinkedIn connections. However, if the number is above 500, it just says 500+. The number of shared connections between you and any of your LinkedIn connections is always available. Answer: A detailed algorithm can be found in the book, in the section “job interview questions”.
- How would proceed to reverse engineer the popular BMP image format and create your own images byte by byte, with a programming language? Answer: The easiest BMP format is the 24-bit format, where 1 pixel is stored on 3 bits, the RGB (red/green/blue) components of the pixel, which determines its pixel. After producing a few sample images, it becomes obvious that the format starts with a 52-bit header. Starts with a blank image. Modify the size. This will help you identify where and how the size is encoded in the 52-bit header. On a test image (for example, 40 * 60 pixels), change the color of one pixel. See what happens to the file: Three bytes have a new value, corresponding to the red, green, and blue components of the color attached to your modified pixel. Now you can understand the format and play with it any way you want with low-level access.
Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge