Have you ever done a Google search for mining data? It returns the same results as for data mining. Yet these are two very different keywords: mining data usually means data about mining. And if you search for data about mining you still get the same results anyway.
Yet Google has one of the best search algorithms. Imagine an e-store selling products, allowing users to search for products via a catalog powered with search capabilities, but returning irrelevant results 20% of the time. What a loss of money! Indeed, if you were an investor looking on Amazon to purchase a report on mining data, all you will find are books on data mining and you won't buy anything: possibly a $500 loss for Amazon. Repeat this million times a year, and the opportunity cost is in billions of dollars.
There are a few issues that make this problem difficult to fix. While the problem is straightforward for decision makers, CTO's or CEO's to notice, understand and assess the opportunity cost (just run 200 high value random search queries, see how many return irrelevant results), the communication between the analytic teams and business people is faulty: there is a short somewhere.
There might be multiple analytics teams working as silos - computer scientists, statisticians, engineers - sometimes aggressively defending their own turfs and having conflicting opinions. What the decision makers eventually hears is a lot of noise and lots of technicalities, and they don't know how to start, how much it will cost to fix it, and how complex the issue is, and who should fix it.
Here I discuss the solution and explain it in very simple terms, to help any business having a search engine and an analytic team, easily fix the issue.
Any search query entered by a user is processed using the following steps:
If you search on Google for data mine, you actually get the same results as if you searched for mining data or data mining. If you search for IT jobs, you get all jobs. In my case, all the search results for IT jobs are jobs in Issaquah, WA (where I live). Google correctly identified that I live in Issaquah (via my IP address), but wrongly assumed that I am interested in local jobs, and forgot that I'm only interested in IT jobs. The first job ad I clicked on was for insurance agent.
Why is it so difficult to fix?
The teams involved in steps 1-8 (above) are all different teams or people working independently. The decision makers don't even know who to talk to. Sometimes because of terminology: scientists use words such as n-grams to discuss step 4, and mention computational complexity as the bottleneck preventing step 4 from being improved. They will tell decision makers that a query involving 3 tokens has 6 n-grams, one involving 4 tokens has 24 n-grams, one with 6 tokens has 720 n-grams, and if we need to consider all combinations of tokens in the keyword index, the index will EXPLODE. Maybe the decision maker will say OK, let's buy tons of servers and a clustered architecture (Hadoop).
If you are not familiar with the terminology, a token is a term in a user query. The keyword car insurance Alabama has 3 tokens. N-grams are combinations of these tokens, for instance insurance Alabama car or Alabama insurance car.
When scientists claim that the keyword index will explode, they are wrong. They are wrong because they think like scientists, not like business people. In practice, how many queries have more than three tokens? Very few. And for queries with three or four tokens, very few n-grams have decent traffic - the vast majority can be ignored and replaced by the dominant n-gram (in terms of traffic). For instance, people sometimes search for car insurance but rarely for insurance car. So having only one entry for these two queries make sense. But for data mining and mining data, it makes sense to keep two separate entries. So instead of a table explosion by a factor 1,000, in reality we see an increase in size of about 50% - perfectly manageable. When users search for mining data, the algorithm should first check whether there is one entry for this specific n-gram (there should be one), and if not, then proceed to rearrange tokens in alphabetical order, that is, look for data mining in the keyword index. Note that the keyword index should contain keywords up to 4 tokens long, no more.
Another fix is about step 2. Sometimes, for some words (like mining or booking) you can not stem (replace them by mine or book). You need to have a look-up table of the top 10,000 words ending with -ing that can not be stemmed. Same for step 3: have a look-up table of top 10,000 combinations involving a stop word, where stop word can not be removed: IT job would be in the look-up table in question, attached to stop word it. As a rule of thumb, don't remove stop words in queries consisting of only two tokens.
What is data science?
The issue and solution proposed here is exactly why the data science position was created for: to bridge the gap between statisticians, business people, computer scientists and engineers, frequently to deal with big data problem as in this example. Note that in order to come with a solution like mine, the data scientist must have solid domain expertise, in this case, in search technology.
Job interview question: Should removing stop words be Step 1 rather than Step 3? Answer: Have you thought about the fact that mine and yours could also be stop words? So in a bad implementation, data mining would become data mine after stemming, then data. In practice, you remove stop words before stemming. So Step 3 should indeed become step 1.