Subscribe to DSC Newsletter

Building better search tools: problems and solutions

Have you ever done a Google search for mining data? It returns the same results as for data mining. Yet these are two very different keywords: mining data usually means data about mining. And if you search for data about mining you still get the same results anyway.

Yet Google has one of the best search algorithms. Imagine an e-store selling products, allowing users to search for products via a catalog powered with search capabilities, but returning irrelevant results 20% of the time. What a loss of money! Indeed, if you were an investor looking on Amazon to purchase a report on mining data, all you will find are books on data mining and you won't buy anything: possibly a $500 loss for Amazon. Repeat this million times a year, and the opportunity cost is in billions of dollars.

There are a few issues that make this problem difficult to fix. While the problem is straightforward for decision makers, CTO's or CEO's to notice, understand and assess the opportunity cost (just run 200 high value random search queries, see how many return irrelevant results), the communication between the analytic teams and business people is faulty: there is a short somewhere.

There might be multiple analytics teams working as silos - computer scientists, statisticians, engineers - sometimes aggressively defending their own turfs and having conflicting opinions. What the decision makers eventually hears is a lot of noise and lots of technicalities, and they don't know how to start, how much it will cost to fix it, and how complex the issue is, and who should fix it.

Here I discuss the solution and explain it in very simple terms, to help any business having a search engine and an analytic team, easily fix the issue.

Any search query entered by a user is processed using the following steps:

  1. The user query is cleaned: typos are removed
  2. It is then stemmed: plural replaced by singular, verbs (-ing form) replaced by nouns (thus mining becomes mine), etc.
  3. Stop words are removed: the, or, and, of, from, about, it, with etc. For instance IT jobs becomes jobs, data about mining becomes data mining and because of step 2, data mine.
  4. The query is normalized: the tokens are rearranged in alphabetical order: mine data becomes data mine.
  5. Other algorithms are involved to detect if the query contains a city, a book title, a song, a product, a movie, using look-up tables.
  6. Other algorithms are involved to detect whether two tokens must be attached: San Francisco is one token, not two, even though it looks like two.
  7. Synonym detection. Maybe automobile can be replaced by car, to reduce the size of the indexes.
  8. Algorithm to detect user intention (sometimes based on user profile), for queries spanning across multiple categories.

If you search on Google for data mine, you actually get the same results as if you searched for mining data or data mining. If you search for IT jobs, you get all jobs. In my case, all the search results for IT jobs are jobs in Issaquah, WA (where I live). Google correctly identified that I live in Issaquah (via my IP address), but wrongly assumed that I am interested in local jobs, and forgot that I'm only interested in IT jobs. The first job ad I clicked on was for insurance agent.

Why is it so difficult to fix?

The teams involved in steps 1-8 (above) are all different teams or people working independently. The decision makers don't even know who to talk to. Sometimes because of terminology: scientists use words such as n-grams to discuss step 4, and mention computational complexity as the bottleneck preventing step 4 from being improved. They will tell decision makers that a query involving 3 tokens has 6 n-grams, one involving 4 tokens has 24 n-grams, one with 6 tokens has 720 n-grams, and if we need to consider all combinations of tokens in the keyword index, the index will EXPLODE. Maybe the decision maker will say OK, let's buy tons of servers and a clustered architecture (Hadoop).

If you are not familiar with the terminology, a token is a term in a user query. The keyword car insurance Alabama has 3 tokens. N-grams are combinations of these tokens, for instance insurance Alabama car or Alabama insurance car.

When scientists claim that the keyword index will explode, they are wrong. They are wrong because they think like scientists, not like business people. In practice, how many queries have more than three tokens? Very few. And for queries with three or four tokens, very few n-grams have decent traffic - the vast majority can be ignored and replaced by the dominant n-gram (in terms of traffic). For instance, people sometimes search for car insurance but rarely for insurance car. So having only one entry for these two queries make sense. But for data mining and mining data, it makes sense to keep two separate entries. So instead of a table explosion by a factor 1,000, in reality we see an increase in size of about 50% - perfectly manageable. When users search for mining data, the algorithm should first check whether there is one entry for this specific n-gram (there should be one), and if not, then proceed to rearrange tokens in alphabetical order, that is, look for data mining in the keyword index. Note that the keyword index should contain keywords up to 4 tokens long, no more.

Another fix is about step 2. Sometimes, for some words (like mining or booking) you can not stem (replace them by mine or book). You need to have a look-up table of the top 10,000 words ending with -ing that can not be stemmed. Same for step 3: have a look-up table of top 10,000 combinations involving a stop word, where stop word can not be removed: IT job would be in the look-up table in question, attached to stop word it. As a rule of thumb, don't remove stop words in queries consisting of only two tokens.

What is data science?

The issue and solution proposed here is exactly why the data science position was created for: to bridge the gap between statisticians, business people, computer scientists and engineers, frequently to deal with big data problem as in this example. Note that in order to come with a solution like mine, the data scientist must have solid domain expertise, in this case, in search technology.

Job interview question: Should removing stop words be Step 1 rather than Step 3? Answer: Have you thought about the fact that mine and yours could also be stop words? So in a bad implementation, data mining would become data mine after stemming, then data. In practice, you remove stop words before stemming. So Step 3 should indeed become step 1. 

Related articles

Views: 4734


You need to be a member of Data Science Central to add comments!

Join Data Science Central

Comment by Pramod kumar sharma on September 14, 2016 at 6:44pm
Thanks for sharing
Comment by Chintan Donda on November 12, 2015 at 1:20am

Informative post !!

Thanks for sharing.

Comment by Fazil Sheikh on September 16, 2013 at 11:09pm

Very Good Post ! Got Greater Insight as well enjoyed the reading ! Nice work Mr. Vincent !

Comment by David Inbar on September 15, 2013 at 6:07am

Vincent - thanks for opening up yet another thorny topic!  If we want more "intelligent" search results, why not engage the user to clarify their intent rather than rely on (ultimately limited) algorithms?   So, for the mining data example, why not prompt the user to select from data about mining or data mining and then constrain the results?

Comment by Vincent Granville on September 13, 2013 at 11:03am

@BR: Stories (e.g. articles) are easy to detect via step 5. Also, by 4 tokens, I mean after cleaning. For instance, Chinese Restaurants in San Francisco is indeed 3 tokens: Chinese, Restaurant, San_Francisco. In queries with 5 tokens, the token with highest frequency might be dropped, although that would require a specific algorithm.

Comment by BR Deshpande on September 13, 2013 at 10:45am

Vincent - i am not so sure of the assumption that "In practice, how many queries have more than three tokens? Very few." Avinash Kaushik from Google has commented that "people write stories on the google search box". Personally, i never search using just 2 word tokens - whether it is on Google or Amazon. With 4 or 5 tokens the results usually become very unique and relevant.

Comment by Vincent Granville on September 13, 2013 at 4:18am

I think the problem is worse on Amazon than Google (see my comment below) because Amazon has a much smaller inventory of search result pages to choose from. Amazon's inventory is mostly products, e.g. books, while Google's inventory is the entire Internet.

Comment by Vincent Granville on September 12, 2013 at 8:32pm

@Marcel: I can imagine Google benefiting from showing poor results (the default option - that is, 'broad match'). Indeed, it makes users more likely to click on the Google ads, and thus boost Google revenue.

But for Amazon, it does not make sense. Maybe Amazon's search box is powered by Google. If that's the case, then Amazon should configure the default search setting as 'exact match', not 'broad match'. But I tried 'exact match' on "data about mining", and it did not improve the search results on Amazon (it made it even worse!)

So clearly, Amazon has a more severe problem - more severe both from a technical and business point of view.

Comment by Marcel Remon on September 12, 2013 at 8:06pm

If you use 'exact match' in your Google search, you get results slightly better both for "IT jobs", "mining data" and "data about mining". So Google seems to have a solution. I believe the real fix is to make 'exact match' the default option, rather than 'broad match'. How many people know about 'exact match'? Probably less than 1%. Anyway, as you stated, it's more a business than a technical issue.

Comment by Fari Payandeh on September 12, 2013 at 6:10pm

Enjoyed reading the post. Very true!

Follow Us


  • Add Videos
  • View All


© 2017   Data Science Central   Powered by

Badges  |  Report an Issue  |  Privacy Policy  |  Terms of Service