Have you ever tried a website’s keyword search and been unsatisfied with the accuracy of the results? Do you find yourself frustrated and leaving when the search doesn’t return what you’re looking for, or returns it so far down the list that you never find it? Even worse, do you find yourself assuming that what you’re looking for must not exist on that site, only to find the item on that exact same site through other channels like Google search or ads?
If so, you’ve just experienced bad search relevancy. It’s something we all experience daily: a frustration for users and a lost opportunity for the sites attempting to serve us.
As you might have recognized by now, search relevancy is very important both for retaining customers and for giving them a good user experience.
Suppose you have an ecommerce website with millions of products. More often than not you will use a text-based search engine like Solr or Elasticsearch (I will not get into the merits of using Solr or Elasticsearch in this article). For simplicity’s sake, assume that a product has 4 fields: name, category, brand and description, and that we search on the name, category and brand fields.
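Suppose we have 2 entries in the product index:
Name : Iphone, Category : Mobiles, Brand : Apple, Description : 16 GB etc
Name : Iphone Cover, Category : Accessories, Description : 1 mm thickness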
Say our backend converts the query as below.
Search text : IPhone
Solr Query : name:iphone OR brand:iphone OR category:iphone
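The conversion from search text to query can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual backend: the function name build_query and the default field list are assumptions, and a real implementation would also have to escape special characters and handle multi-word input.

```python
def build_query(search_text, fields=("name", "brand", "category")):
    """Turn raw search text into a Solr-style OR query across the given fields.

    Hypothetical helper for illustration; no escaping of Solr special
    characters is done here.
    """
    term = search_text.strip().lower()
    # One "field:term" clause per searchable field, joined with OR.
    return " OR ".join(f"{field}:{term}" for field in fields)


print(build_query("IPhone"))
```

For the search text IPhone this prints name:iphone OR brand:iphone OR category:iphone, the same shape as the query shown above.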
Common sense says that when a user searches for iphone, the first result should be the iPhone itself, from the Mobiles category with brand Apple, followed by the iPhone cover from the Accessories category.
Search engines like Solr or Elasticsearch are simply sophisticated text-matching systems. These tools can tell you when a search word matches a word in a document, but they aren’t nearly as smart as a human. Once a match is determined, a search engine can use statistics about the relative frequency of that word to give each search result a relevancy score. This relevancy score determines the order of the results. In the above case, the keyword iphone does not match the category, brand or description of either entry. In the name field, each entry has exactly one match. Hence it’s possible that Iphone Cover will come first, ahead of the iPhone itself.
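To make the point concrete, here is a toy Python sketch that only checks which fields contain the search word. It is not how Solr computes scores; the product dictionaries are assumed from the example in this article (the cover’s brand is not given, so it is left empty), and matching_fields is a hypothetical helper.

```python
# Two example products, assumed from the article's scenario.
products = [
    {"name": "Iphone", "category": "Mobiles", "brand": "Apple",
     "description": "16 GB etc"},
    # The cover's brand is not stated in the article, so it is left empty.
    {"name": "Iphone Cover", "category": "Accessories", "brand": "",
     "description": "1 mm thickness"},
]


def matching_fields(product, term, fields=("name", "brand", "category")):
    """Return the searched fields in which the term appears as a whole word."""
    term = term.lower()
    return [f for f in fields if term in product.get(f, "").lower().split()]


for product in products:
    print(product["name"], matching_fields(product, "iphone"))
```

Both products match only on name, so nothing in the query itself tells the engine that the phone is the better answer. The final order is left to field statistics rather than to what the user meant.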
When a user searches for IPhone, he means (more often than not) the iPhone mobile. His search should translate into:
Brand : Apple
Name : Iphone
We will use Solr’s synonym filter to solve this relevancy problem.
In schema.xml we create a field called brand whose type is text_synonyms_brand.
<field name="brand" type="text_synonyms_brand" indexed="true" stored="true" default="" />
And we create a fieldType like this
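<fieldType name="text_synonyms_brand" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Assumption: the tokenizer line is missing from the original listing; WhitespaceTokenizerFactory is only a placeholder -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Assumption: the filter class is inferred from the attribute names below -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" splitOnNumerics="0" catenateWords="0" catenateNumbers="0" catenateAll="1"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_brand.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>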
Note the synonyms attribute, which points to synonyms_brand.txt. In that file you can make an entry such as
iphone => Apple
Similarly, create another field called category with the type text_synonyms_category, and a corresponding fieldType named text_synonyms_category whose synonyms attribute points to synonyms_category.txt. In synonyms_category.txt make an entry such as
iphone => Mobiles
When the user searches for Iphone, the backend converts it into the Solr query
name:iphone OR brand:iphone OR category:iphone
With our synonyms, it will be rewritten at query time to
name:iphone OR brand:Apple OR category:Mobiles
which will give you more relevant results.
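The effect of the per-field synonym files can be imitated in plain Python. This is a rough sketch of the idea and does not call Solr. The two dictionaries stand in for synonyms_brand.txt and synonyms_category.txt, and their contents (iphone mapped to Apple and to Mobiles) are assumptions taken from the rewritten query above. The function name rewrite_query is hypothetical.

```python
# Stand-ins for the per-field synonym files (assumed contents).
BRAND_SYNONYMS = {"iphone": "Apple"}
CATEGORY_SYNONYMS = {"iphone": "Mobiles"}

# Only brand and category have a synonym mapping; name is left as typed.
FIELD_SYNONYMS = {"brand": BRAND_SYNONYMS, "category": CATEGORY_SYNONYMS}


def rewrite_query(search_text, fields=("name", "brand", "category")):
    """Apply each field's synonym map to the search term and build the query."""
    term = search_text.strip().lower()
    clauses = []
    for field in fields:
        # Fall back to the original term when the field has no mapping for it.
        mapped = FIELD_SYNONYMS.get(field, {}).get(term, term)
        clauses.append(f"{field}:{mapped}")
    return " OR ".join(clauses)


print(rewrite_query("Iphone"))
```

The same keyword now means something different in each field: a plain text match on name, the brand Apple, and the category Mobiles. The phone matches on all three clauses, while the cover still matches only on name.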
Query Time Boosts: This is a very useful way to give matches in one field a higher score than matches in other fields. For example, a keyword match in brand can carry more weight in the results than a match in name. In Solr this is done by appending ^ and a boost factor to a query term.
Phrase Matching: Boosting (giving a higher score to) documents in which the entire search string appears as a phrase.
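Both ideas can be combined when the backend builds the query string. The Python sketch below is only an illustration: boosted_query is a hypothetical helper and the boost values are arbitrary assumptions, not recommendations. It relies only on the standard Solr query syntax of term^boost and "quoted phrase"^boost, and it does not escape special characters.

```python
def boosted_query(search_text, field_boosts, phrase_field=None, phrase_boost=None):
    """Build an OR query with per-field boosts and an optional phrase boost.

    field_boosts maps a field name to its boost factor, e.g. {"brand": 3}.
    """
    terms = search_text.strip().lower().split()
    clauses = []
    for field, boost in field_boosts.items():
        for term in terms:
            # field:term^boost raises the score of matches in this field.
            clauses.append(f"{field}:{term}^{boost}")
    # Reward documents that contain the whole search string as a phrase.
    if phrase_field and phrase_boost and len(terms) > 1:
        phrase = " ".join(terms)
        clauses.append(f'{phrase_field}:"{phrase}"^{phrase_boost}')
    return " OR ".join(clauses)


print(boosted_query("iphone cover", {"name": 1, "brand": 3},
                    phrase_field="name", phrase_boost=5))
```

For the search text iphone cover this prints name:iphone^1 OR name:cover^1 OR brand:iphone^3 OR brand:cover^3 OR name:"iphone cover"^5, so a product whose name contains the exact phrase gets the largest lift.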