
Building an end-to-end search engine

“Information is the oil of the 21st century, and analytics is the combustion engine.” – Peter Sondergaard, SVP, Gartner Research

In analytics, we retrieve information from various data sources, which can be structured or unstructured. The biggest challenge is retrieving information from unstructured data, mainly text. This is where machine learning comes into the picture. Different algorithms have been designed on different platforms, but here we will discuss one technique that can be applied in Python.

The process is best explained by example. When we have a query and need to find its answer, or when we require some information from a bulk of documents or websites, we put the query to the system and it returns the options closest to our requirement. This algorithm helps with semantic searching. Semantic search seeks to improve search accuracy by understanding the searcher’s intent and the contextual meaning of terms as they appear in the searchable data environment, whether on the Web or within a closed system, to generate more relevant results. Contextual search works like this: if our first query returns Barack Obama’s wiki page and the second query is “who is his wife”, the result is Michelle Obama’s wiki page. Semantic search is at work when, for the query “big dataset”, web pages mentioning “large datasets” are also returned.

With the continuous advancement of analytics and the variety of work scopes, data mining and research are basic and important tasks, and Information Retrieval is among the better tools for supporting them. Studying data, searching for the files that contain relevant information, and the repetitive nature of these requirements make this algorithm very useful. The chart below explains the algorithm’s process flow.

[Flow chart: process flow of the Information Retrieval algorithm]

When dealing with text data (using this scenario for better understanding), the data is unstructured, so to perform any analytical exercise we convert it into structured form through a process called vectorization. In Python, the sklearn.feature_extraction module can be used to extract features, in a format supported by machine learning algorithms, from datasets consisting of formats such as text and image. It can also transform a count matrix into a normalized tf or tf-idf representation, where tf means term frequency and tf-idf means term frequency times inverse document frequency. This is a common term-weighting scheme in information retrieval. The tf-idf value increases proportionally with the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which adjusts for the fact that some words appear more frequently in general. tf-idf can also be used successfully for stop-word filtering in various subject fields, including text summarization and classification.
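To make the weighting concrete, here is a minimal sketch on a toy two-document corpus (the corpus and variable names are illustrative, not from the original post):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat on the mat",
          "the dog sat on the log"]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(corpus)  # sparse document-term matrix of tf-idf weights

# Terms shared by both documents ('the', 'sat', 'on') receive lower idf,
# while terms unique to one document ('cat', 'dog') are weighted higher.
print(vectorizer.get_feature_names_out())
print(matrix.toarray().round(2))

(On older scikit-learn versions, get_feature_names_out is named get_feature_names.)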

Along with the tf-idf vectorizer we also have techniques like stemming and lemmatization. Documents use different forms of a word, such as organize, organizes, and organizing; additionally, there are families of related words with similar meanings, such as democracy, democratic, and democratization. It is useful for a search for one of these words to return documents that contain another word in the set. Stemming crudely chops word endings by rule, whereas lemmatization uses a vocabulary and morphological analysis to return the dictionary form of a word: confronted with the token ‘saw’, stemming might return just ‘s’, whereas lemmatization would attempt to return either ‘see’ or ‘saw’ depending on whether the token was used as a verb or a noun.
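Neither technique appears in the code below, but a minimal sketch using the NLTK library (an assumption of this example, not part of the original code; it requires the WordNet data to be downloaded via nltk.download) shows the difference:

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["organize", "organizes", "organizing"]:
    # Stemming chops suffixes by fixed rules; the result need not be a real word.
    print(word, "->", stemmer.stem(word))

# Lemmatization consults a vocabulary and the part of speech
# to return a dictionary form (the lemma).
print(lemmatizer.lemmatize("organizing", pos="v"))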

A snippet of Python code is shared below for processing a query and returning the list of documents related or closest to it. It is limited to text matching only, to highlight the backend mechanics of the process, without semantic or contextual search.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer(analyzer='word')
dataname = tf.fit_transform(data['column name'])  # transformed to feature matrix
# data is the dataset
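Note that tf is fitted on the document corpus here; the same fitted vectorizer must later project queries into this feature space via tf.transform, otherwise the query and document vectors would not share a vocabulary.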

Now, on this transformed dataset, we can build a function that applies cosine similarity and indexing to get the results for the query entered. Cosine similarity measures the cosine of the angle between two vectors, sim(A, B) = (A · B) / (‖A‖ ‖B‖), so a document whose tf-idf vector points in nearly the same direction as the query vector scores close to 1:

from sklearn.metrics.pairwise import cosine_similarity

def ret_info(query, results):  # 'ret_info' is a given function name, can be changed
    query_vector = tf.transform([query])  # project the query into the fitted tf-idf space
    similarities = cosine_similarity(query_vector, dataname)
    similarity_list = np.ndarray.tolist(similarities)
    similarities_idx = np.argsort(similarity_list)  # indices sorted by ascending similarity
    similarities_idx = similarities_idx[:, ::-1]    # reverse to descending order
    similar_idx = similarities_idx.tolist()[0][0:results]
    names = []
    for idx in similar_idx:
        names.append(data['name'][idx])
    return names  # return the documents closest to or matching the query
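A hypothetical call (assuming data is a pandas DataFrame with the document text in 'column name' and document titles in 'name') would look like:

top_matches = ret_info("big dataset", 5)
print(top_matches)  # the five documents whose tf-idf vectors are closest to the query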

Hence, with Python, running this algorithm eases the effort of collecting data or information for a specific purpose. The best thing about Information Retrieval is its applicability in various contexts, whether text retrieval, image retrieval, and so on. To explore Information Retrieval based on contextual and semantic features, these links can be useful: https://github.com/josephwilk/semanticpy, http://www.opensemanticsearch.org/dev/enhancer/python

