Document Similarity Analysis Using ElasticSearch and Python

Elasticsearch is an open-source search engine based on Lucene. It is used by market leaders like Wikipedia, LinkedIn, and eBay, and it has an official Python client, elasticsearch-py.

 

You can download Elasticsearch from here. To install, just unzip the downloaded file and run bin/elasticsearch.bat if you are installing on Windows, or bin/elasticsearch on Unix systems. Visit http://localhost:9200/ in your browser to check that it is installed properly.


 

Further information about the installation and setup of Elasticsearch can be found here.

 

The python client can be installed by running

 

pip install elasticsearch
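
A quick sanity check that the client can reach the server (assuming Elasticsearch is running on the default localhost:9200):

import elasticsearch

# Connects to localhost:9200 by default
es = elasticsearch.Elasticsearch()

# ping() returns True when the cluster is reachable
print(es.ping())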


 

The process of generating cosine similarity scores for documents using Elasticsearch involves the following steps:

  1. Creating an index

  2. Index the individual documents

  3. Search and get the matched documents and term vectors for a document

  4. Calculate cosine similarity score using the term vectors


 

Creating an index

 

An index is like a ‘database’ in a relational database: it is a data organization mechanism that lets you partition your data however you like. Elasticsearch also uses the index to decide how to distribute data around the cluster.

 

Unlike databases in an RDBMS, indices are lightweight, so you can create hundreds of them without running into any problems.

 

The following is the code to create an index.

 

import elasticsearch

es = elasticsearch.Elasticsearch()

 

This initializes the Elasticsearch client, and the es.indices.create call actually creates the index. Here we pass the index name and the body parameter, which contains the various settings and mappings that configure the index.

 

def create_or_clear_index(index_name):
    es = elasticsearch.Elasticsearch()

    # Delete the index if one already exists
    try:
        es.indices.delete(index=index_name)
    except elasticsearch.NotFoundError:
        pass

    # Set up a fresh index and mapping, storing term vectors for the
    # fields we will compare later. Note: "string" fields and per-type
    # mappings like "page" target Elasticsearch 1.x/2.x.
    es.indices.create(index=index_name,
                      body={
                          "mappings": {
                              "page": {
                                  "_source": {"enabled": True},
                                  "properties": {
                                      "url": {
                                          "type": "string"
                                      },
                                      "page_text": {
                                          "type": "string",
                                          "term_vector": "yes"
                                      },
                                      "title": {
                                          "type": "string",
                                          "term_vector": "yes"
                                      }
                                  }
                              }
                          }
                      })
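
With that in place, creating (or recreating) an index is a one-liner; the index name "pages" below is just an example:

create_or_clear_index("pages")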




 

Index individual documents

 

The index call adds a JSON document to a named index and hence makes it searchable. Here we pass the index name, the type of the document, and the document itself.


 

def index_the_text(inp):
    # es and idxname are assumed to be defined at module level
    page_title, text_data = inp
    try:
        es.index(index=idxname,
                 doc_type="page",
                 id=page_title,
                 body={
                     "title": page_title,
                     "page_text": text_data
                 })
        print("-" * 10)
    except Exception as e:
        print(e)

 

The above function takes a tuple containing the page title and the page text as its input.

 

Just for the sake of this problem, we assume the title of the document is a unique identifier and index it as the id of the document.
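
As a quick usage sketch (the index name and documents below are illustrative, not part of the original script):

es = elasticsearch.Elasticsearch()
idxname = "pages"
create_or_clear_index(idxname)

# Hypothetical (title, text) pairs; the full script builds these from crawled pages
documents = [
    ("Page One", "text about elasticsearch and search engines"),
    ("Page Two", "text about cosine similarity and term vectors"),
]

for doc in documents:
    index_the_text(doc)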

 

Get the 5 most similar documents for every document

 

To get the most similar documents we use mlt, which stands for “more like this”. This API call returns the documents that are “like” the reference document we pass.

 

mlts = es.mlt(index=index_name, doc_type="page",
              id=doc_id, mlt_fields="page_text",
              search_size=5)

 

The “mlt_fields” parameter specifies the exact fields to perform the query against, and the search_size parameter specifies the number of documents to return. Apart from these, you can also specify stop words and many other parameters.
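
The response has the standard search shape, so the matches sit under hits. A minimal sketch for pulling out the ids of the similar documents (note that the standalone mlt endpoint exists only in older Elasticsearch releases; from 2.0 onwards the same thing is done with a more_like_this query):

# Each hit carries the matched document plus a relevance score
similar_ids = [hit["_id"] for hit in mlts["hits"]["hits"]]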


 

Get the term vectors

tvjson = es.termvector(index=index_name, doc_type="page",
                       id=doc_id)

 

The above call is used to get information and statistics about the terms in the fields of a particular document. We need this data to calculate the cosine similarity score of the documents.

 

To get the term vectors from the statistics returned, we write a function:

 

def get_tv_dict(tvjson):
    # Map each term in the page_text field to its term frequency
    terms = tvjson['term_vectors']['page_text']['terms']
    return dict((k, v['term_freq']) for k, v in terms.items())
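
The result is a plain term-to-frequency dictionary; with hypothetical values it looks like this:

tv = get_tv_dict(tvjson)
# e.g. {'elasticsearch': 12, 'cosine': 3, 'similarity': 5}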

 

Once we get the term vectors for the documents, we can calculate the cosine similarity score.
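
Treating each term-frequency dictionary as a sparse vector, the score is the standard cosine similarity:

cosine(A, B) = (A · B) / (‖A‖ ‖B‖) = Σᵢ AᵢBᵢ / ( √(Σᵢ Aᵢ²) × √(Σᵢ Bᵢ²) )

where Aᵢ and Bᵢ are the frequencies of term i in the two documents. Terms that appear in only one document contribute zero to the dot product, which is why the code below sums over the intersection of the two key sets.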

Given below is the function to calculate the cosine similarity score of two documents from their term vectors.

 

Calculate the cosine similarity score

import math

def get_cosine(vec1, vec2):
    # Only terms present in both documents contribute to the dot product
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)

    # Euclidean norms of the two term-frequency vectors
    sum1 = sum(v ** 2 for v in vec1.values())
    sum2 = sum(v ** 2 for v in vec2.values())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    # Guard against empty documents
    if not denominator:
        return 0.0
    return float(numerator) / denominator
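
A quick sanity check with made-up term frequencies:

v1 = {"data": 2, "science": 1, "search": 3}
v2 = {"data": 1, "search": 2}

# dot product = 2*1 + 3*2 = 8
# norms = sqrt(4 + 1 + 9) and sqrt(1 + 4)
print(get_cosine(v1, v2))  # 8 / (sqrt(14) * sqrt(5)) ≈ 0.956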


 

The entire code is given here. It takes a set of URLs from a file called “urls_file.txt”, crawls them, indexes them, and, for each document indexed, fetches the nearest documents and outputs the cosine similarity scores into a file “output.csv” in the current directory. Please note that lxml has to be installed for this script to run.
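
In case the linked script is unavailable, here is a minimal sketch of how the pieces above fit together. It reuses the helper functions defined earlier and uses lxml only to strip the HTML down to text; it is an illustration under those assumptions, not the original script:

import csv
import urllib.request

import elasticsearch
import lxml.html

idxname = "pages"
es = elasticsearch.Elasticsearch()
create_or_clear_index(idxname)

# Crawl each URL and index its visible text, using the URL as title/id
with open("urls_file.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    html = urllib.request.urlopen(url).read()
    text = lxml.html.fromstring(html).text_content()
    index_the_text((url, text))

# Refresh so the newly indexed documents become searchable
es.indices.refresh(index=idxname)

# For every document, fetch its 5 nearest neighbours and score them
with open("output.csv", "w") as out:
    writer = csv.writer(out)
    for doc_id in urls:
        base_tv = get_tv_dict(es.termvector(index=idxname,
                                            doc_type="page", id=doc_id))
        mlts = es.mlt(index=idxname, doc_type="page", id=doc_id,
                      mlt_fields="page_text", search_size=5)
        for hit in mlts["hits"]["hits"]:
            other_tv = get_tv_dict(es.termvector(index=idxname,
                                                 doc_type="page",
                                                 id=hit["_id"]))
            writer.writerow([doc_id, hit["_id"],
                             get_cosine(base_tv, other_tv)])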

 
