Document Similarity Analysis Using Elasticsearch and Python

Elasticsearch is an open source search engine based on Lucene. It is used by market leaders such as Wikipedia, LinkedIn, and eBay, and it has an official Python client, elasticsearch-py.


You can download Elasticsearch from here. To install, just unzip the downloaded file and run bin/elasticsearch.bat on Windows, or bin/elasticsearch on Unix systems. Visit http://localhost:9200/ in your browser to check that it is installed properly.


Further information about installation and setup of elasticsearch can be found here


The python client can be installed by running


pip install elasticsearch


The process of generating cosine similarity scores for documents using Elasticsearch involves the following steps:

  1. Creating an index

  2. Index the individual documents

  3. Search and get the matched documents and term vectors for a document

  4. Calculate cosine similarity score using the term vectors


Creating an index


An index is like a ‘database’ in a relational database system. It is a data organization mechanism that lets the user partition data a certain way. Elasticsearch also uses the index to decide how to distribute data around the cluster.


Unlike the databases of an RDBMS, indices are lightweight, so you can create hundreds of them without running into any problems.


The following is the code to create an index


import elasticsearch

es = elasticsearch.Elasticsearch()


This initializes the Elasticsearch client, and the es.indices.create call actually creates the index. Here we pass the index name parameter and the body parameter, which contains the settings and mappings that configure the index.


def create_or_clear_index(_obj):
    index_name = _obj
    es = elasticsearch.Elasticsearch()

    # Delete the index if one already exists
    try:
        es.indices.delete(index = index_name)
    except elasticsearch.NotFoundError:
        pass

    # Set up a fresh index and mapping
    es.indices.create(index = index_name,
                      body = {
                          "mappings": {
                              "page": {
                                  "_source": { "enabled": True },
                                  "properties": {
                                      "url": {
                                          "type": "string"
                                      },
                                      "page_text": {
                                          "type": "string",
                                          "term_vector": "yes"
                                      },
                                      "title": {
                                          "type": "string",
                                          "term_vector": "yes"
                                      }
                                  }
                              }
                          }
                      })

Index individual documents


The index call adds a JSON document to a named index and thereby makes it searchable.

Here we pass the index name, the type of the document, and the document itself.


def index_the_text(inp):
    page_title, text_data = inp
    try:
        es.index(index = idxname,
                 doc_type = "page",
                 id = page_title,
                 body = {
                     "title": page_title,
                     "page_text": text_data
                 })
        print "-" * 10
    except Exception, e:
        print e


The above function takes a tuple containing the page title and the text as its input parameter.


Just for the sake of this problem, we assume the title of a document is a unique identifier and index it as the id of the document.
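As a toy illustration (the page data here is hypothetical), the input to index_the_text is a (title, text) tuple, with the title doubling as the document id:

```python
# Hypothetical (title, text) tuples of the kind passed to index_the_text;
# the title doubles as the document id, so it must be unique
pages = [
    ("Elasticsearch", "Elasticsearch is a search engine based on Lucene."),
    ("Lucene", "Lucene is a full-text search library written in Java."),
]

titles = [title for title, _ in pages]
# every title is unique, so each can safely serve as a document id
assert len(titles) == len(set(titles))
```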


Get the 5 most similar documents for every document


To get the most similar documents we use mlt, which stands for “more like this”. This API call returns the documents that are “like” the reference document we pass.


mlts = es.mlt(index=index_name, doc_type="page",
              id=doc_id, mlt_fields="page_text",
              search_size=5)

The mlt_fields parameter specifies the exact fields to run the query against, and the search_size parameter specifies the number of documents to return. Apart from these, you can also specify stop words and many other parameters.
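The response is a standard search result. As a sketch, pulling the matched ids out of a hypothetical (abridged) mlt response might look like this:

```python
# Abridged, hypothetical shape of an mlt response (ES 1.x-era client)
mlts = {
    "hits": {
        "total": 2,
        "hits": [
            {"_id": "Lucene", "_score": 1.2},
            {"_id": "Solr",   "_score": 0.9},
        ],
    }
}

# ids of the documents judged most similar to the reference document
similar_ids = [hit["_id"] for hit in mlts["hits"]["hits"]]
```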


Get the term vectors

tvjson = es.termvector(index=index_name, doc_type="page",
                       id=doc_id)

The above call returns information and statistics about the terms in the fields of a particular document. We need this data to calculate the cosine similarity scores of the documents.


To extract the term vectors from the statistics returned, we write a small helper function.


def get_tv_dict(tvjson):
    # reduce the termvector response to a plain term -> frequency dict
    terms = tvjson['term_vectors']['page_text']['terms']
    return dict([ (k, v['term_freq']) for k, v in terms.items() ])




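For instance, a hypothetical (abridged) termvector response reduces to a plain term-to-frequency dict like so:

```python
# Abridged, hypothetical termvector response for the "page_text" field
tvjson = {
    "term_vectors": {
        "page_text": {
            "terms": {
                "search": {"term_freq": 2, "ttf": 5},
                "engine": {"term_freq": 1, "ttf": 3},
            }
        }
    }
}

# keep only the per-term frequencies, discarding the other statistics
tv = {k: v["term_freq"]
      for k, v in tvjson["term_vectors"]["page_text"]["terms"].items()}
```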
Once we have the term vectors for a pair of documents, we can calculate their cosine similarity score. Given below is the function that does this, taking the two term vectors as input.


Calculate the cosine similarity score

import math

def get_cosine(vec1, vec2):
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator
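As a quick sanity check, here is the same computation carried out inline on two toy term-frequency vectors (both hypothetical):

```python
import math

# Two toy term-frequency vectors for hypothetical documents
vec1 = {"search": 2, "engine": 1, "lucene": 1}
vec2 = {"search": 1, "engine": 1, "java": 2}

common = set(vec1) & set(vec2)                       # shared terms
numerator = sum(vec1[t] * vec2[t] for t in common)   # 2*1 + 1*1 = 3
denominator = (math.sqrt(sum(v ** 2 for v in vec1.values()))
               * math.sqrt(sum(v ** 2 for v in vec2.values())))
score = numerator / denominator                      # 3 / 6 ≈ 0.5
```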


The entire code is given here. It takes a set of URLs from a file called “urls_file.txt”, crawls them, indexes them, and, for each document indexed, fetches the nearest documents and writes the cosine similarity scores to a file “output.csv” in the current directory. Please note that lxml has to be installed for this script to run.

