Elasticsearch is an open source search engine based on Lucene. It is used by market leaders such as Wikipedia, LinkedIn, and eBay, and it has an official Python client, elasticsearch-py.
You can download Elasticsearch from here. To install, just unzip the downloaded file and run bin/elasticsearch.bat on Windows, or bin/elasticsearch on Unix systems. Visit http://localhost:9200/ in your browser to check that it is installed properly.
Further information about installing and setting up Elasticsearch can be found here.
The Python client can be installed by running:
pip install elasticsearch
The process of generating cosine similarity scores for documents using Elasticsearch involves the following steps:
Creating an index
Index the individual documents
Search and get the matched documents and term vectors for a document
Calculate cosine similarity score using the term vectors
Creating an index
An index is roughly analogous to a database in a relational system. It is a data organization mechanism that lets you partition data a certain way. Elasticsearch also uses the index to decide how to distribute data around the cluster.
Unlike RDBMS databases, indices are lightweight, so you can create hundreds of them without running into any problems.
The following is the code to create an index. The call es = elasticsearch.Elasticsearch() initializes the Elasticsearch client, and es.indices.create actually creates the index; we pass it the index name and a body parameter containing the various settings and mappings that configure the index.
import elasticsearch

def create_or_clear_index(index_name):
    es = elasticsearch.Elasticsearch()
    # Delete the index if one already exists
    try:
        es.indices.delete(index=index_name)
    except elasticsearch.NotFoundError:
        pass
    # Set up a fresh index and mapping
    es.indices.create(index=index_name, body={
        "mappings": {
            "page": {
                "_source": {"enabled": True},
                "properties": {
                    "url": {"type": "string"},
                    "page_text": {"type": "string", "term_vector": "yes"},
                    "title": {"type": "string", "term_vector": "yes"}
                }
            }
        }
    })
Index individual documents
es.index puts a JSON document into a named index, making it searchable. Here we pass the index name, the type of the document, and the document itself.
def index_the_text(inp):
    page_title, text_data = inp
    try:
        es.index(index=idxname,
                 doc_type="page",
                 id=page_title,
                 body={
                     "title": page_title,
                     "page_text": text_data
                 })
        print("-" * 10)
    except Exception as e:
        print(e)
The above function takes a tuple containing the page title and the text as its input. For the sake of this problem we assume the title of the document is a unique identifier, and we index it as the id of the document.
Get the 5 most similar documents for every document
To get the most similar documents we use mlt, which stands for “more like this”. This API call returns the documents that are “like” the reference document we pass.
mlts = es.mlt(index=index_name, doc_type="page",
              id=doc_id, mlt_fields="page_text",
              search_size=5)
The mlt_fields parameter specifies the exact fields to run the query against, and the search_size parameter specifies the number of documents to return. Apart from these, you can also specify stop words and many other parameters.
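A note for newer Elasticsearch versions: the standalone mlt endpoint was removed in Elasticsearch 2.0, and the same functionality is now expressed as a more_like_this search query. A rough sketch of an equivalent request body is given below; the index, type, and id values are placeholders, not part of the original example.

```json
{
  "size": 5,
  "query": {
    "more_like_this": {
      "fields": ["page_text"],
      "like": [
        {"_index": "index_name", "_type": "page", "_id": "doc_id"}
      ]
    }
  }
}
```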
Get the term vectors
tvjson = es.termvector(index=index_name, doc_type="page",
                       id=doc_id)
The above call returns information and statistics about the terms in the fields of a particular document. We need this data to calculate the cosine similarity score of the documents. To extract the term vectors from the statistics returned, we write a small helper function:
def get_tv_dict(tvjson):
    # Map each term in the "page_text" field to its term frequency
    return {k: v['term_freq']
            for k, v in tvjson['term_vectors']['page_text']['terms'].items()}
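As a quick sanity check, the helper can be exercised against a hand-written fragment that mimics the shape of a termvector response; the sample below is illustrative, not real API output.

```python
def get_tv_dict(tvjson):
    # Map each term in the "page_text" field to its term frequency
    return {k: v['term_freq']
            for k, v in tvjson['term_vectors']['page_text']['terms'].items()}

# Hypothetical fragment shaped like a termvector API response
sample = {
    "term_vectors": {
        "page_text": {
            "terms": {
                "search": {"term_freq": 3},
                "engine": {"term_freq": 2},
                "lucene": {"term_freq": 1}
            }
        }
    }
}

print(get_tv_dict(sample))  # {'search': 3, 'engine': 2, 'lucene': 1}
```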
Once we have the term vectors for two documents, we can calculate their cosine similarity score: the dot product of the term-frequency vectors divided by the product of their magnitudes. Given below is the function that does this.
Calculate the cosine similarity score
import math

def get_cosine(vec1, vec2):
    # Dot product over terms the two vectors share
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)
    # Magnitudes of the two term-frequency vectors
    sum1 = sum(v ** 2 for v in vec1.values())
    sum2 = sum(v ** 2 for v in vec2.values())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    if not denominator:
        return 0.0
    return float(numerator) / denominator
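To see the function in action, here is a worked example on two small hand-made term-frequency vectors; the terms and counts are invented for illustration.

```python
import math

def get_cosine(vec1, vec2):
    # Dot product over terms the two vectors share
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)
    # Magnitudes of the two term-frequency vectors
    sum1 = sum(v ** 2 for v in vec1.values())
    sum2 = sum(v ** 2 for v in vec2.values())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    if not denominator:
        return 0.0
    return float(numerator) / denominator

vec1 = {"search": 1, "engine": 2}
vec2 = {"search": 2, "engine": 1}
# Dot product = 1*2 + 2*1 = 4; magnitudes = sqrt(5) each, so score is 4/5
print(get_cosine(vec1, vec2))  # ~0.8
```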
The entire code is given here. It takes a set of URLs from a file called “urls_file.txt”, crawls them, indexes them, and, for each document indexed, fetches the nearest documents and writes the cosine similarity scores to a file “output.csv” in the current directory. Please note that lxml has to be installed for this script to run.
Posted 1 March 2021
© 2021 TechTarget, Inc.