Calculate Cosine Similarity Using Scipy – Data Sets & Sample Code

What is Cosine Similarity?

Cosine similarity measures how alike two vectors are by computing the cosine of the angle between them. It ranges from −1, meaning the vectors point in exactly opposite directions, to 1, meaning they point in exactly the same direction. A value of 0 means the vectors are orthogonal, which is usually read as unrelated, and in-between values indicate intermediate similarity or dissimilarity.
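As a minimal sketch of that definition, the similarity of vectors a and b is their dot product divided by the product of their lengths: cos θ = a·b / (‖a‖ ‖b‖). The function name `cosine_similarity` and the small test vectors below are illustrative choices, not part of the original sample code.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Illustrative vectors covering the three reference points of the scale.
same = cosine_similarity([1, 2, 3], [2, 4, 6])         # same direction
orthogonal = cosine_similarity([1, 0], [0, 1])         # right angle
opposite = cosine_similarity([1, 2, 3], [-1, -2, -3])  # opposite direction
print(same, orthogonal, opposite)
```

Note that scaling a vector does not change the result: [1, 2, 3] and [2, 4, 6] count as identical in direction, so cosine similarity compares the shape of two profiles rather than their magnitude.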

We have shared data sets, sample code & an example case study in implementing Cosine Similarity.

The Case:

We are looking for a place to settle down in California. We like a place called Montecito, CA and want to find similar towns & cities where we could look for a home. How would we go about doing it?

We would first try to understand the factors & variables that are important to us. Variables could include the median home value, the percentage of homes with 2, 3 or 4 bedrooms, the age of the homes, and so on. We prepared a sample set of fields & example values to give you an idea of the variables we used.

Place in California               Montecito  Atherton  Tiburon   Los_Altos_Hills
Median Home Value                 $1000001   $1000001  $1000001  $1000001
% of Homes Built 2000 to 2009     10         11        8         11
% of Homes Built 1990 to 1999     11         6         11        10
% of Homes Built 1980 to 1989     13         5         19        20
% of Homes Built 1970 to 1979     11         10        28        23
% of Homes Built 1960 to 1969     16         31        22        18
% of Homes Built 1950 to 1959     6          13        3         5
% of Homes Built 1940 to 1949     23         14        4         5
% of Homes Built 1939 or earlier  10         10        6         9
% of Homes No bed rooms           2          0         2         0
% of Homes 1 bed rooms            7          2         9         0
% of Homes 2 bed rooms            21         3         25        6
% of Homes 3 bed rooms            37         21        35        20
% of Homes 4 bed rooms            22         39        19        40
% of Homes 5 or more bed rooms    11         35        10        35
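To make the sample values concrete, here is a rough sketch that scores each of the four places against Montecito using NumPy. It uses only the fourteen "% of Homes" rows (the same year-built and bedroom fields the article's script reads) and leaves the median home value out. The variable names are illustrative, and because the input is just this small sample, the printed scores need not match the scores the article reports from the full data set.

```python
import numpy as np

places = ["Montecito", "Atherton", "Tiburon", "Los_Altos_Hills"]

# One row per place: the eight "% built" values followed by the six
# "% bedrooms" values, copied from the sample table.
profiles = np.array([
    [10, 11, 13, 11, 16, 6, 23, 10, 2, 7, 21, 37, 22, 11],  # Montecito
    [11, 6, 5, 10, 31, 13, 14, 10, 0, 2, 3, 21, 39, 35],    # Atherton
    [8, 11, 19, 28, 22, 3, 4, 6, 2, 9, 25, 35, 19, 10],     # Tiburon
    [11, 10, 20, 23, 18, 5, 5, 9, 0, 0, 6, 20, 40, 35],     # Los_Altos_Hills
], dtype=float)

reference = profiles[0]  # Montecito

# Cosine similarity of every row with the reference row in one step:
# dot products divided by the product of the vector lengths.
norms = np.linalg.norm(profiles, axis=1)
scores = profiles @ reference / (norms * np.linalg.norm(reference))

for place, score in zip(places, scores):
    print(f"{place}: {score:.3f}")
```

Montecito compared with itself scores 1, which is a quick sanity check that the rows were entered consistently.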

The Full Data Set to test the Cosine Similarity Algorithms can be downloaded here

To implement the Cosine Similarity algorithm & to find similar locations, you can run the following sample code using SciPy & Python.

Python source code

import csv
from collections import OrderedDict
from operator import itemgetter

import numpy as np

# Columns in the file header:
# geoid, place, median value, Built 2000 to 2009, Built 1990 to 1999,
# Built 1980 to 1989, Built 1970 to 1979, Built 1960 to 1969,
# Built 1950 to 1959, Built 1940 to 1949, Built 1939 or earlier,
# No bed rooms, 1 bed rooms, 2 bed rooms, 3 bed rooms, 4 bed rooms,
# 5 or more bed rooms
FIELDS = [
    'Built 2000 to 2009', 'Built 1990 to 1999', 'Built 1980 to 1989',
    'Built 1970 to 1979', 'Built 1960 to 1969', 'Built 1950 to 1959',
    'Built 1940 to 1949', 'Built 1939 or earlier', 'No bed rooms',
    '1 bed rooms', '2 bed rooms', '3 bed rooms', '4 bed rooms',
    '5 or more bed rooms',
]

# Reference profile that every place is compared against, one value per
# entry in FIELDS. These values differ from the Montecito column in the
# sample table; substitute that column to rank places against Montecito.
california_home_values = [12, 11, 15, 18, 14, 14, 7, 10, 4, 14, 28, 33, 16, 4]
reference_norm = np.linalg.norm(california_home_values)

cos_sim_dict = OrderedDict()

# Python 3: open the file in text mode with newline='' for the csv module.
with open('state_analysis.csv', newline='') as csv_file:
    for row in csv.DictReader(csv_file, delimiter='|'):
        place = row['place']
        # int() raises ValueError if a field is not a whole number.
        place_home_values = [int(row[field]) for field in FIELDS]
        # Cosine similarity: dot product divided by the product of the norms.
        c = np.dot(california_home_values, place_home_values) / (
            reference_norm * np.linalg.norm(place_home_values))
        cos_sim_dict[place] = c

# Highest similarity first.
sorted_val = sorted(cos_sim_dict.items(), key=itemgetter(1), reverse=True)
similar_city, similarity_score = sorted_val[0]

print("Similar city to California based on the Year built homes and No of Bed rooms is %s and score is %f" % (similar_city, similarity_score))
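As an aside, SciPy also ships a ready-made helper, `scipy.spatial.distance.cosine`, which returns the cosine distance (1 minus the similarity) rather than the similarity itself. The sketch below, which assumes SciPy is installed, compares the script's reference vector with Montecito's values from the sample table. The variable names `reference` and `candidate` are illustrative.

```python
from scipy.spatial.distance import cosine

# Reference vector used by the sample script.
reference = [12, 11, 15, 18, 14, 14, 7, 10, 4, 14, 28, 33, 16, 4]
# Montecito's year-built and bedroom percentages from the sample table.
candidate = [10, 11, 13, 11, 16, 6, 23, 10, 2, 7, 21, 37, 22, 11]

# SciPy returns a distance (0 means same direction), so convert it back
# to a similarity by subtracting it from 1.
similarity = 1.0 - cosine(reference, candidate)
print("similarity: %f" % similarity)
```

Forgetting the `1.0 -` step is an easy mistake to make: a small distance means a high similarity, so sorting by the raw return value would rank the least similar places first.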

After running the code, you will see that the closest matches to Montecito in California, in terms of home values, age of homes & size of homes, are Atherton, Tiburon & Los Altos Hills, all of which have a similarity score of about 0.99 (99%) to Montecito. Here are the similarity scores to Montecito for the three closest locations.

[[0.991, 'Atherton'], [0.99, 'Tiburon'], [0.99, 'Los Altos Hills']]
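The sample script prints only the single best match, so a list of the three closest places like the one above needs one more step. A possible sketch, assuming the scores are held in a dictionary keyed by place name as in the script: the helper name `top_matches` is hypothetical, and the scores in `example_scores` are made up purely to exercise it.

```python
from operator import itemgetter


def top_matches(scores, n=3):
    """Return the n highest-scoring places as [rounded score, place] pairs."""
    ranked = sorted(scores.items(), key=itemgetter(1), reverse=True)
    return [[round(score, 3), place] for place, score in ranked[:n]]


# Hypothetical scores, not taken from the data set.
example_scores = {"Place A": 0.91, "Place B": 0.987, "Place C": 0.42, "Place D": 0.95}
print(top_matches(example_scores))
# prints [[0.987, 'Place B'], [0.95, 'Place D'], [0.91, 'Place A']]
```

If the reference place itself is present in the data, it will rank first with a score of 1, so it may be worth dropping it from the dictionary before calling the helper.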

Comment

Comment by leonardo auslender on April 20, 2015 at 3:53am

Thanks, this is the definition of correlation between two variables.

© 2017 Data Science Central