Who are alike? Use BigObject feature vector to find similarities

Cluster analysis is a common technique for grouping a set of objects so that the objects in the same group share certain attributes. It is commonly used in marketing and sales planning to define market segments.


Here at BigObject we adopt a simple approach to exploring the similarities between objects. We calculate a “feature vector” for each object from the given attributes and use the distance between vectors as a score to determine which objects are “alike.”
The following simple example shows how to use BigObject to extract product features and then find similar products in your retail data. We use the default sample data in the BigObject docker image to demonstrate the task. You can run the docker image on your own computer or try it out in our sandbox.


The sample data schema is:

For example, suppose we would like to extract each product’s features based on the average quantity sold in each channel.
First, build a table “avg_qty_by_channel” to store each product’s average quantity sold per channel with the statement:

BUILD TABLE avg_qty_by_channel AS (FIND Product.id, channel_name, avg(qty) FROM sales)

The table “avg_qty_by_channel” would be:
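
As a rough picture of what this aggregation computes, the following Python sketch performs the same grouping outside BigObject. It is only an illustration: the rows in `sales` are made up (they are not the BigObject sample data), and the function name `avg_qty_by_channel` is borrowed from the table name purely for readability.

```python
from collections import defaultdict

# Hypothetical sales rows: (product_id, channel_name, qty).
# These values are invented for illustration, not taken from the sample data.
sales = [
    ("p1", "web", 2), ("p1", "web", 4), ("p1", "store", 1),
    ("p2", "web", 3), ("p2", "store", 5), ("p2", "store", 7),
]

def avg_qty_by_channel(rows):
    """Return {(product_id, channel_name): average qty} for the given rows."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of qty, row count]
    for product_id, channel, qty in rows:
        acc = totals[(product_id, channel)]
        acc[0] += qty
        acc[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

averages = avg_qty_by_channel(sales)
for (product_id, channel), avg in sorted(averages.items()):
    print(product_id, channel, avg)
```

Each output row pairs one product with one channel, which is the "long" layout that the next step pivots into one row per product.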

Now, convert the “avg_qty_by_channel” table into a product feature vector table with the trans-pivot operation, which turns each channel name into its own column:

BUILD TABLE Product_feature(*, channel_name[*]:'AVG(qty)') FROM avg_qty_by_channel {default_type:DOUBLE}

The feature table “Product_feature” would be:
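
To show the effect of the pivot, here is a minimal Python sketch that turns long-format rows into one feature vector per product. The input rows and the helper name `pivot_features` are hypothetical, and filling a missing product/channel combination with 0.0 is an assumption made for this sketch; how BigObject itself fills missing cells is not stated here.

```python
# Hypothetical long-format rows: (product_id, channel_name, avg_qty).
avg_rows = [
    ("p1", "store", 1.0), ("p1", "web", 3.0),
    ("p2", "store", 6.0),  # p2 has no "web" row
]

def pivot_features(rows, default=0.0):
    """Pivot rows into (channel list, {product_id: feature vector}).

    Each vector has one slot per channel, in sorted channel order.
    Missing combinations get `default` (an assumption of this sketch).
    """
    channels = sorted({channel for _, channel, _ in rows})
    index = {channel: i for i, channel in enumerate(channels)}
    features = {}
    for product_id, channel, avg in rows:
        vec = features.setdefault(product_id, [default] * len(channels))
        vec[index[channel]] = avg
    return channels, features

channels, product_feature = pivot_features(avg_rows)
print(channels)
for product_id, vec in sorted(product_feature.items()):
    print(product_id, vec)
```

After the pivot, every product is described by a vector of the same length, so two products can be compared slot by slot.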

Finally, we write a simple Lua function that defines a distance function (average difference) and scans the product feature table to find the most similar product for each product. Because the scan compares every pair of products, it runs in O(n^2) time.
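
The idea can be sketched in Python as follows. This is not the “findSimP” Lua function and does not use any BigObject API; it is a standalone illustration that assumes "average difference" means the mean absolute difference between corresponding vector entries. The names `avg_difference` and `find_similar` and the sample vectors are hypothetical.

```python
def avg_difference(a, b):
    """Mean absolute difference between two equal-length feature vectors.

    This reading of "average difference" is an assumption of the sketch.
    """
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def find_similar(features):
    """For each product, find the other product with the smallest distance.

    Compares every pair of products, so the cost is O(n^2) in the
    number of products.
    """
    result = {}
    for product_id, vec in features.items():
        best_id, best_dist = None, float("inf")
        for other_id, other_vec in features.items():
            if other_id == product_id:
                continue  # a product is not its own neighbour
            dist = avg_difference(vec, other_vec)
            if dist < best_dist:
                best_id, best_dist = other_id, dist
        result[product_id] = (best_id, best_dist)
    return result

# Hypothetical feature vectors (one slot per channel).
product_feature = {
    "p1": [1.0, 3.0],
    "p2": [6.0, 0.0],
    "p3": [2.0, 3.0],
}
sim_product = find_similar(product_feature)
for product_id, (sim_id, dist) in sorted(sim_product.items()):
    print(product_id, sim_id, dist)
```

Each result row has the same shape as the “simProduct” table: a product, its most similar product, and the distance between them.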


The result is stored in the “simProduct” table, which is created by the statement:

CREATE TABLE simProduct (Product.id STRING, simProductId STRING, distance DOUBLE, KEY(Product.id))

After this, upload the Lua function “findSimP” and run it with:

APPLY findSimP(Product_feature, simProduct)

This may take some time (80 to 90 seconds), since the implementation is not optimal: it compares every pair of products, which is O(n^2).
The result can be viewed with a SELECT statement:

SELECT * FROM simProduct LIMIT 10

The result would be:

Tags: analytics, clustering, datamining
