
 Dear All,

I'm working with a dataset of latitude, longitude, and date-time, with 5 million points per day.

I'm coding in Python, with a ClickHouse database storing the source data.


Is there a way to do spatio-temporal clustering that includes all 3 features?


So far I have scaled/normalized the 3 features and used MiniBatchKMeans (my current solution) or a Euclidean distance, but I'm losing the notion of physical distance between points.

DBSCAN or HDBSCAN with the haversine metric accepts only 2 features (lat/lon in radians).
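
To illustrate the constraint, here is a minimal sketch of one way to keep physical distance while still using time: test spatial and temporal proximity separately instead of merging all three features into one scaled space. The function names and the thresholds (`eps_m`, `eps_s`) are hypothetical and only for illustration. This is a neighbor test, not a full clustering algorithm, and it does nothing for the scaling problem by itself.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against rounding just above the asin domain.
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


def is_st_neighbor(p, q, eps_m=500.0, eps_s=900.0):
    """Return True if p and q are close in both space and time.

    p and q are (lat_deg, lon_deg, unix_seconds) tuples.
    eps_m (meters) and eps_s (seconds) are assumed example thresholds.
    """
    if abs(p[2] - q[2]) > eps_s:
        return False  # cheap time check first, skips the trigonometry
    return haversine_m(p[0], p[1], q[0], q[1]) <= eps_m
```

Both thresholds stay in physical units (meters and seconds), so neither depends on how the features were normalized.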

Also, the volume rules out non-optimized solutions that don't scale (I've tried the ST-DBSCAN implementation available on GitHub and stopped it after a 15-hour run on just 2 hours of data).

I also don't have an expected number of clusters, and it should change depending on the day.

Do you have a better idea?

Thanks

Tags: clustering, spatio-temporal



Replies to This Discussion

With three dimensions, you could try to bin the features and perform clustering on the grid using density estimation techniques. It requires some pre-processing, but it is then O(1) to classify a new point, see https://www.datasciencecentral.com/profiles/blogs/variance-clusteri...
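
As a rough sketch of the grid idea (not necessarily the method in the linked article), one option is to keep only cells whose count reaches a threshold and label connected groups of those dense cells as clusters. The function name, the `min_count` threshold, and the choice of 26-cell adjacency are assumptions made for illustration.

```python
from collections import deque
from itertools import product


def cluster_dense_cells(cell_counts, min_count=2):
    """Label connected groups of dense cells in a 3-D (x, y, t) grid.

    cell_counts maps integer (x, y, t) cell indices to point counts.
    Returns a dict mapping each dense cell to a cluster label.
    """
    dense = {cell for cell, n in cell_counts.items() if n >= min_count}
    # All 26 neighbors of a cell in a 3-D grid (every offset but zero).
    offsets = [d for d in product((-1, 0, 1), repeat=3) if d != (0, 0, 0)]

    labels = {}
    next_label = 0
    for start in sorted(dense):
        if start in labels:
            continue
        # Breadth-first search over adjacent dense cells.
        labels[start] = next_label
        queue = deque([start])
        while queue:
            x, y, t = queue.popleft()
            for dx, dy, dt in offsets:
                neighbor = (x + dx, y + dy, t + dt)
                if neighbor in dense and neighbor not in labels:
                    labels[neighbor] = next_label
                    queue.append(neighbor)
        next_label += 1
    return labels
```

Once the labels are built, classifying a new point is a single dictionary lookup on its cell index, for example `labels.get(cell, -1)` with -1 meaning noise. The number of clusters is not fixed in advance; it falls out of the data and the threshold.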

Hi Vincent,

Yesterday I tried binning to reduce the volume:

grouping geographically close points and binning time into 15-minute intervals.

That allowed me to reduce the dataset from five million points per day to around eighty thousand rows.

It also adds a new value, the number of points per row, which I've named weight but which can be seen as a density.
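
For anyone following along, the binning step could look roughly like the sketch below. The spatial cell size (`CELL_DEG`) and the function name are hypothetical; only the 15-minute interval comes from the description above. With 5 million rows per day, the same aggregation may be better done as a GROUP BY in the database, and this pure-Python version only shows the logic.

```python
import math
from collections import Counter
from datetime import datetime, timezone

CELL_DEG = 0.01        # assumed spatial cell size in degrees (hypothetical)
BIN_SECONDS = 15 * 60  # 15-minute time bins


def bin_points(points, cell_deg=CELL_DEG, bin_seconds=BIN_SECONDS):
    """Aggregate (lat, lon, datetime) points into space-time cells.

    Returns a Counter mapping (lat_idx, lon_idx, time_idx) to the
    number of points in that cell, i.e. the weight.
    """
    weights = Counter()
    for lat, lon, ts in points:
        key = (
            math.floor(lat / cell_deg),
            math.floor(lon / cell_deg),
            int(ts.timestamp()) // bin_seconds,
        )
        weights[key] += 1
    return weights


utc = timezone.utc
sample = [
    (48.8566, 2.3522, datetime(2019, 1, 1, 12, 0, tzinfo=utc)),
    (48.8569, 2.3528, datetime(2019, 1, 1, 12, 7, tzinfo=utc)),
    (48.8566, 2.3522, datetime(2019, 1, 1, 12, 20, tzinfo=utc)),
    (40.7128, -74.0060, datetime(2019, 1, 1, 12, 1, tzinfo=utc)),
]
weights = bin_points(sample)
```

Note that a fixed cell size in degrees does not cover the same ground distance at every latitude, so the weight is only roughly comparable as a density across distant regions.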

I will look further at your article about density estimation techniques.

Thanks.


© 2019 Data Science Central ®