I'm working with a dataset with latitude, longitude and date-time, and 5 million points per day.
I'm coding in Python, with a ClickHouse database storing the source data.
Is there a way to do a spatio-temporal clustering that includes the 3 features?
So far I have scaled/normalized the 3 features and used MiniBatchKMeans (my current solution) or a Euclidean distance, but I'm losing the notion of physical distance between points.
DBSCAN or HDBSCAN with the haversine metric accepts only 2 features (lat/lon in radians).
The volume also rules out non-optimized solutions that don't scale (I tried the ST-DBSCAN available on GitHub and stopped it after a 15-hour run on just 2 hours of data).
I also don't have an expected number of clusters, and it should change depending on the day.
Do you have a better idea?
With three dimensions, you could try to bin the features and perform clustering on the grid using density estimation techniques. It requires some pre-processing, but it is then O(1) to classify a new point, see https://www.datasciencecentral.com/profiles/blogs/variance-clusteri...
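To make the binning idea concrete, here is a minimal sketch of the pre-processing and lookup steps only, not the method from the linked article. It assumes the points are already in a NumPy array of (lat, lon, time-in-seconds) rows. The function names `build_density_grid` and `lookup_density` and the cell sizes are made up for illustration. Grouping dense cells into actual clusters is left out.

```python
import numpy as np

def build_density_grid(points, bin_sizes):
    """Count points per 3-D cell. points: (n, 3) array; bin_sizes: cell size per axis."""
    points = np.asarray(points, dtype=float)
    bin_sizes = np.asarray(bin_sizes, dtype=float)
    origin = points.min(axis=0)
    # Integer cell index of every point along each axis.
    cells = np.floor((points - origin) / bin_sizes).astype(np.int64)
    uniq, counts = np.unique(cells, axis=0, return_counts=True)
    # Sparse grid: only non-empty cells are stored.
    grid = {tuple(c): n for c, n in zip(uniq.tolist(), counts.tolist())}
    return origin, grid

def lookup_density(point, origin, bin_sizes, grid):
    """Constant-time lookup of the count in the cell containing `point`."""
    idx = np.floor((np.asarray(point, dtype=float) - origin) / np.asarray(bin_sizes, dtype=float))
    return grid.get(tuple(idx.astype(np.int64).tolist()), 0)

# Hypothetical cell sizes: 1 degree in lat/lon, 10 seconds in time.
pts = [[0.0, 0.0, 0.0], [0.1, 0.1, 1.0], [5.0, 5.0, 100.0]]
sizes = (1.0, 1.0, 10.0)
origin, grid = build_density_grid(pts, sizes)
```

Once the grid is built, classifying a new point is a single dictionary lookup, which is where the O(1) cost comes from.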
Yesterday I tried binning to reduce the volume:
I grouped geographically close points and binned time into 15-minute intervals.
That allows me to reduce the dataset from five million points per day to around eighty thousand.
It also adds a new value, the number of points per row, which I've named weight but which can be seen as a density.
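For anyone wanting to reproduce this kind of reduction, a rough sketch with pandas could look like the following. It is not my exact code. The column names `lat`, `lon`, `ts`, the helper `bin_points`, and the 0.01-degree rounding used to group nearby points are all assumptions for illustration.

```python
import pandas as pd

def bin_points(df, cell_deg=0.01, freq="15min"):
    """Snap points to a lat/lon grid and a time grid, then count points per cell."""
    out = df.copy()
    # Assumed spatial grouping: round coordinates to the nearest grid node.
    out["lat_bin"] = (out["lat"] / cell_deg).round() * cell_deg
    out["lon_bin"] = (out["lon"] / cell_deg).round() * cell_deg
    # Floor timestamps to the start of their 15-minute interval.
    out["time_bin"] = out["ts"].dt.floor(freq)
    return (
        out.groupby(["lat_bin", "lon_bin", "time_bin"])
        .size()
        .reset_index(name="weight")
    )

df = pd.DataFrame({
    "lat": [48.8566, 48.8567, 48.9000],
    "lon": [2.3522, 2.3523, 2.3500],
    "ts": pd.to_datetime([
        "2024-01-01 10:02", "2024-01-01 10:07", "2024-01-01 10:20",
    ]),
})
binned = bin_points(df)
```

The `weight` column can then be passed as `sample_weight` to scikit-learn's `MiniBatchKMeans.fit` or `DBSCAN.fit`, so that a binned row counts for as many points as it replaces.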
I will look further at your article about density estimation techniques.