python 2.7 - Optimizing a DBSCAN to run computationally
I am running the DBSCAN algorithm in Python on a dataset (modelled on http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html and loaded as a pandas dataframe) that has a total of ~3 million data points across 31 days. Further, I do density clustering to find outliers on a per-day basis: db = DBSCAN(eps=0.3, min_samples=10).fit(data)
I only have one day's worth of data points to run on in each pass. The minimum/maximum number of points I have on a day are 15809 and 182416. I have tried deleting variables, but the process still gets killed at the DBSCAN clustering stage.
Even at O(n log n), memory bloats up no matter how I run it. I understand there is no way to pre-specify the number of "labels", or clusters - so what is the best approach here? Also, from an optimization point of view, many of the values of these data points are exact duplicates (think of them as repeated cluster points) - can I use this information to pre-process the data ahead of feeding it to DBSCAN?
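On the exact-duplicates point: scikit-learn's `DBSCAN.fit` accepts a `sample_weight` argument, so one option is to collapse duplicate rows into unique rows with multiplicity counts and cluster only the unique points. A minimal sketch (the toy data here is hypothetical; `np.unique(..., axis=0)` needs a reasonably recent NumPy):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: 50 unique points, each repeated 20 times.
rng = np.random.RandomState(0)
base = rng.rand(50, 2)
data = np.repeat(base, 20, axis=0)  # 1000 rows, only 50 unique

# Collapse exact duplicates, keeping a count per unique row.
unique_pts, inverse, counts = np.unique(
    data, axis=0, return_inverse=True, return_counts=True)

# Fit on the unique points only; weighting by multiplicity means
# min_samples still counts the duplicates toward core-point density.
db = DBSCAN(eps=0.3, min_samples=10).fit(unique_pts, sample_weight=counts)

# Map labels back to the original rows.
labels = db.labels_[inverse]
```

The memory-heavy neighborhood computation then scales with the number of unique points rather than the number of rows.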
I read this thread on using "canopy preclustering" to compress the data via vector quantization ahead of DBSCAN (note that the method can be equally expensive computationally) - can I use something similar to pre-process my data? Or what about a "parallel DBSCAN"?
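A cheaper stand-in for canopy preclustering is plain grid quantization: snap points to a grid somewhat finer than `eps`, then run DBSCAN on one representative per occupied cell, weighted by occupancy. This is an approximation (labels are per-cell, and the `eps/2` pitch is an assumed choice, not from the original post):

```python
import numpy as np
from sklearn.cluster import DBSCAN

eps = 0.3
cell = eps / 2.0  # grid pitch; a hypothetical choice, tune as needed

# Hypothetical stand-in for one day's worth of points.
rng = np.random.RandomState(0)
data = rng.rand(100000, 2)

# Quantize each point to a grid cell index.
cells = np.floor(data / cell).astype(np.int64)
uniq_cells, inverse, counts = np.unique(
    cells, axis=0, return_inverse=True, return_counts=True)

# Cluster cell centers, weighting each by how many points fell in it.
centers = (uniq_cells + 0.5) * cell
db = DBSCAN(eps=eps, min_samples=10).fit(centers, sample_weight=counts)

# Approximate per-point labels via each point's cell.
labels = db.labels_[inverse]
```

DBSCAN then only ever sees the occupied cells, which bounds memory regardless of how many raw points share a region.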
Things I have considered doing:
- Partitioning: cluster 1 day (or less) at a time
- Sampling: break the data set randomly into 10 parts and process them individually
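The partitioning idea can be sketched with a pandas groupby so that only one day's points are alive at the DBSCAN stage at a time (the `day`/`x`/`y` column names and toy data are assumptions, not from the original post):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

# Hypothetical frame: a 'day' column plus two feature columns.
rng = np.random.RandomState(0)
df = pd.DataFrame({
    "day": np.repeat([1, 2, 3], 200),
    "x": rng.rand(600),
    "y": rng.rand(600),
})

df["label"] = -1  # -1 doubles as DBSCAN's noise marker
for day, idx in df.groupby("day").groups.items():
    sub = df.loc[idx, ["x", "y"]].values
    db = DBSCAN(eps=0.3, min_samples=10).fit(sub)
    df.loc[idx, "label"] = db.labels_
    del db, sub  # free per-day memory before the next pass
```

Note that labels are then only meaningful within a day; cluster 0 on day 1 is unrelated to cluster 0 on day 2, which is fine for per-day outlier detection.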