python 2.7 - Optimizing a DBSCAN to run computationally
I am running the DBSCAN algorithm in Python on a dataset (modelled on http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html and loaded as a pandas dataframe) that has a total of ~3 million data points across 31 days. Further, I do density clustering to find outliers on a per-day basis: db = DBSCAN(eps=0.3, min_samples=10).fit(data)
I only have one day's worth of data points to run on in each pass. The minimum/maximum number of points I have on a day are 15809 and 182416. I have tried deleting variables, but the process still gets killed at the DBSCAN clustering stage.
Even at O(n log n), memory bloats up no matter how I run it. I understand there is no way to pre-specify the number of "labels", or clusters - so what is the best approach here? Also, from an optimization point of view, many of the values of these data points are exact duplicates (think of them as repeated cluster points) - can I use this information to pre-process the data ahead of feeding it to DBSCAN?
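On the exact-duplicates point: scikit-learn's `DBSCAN.fit` accepts a `sample_weight` argument, so one option is to collapse duplicate rows into unique rows with multiplicity counts and cluster only the unique points. A minimal sketch (the toy data here is hypothetical; `np.unique(..., axis=0)` needs a reasonably recent NumPy):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: 50 unique points, each repeated 20 times.
rng = np.random.RandomState(0)
base = rng.rand(50, 2)
data = np.repeat(base, 20, axis=0)  # 1000 rows, only 50 unique

# Collapse exact duplicates, keeping a count per unique row.
unique_pts, inverse, counts = np.unique(
    data, axis=0, return_inverse=True, return_counts=True)

# Fit on the unique points only; weighting by multiplicity means
# min_samples still counts the duplicates toward core-point density.
db = DBSCAN(eps=0.3, min_samples=10).fit(unique_pts, sample_weight=counts)

# Map labels back to the original rows.
labels = db.labels_[inverse]
```

The memory-heavy neighborhood computation then scales with the number of unique points rather than the number of rows.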
I read this thread on using "canopy preclustering" to compress the data via vector quantization ahead of DBSCAN (note that the method can be equally expensive computationally) - can I use something similar to pre-process my data? Or what about a "parallel DBSCAN"?
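A cheaper stand-in for canopy preclustering is plain grid quantization: snap points to a grid somewhat finer than `eps`, then run DBSCAN on one representative per occupied cell, weighted by occupancy. This is an approximation (labels are per-cell, and the `eps/2` pitch is an assumed choice, not from the original post):

```python
import numpy as np
from sklearn.cluster import DBSCAN

eps = 0.3
cell = eps / 2.0  # grid pitch; a hypothetical choice, tune as needed

# Hypothetical stand-in for one day's worth of points.
rng = np.random.RandomState(0)
data = rng.rand(100000, 2)

# Quantize each point to a grid cell index.
cells = np.floor(data / cell).astype(np.int64)
uniq_cells, inverse, counts = np.unique(
    cells, axis=0, return_inverse=True, return_counts=True)

# Cluster cell centers, weighting each by how many points fell in it.
centers = (uniq_cells + 0.5) * cell
db = DBSCAN(eps=eps, min_samples=10).fit(centers, sample_weight=counts)

# Approximate per-point labels via each point's cell.
labels = db.labels_[inverse]
```

DBSCAN then only ever sees the occupied cells, which bounds memory regardless of how many raw points share a region.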
Things I have considered doing:
- Partitioning: cluster 1 day (or less) at a time
- Sampling: break the data set randomly into 10 parts and process them individually
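The partitioning idea can be sketched with a pandas groupby so that only one day's points are alive at the DBSCAN stage at a time (the `day`/`x`/`y` column names and toy data are assumptions, not from the original post):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

# Hypothetical frame: a 'day' column plus two feature columns.
rng = np.random.RandomState(0)
df = pd.DataFrame({
    "day": np.repeat([1, 2, 3], 200),
    "x": rng.rand(600),
    "y": rng.rand(600),
})

df["label"] = -1  # -1 doubles as DBSCAN's noise marker
for day, idx in df.groupby("day").groups.items():
    sub = df.loc[idx, ["x", "y"]].values
    db = DBSCAN(eps=0.3, min_samples=10).fit(sub)
    df.loc[idx, "label"] = db.labels_
    del db, sub  # free per-day memory before the next pass
```

Note that labels are then only meaningful within a day; cluster 0 on day 1 is unrelated to cluster 0 on day 2, which is fine for per-day outlier detection.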