python 2.7 - Optimizing a DBSCAN to run computationally


I am running the DBSCAN algorithm in Python on a dataset (modelled similarly to http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html and loaded as a pandas DataFrame) that has a total of ~3 million data points across 31 days. Further, the density clustering is meant to find outliers on a per-day basis, so db = DBSCAN(eps=0.3, min_samples=10).fit(data) only has one day's worth of data points to run on in each pass (a rough per-day sketch is shown below). The minimum and maximum numbers of points I have on a day are 15,809 and 182,416. I have tried deleting variables, but the process still gets killed at the DBSCAN clustering stage.
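A minimal sketch of that per-day loop, assuming the DataFrame has a 'day' column and two numeric feature columns 'x' and 'y' (these column names are placeholders, not from the question), and following the scaling step used in the linked scikit-learn example:

    from sklearn.cluster import DBSCAN
    from sklearn.preprocessing import StandardScaler

    def cluster_per_day(df):
        """Run DBSCAN separately on each day's points; label -1 marks outliers."""
        labels_by_day = {}
        for day, group in df.groupby('day'):
            # Scale the features for that day, as in the scikit-learn example
            X = StandardScaler().fit_transform(group[['x', 'y']].values)
            db = DBSCAN(eps=0.3, min_samples=10).fit(X)
            labels_by_day[day] = db.labels_
        return labels_by_day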

  1. Even at O(n log n), memory bloats up no matter how I run it. I understand there is no way to pre-specify the number of "labels" or clusters - so what would work best here?

  2. Also, from an optimization point of view, many of these data points have exactly the same values (think of them as cluster points repeated) - can I use this information to pre-process the data ahead of feeding it to DBSCAN? (A rough sketch of this is shown after this list.)

  3. I read a thread on using "canopy preclustering" to compress the data, as in vector quantization, ahead of DBSCAN (noting that the method can be equally expensive computationally) - can I use something similar to pre-process the data? Or what about a "parallel DBSCAN"?
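For question 2, here is a rough sketch of collapsing exact duplicates and letting DBSCAN count them through sample_weight, so min_samples still sees every original point. This assumes a scikit-learn version whose DBSCAN.fit accepts sample_weight and a NumPy version that supports np.unique(..., axis=0):

    import numpy as np
    from sklearn.cluster import DBSCAN

    def fit_deduplicated(X, eps=0.3, min_samples=10):
        # Collapse exact duplicate rows; 'counts' holds each point's multiplicity
        unique_pts, inverse, counts = np.unique(X, axis=0,
                                                return_inverse=True,
                                                return_counts=True)
        db = DBSCAN(eps=eps, min_samples=min_samples).fit(
            unique_pts, sample_weight=counts)
        # Map labels computed on the unique points back onto the original rows
        return db.labels_[inverse]

If most points are repeats, this should shrink both the neighbour queries and the memory footprint, while the weighted counts keep the density estimates close to what clustering the full data would give.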

What I have considered doing:

  • Partitioning: cluster 1 day (or less) at a time
  • Sampling: break the data set randomly into 10 parts and process them individually (a rough sketch follows this list)
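A rough sketch of the sampling idea from the list above, assuming X is a NumPy array of features: shuffle the rows, split them into 10 parts, and cluster each part separately. Labels from different parts are independent and not directly comparable:

    import numpy as np
    from sklearn.cluster import DBSCAN

    def cluster_in_parts(X, n_parts=10, eps=0.3, min_samples=10):
        rng = np.random.RandomState(0)          # fixed seed for repeatability
        idx = rng.permutation(len(X))
        results = []
        for part in np.array_split(idx, n_parts):
            db = DBSCAN(eps=eps, min_samples=min_samples).fit(X[part])
            results.append((part, db.labels_))  # row indices and their labels
        return results

Splitting randomly does dilute local density, so eps and min_samples may need retuning on the smaller samples.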
