scipy - Python nearest neighbour

scipy - Python nearest neighbour - coordinates -

i wanted check using scipy's kd tree correctly because appears slower simple bruteforce.

i had 3 questions regarding this:

q1.

if create following test data:

nplen = 1000000 # wgs84 lat/long point = [51.349,-0.19] # contains wgs84 lat/long points = np.ndarray.tolist(np.column_stack(         [np.round(np.random.randn(nplen)+51,5),          np.round(np.random.randn(nplen),5)]))

and create 3 functions:

def kd_test(points,point):     """ kd tree"""     return points[spatial.kdtree(points).query(point)[1]]  def ckd_test(points,point):     """ c implementation of kd tree"""     return points[spatial.ckdtree(points).query(point)[1]]  def closest_math(points,point):     """ simple angle"""     return (min((hypot(x2-point[1],y2-point[0]),y2,x2) y2,x2 in points))[1:3]

i expect ckd tree fastest, - running this:

print("co-ordinate: ", f(points,point)) print("index: ", points.index(list(f(points,point)))) %timeit f(points,point)

result times - simple bruteforce method faster:

closest_math: 1 loops, best of 3: 3.59 s per loop ckd_test: 1 loops, best of 3: 13.5 s per loop kd_test: 1 loops, best of 3: 30.9 s per loop

is because using wrong - somehow?

q2.

i assume ranking (rather distance) of closest points 1 still needs project data. however, seems projected , un-projected points give me same nearest neighbour:

def proj_list(points,               inproj = proj(init='epsg:4326'),               outproj = proj(init='epsg:27700')):     """ projected geo coordinates"""     return [list(transform(inproj,outproj,x,y)) y,x in points] proj_points = proj_list(points) proj_point = proj_list([point])[0]

is because spread of points not big enough introduce distortion? re-ran few times , still got same index out of projected , un-projected lists being returned.

q3.

is faster project points (like above) , calculate hypotenuse distance compared calculating haversine or vincenty distance on (un-projected) latitude/longitudes? option more accurate? ran small test:

from math import * def haversine(origin,               destination):     """     find distance between pair of lat/lng coordinates     """     lat1, lon1, lat2, lon2 = map(radians, [origin[0],origin[1],destination[0],destination[1]])     dlon = lon2 - lon1     dlat = lat2 - lat1     = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2     c = 2 * asin(sqrt(a))     r = 6371000  # metres     return (c * r)  def closest_math_unproj(points,point):     """ haversine on unprojected """     return (min((haversine(point,pt),pt[0],pt[1]) pt in points))  def closest_math_proj(points,point):     """ simple angle since projected"""     return (min((hypot(x2-point[1],y2-point[0]),y2,x2) y2,x2 in points))

results:

so seems projecting , doing distance faster not - however, not sure method bring more accurate results.

testing on online vincenty calculation seems projected co-ordinates way go:

q1.

the reason apparent inefficiency of k-d tree quite simple: measuring both construction , querying of k-d tree @ once. not how or should use k-d tree: should construct once. if measure querying, time taken reduces mere tens of milliseconds (vs seconds using brute-force approach).

q2.

this depend on spatial distribution of actual data being used , projection being used. there might slight differences based on how efficient implementation of k-d tree @ balancing constructed tree. if querying single point, result deterministic , unaffected distribution of points anyway.

with sample data using, has strong central symmetry, , map projection (transverese mercator), difference should negligible.

q3.

technically, answer question trivial: using haversine formula geographic distance measurement both more accurate , slower. whether tradeoff between accuracy , speed warranted depends heavily on use case , spatial distribution of data (mostly on spatial extent, obviously).

if spatial extent of points on small, regional side, using suitable projection , simple euclidean distance measure might accurate enough use case , faster using haversine formula.

Search This Blog

Ben

scipy - Python nearest neighbour - coordinates -

q1.

q2.

q3.

Comments

Post a Comment

Popular posts from this blog

sublimetext3 - what keyboard shortcut is to comment/uncomment for this script tag in sublime -

post - imageshack API cURL -

dataset - MPAndroidchart returning no chart Data available -