scipy - Python nearest neighbour - coordinates -
i wanted check using scipy's kd tree correctly because appears slower simple bruteforce.
i had 3 questions regarding this:
q1.
if create following test data:
nplen = 1000000 # wgs84 lat/long point = [51.349,-0.19] # contains wgs84 lat/long points = np.ndarray.tolist(np.column_stack( [np.round(np.random.randn(nplen)+51,5), np.round(np.random.randn(nplen),5)]))
and create 3 functions:
def kd_test(points,point): """ kd tree""" return points[spatial.kdtree(points).query(point)[1]] def ckd_test(points,point): """ c implementation of kd tree""" return points[spatial.ckdtree(points).query(point)[1]] def closest_math(points,point): """ simple angle""" return (min((hypot(x2-point[1],y2-point[0]),y2,x2) y2,x2 in points))[1:3]
i expect ckd tree fastest, - running this:
print("co-ordinate: ", f(points,point)) print("index: ", points.index(list(f(points,point)))) %timeit f(points,point)
result times - simple bruteforce method faster:
closest_math: 1 loops, best of 3: 3.59 s per loop ckd_test: 1 loops, best of 3: 13.5 s per loop kd_test: 1 loops, best of 3: 30.9 s per loop
is because using wrong - somehow?
q2.
i assume ranking (rather distance) of closest points 1 still needs project data. however, seems projected , un-projected points give me same nearest neighbour:
def proj_list(points, inproj = proj(init='epsg:4326'), outproj = proj(init='epsg:27700')): """ projected geo coordinates""" return [list(transform(inproj,outproj,x,y)) y,x in points] proj_points = proj_list(points) proj_point = proj_list([point])[0]
is because spread of points not big enough introduce distortion? re-ran few times , still got same index out of projected , un-projected lists being returned.
q3.
is faster project points (like above) , calculate hypotenuse distance compared calculating haversine or vincenty distance on (un-projected) latitude/longitudes? option more accurate? ran small test:
from math import * def haversine(origin, destination): """ find distance between pair of lat/lng coordinates """ lat1, lon1, lat2, lon2 = map(radians, [origin[0],origin[1],destination[0],destination[1]]) dlon = lon2 - lon1 dlat = lat2 - lat1 = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2 c = 2 * asin(sqrt(a)) r = 6371000 # metres return (c * r) def closest_math_unproj(points,point): """ haversine on unprojected """ return (min((haversine(point,pt),pt[0],pt[1]) pt in points)) def closest_math_proj(points,point): """ simple angle since projected""" return (min((hypot(x2-point[1],y2-point[0]),y2,x2) y2,x2 in points))
results:
so seems projecting , doing distance faster not - however, not sure method bring more accurate results.
testing on online vincenty calculation seems projected co-ordinates way go:
q1.
the reason apparent inefficiency of k-d tree quite simple: measuring both construction , querying of k-d tree @ once. not how or should use k-d tree: should construct once. if measure querying, time taken reduces mere tens of milliseconds (vs seconds using brute-force approach).
q2.
this depend on spatial distribution of actual data being used , projection being used. there might slight differences based on how efficient implementation of k-d tree @ balancing constructed tree. if querying single point, result deterministic , unaffected distribution of points anyway.
with sample data using, has strong central symmetry, , map projection (transverese mercator), difference should negligible.
q3.
technically, answer question trivial: using haversine formula geographic distance measurement both more accurate , slower. whether tradeoff between accuracy , speed warranted depends heavily on use case , spatial distribution of data (mostly on spatial extent, obviously).
if spatial extent of points on small, regional side, using suitable projection , simple euclidean distance measure might accurate enough use case , faster using haversine formula.
Comments
Post a Comment