Finding optimal nearest neighbour pairs

112
April 04, 2022, at 5:40 PM

Goal

I am writing a "colocalization" script to identify unique co-localized pairs of coordinates between two sets of data. My data is quite large with <100k points in each set so performance is pretty important.

For example, I have two sets of points:

import numpy as np
points_a = np.array([[1, 1],[2, 2],[3, 3],[6, 6]])
points_b = np.array([[1, 1],[2, 3],[3, 5],[6, 6], [7,6]]) # may be longer than points_a

For each point in points_a I want to find the nearest point in points_b. However, I don't want any point in points_b used in more than one pair. I can easily find the nearest neighbors using NearestNeighbors or one of the similar routines:

from sklearn.neighbors import NearestNeighbors
neigh = NearestNeighbors(n_neighbors=1)
neigh.fit(points_b)
distances, indices = neigh.kneighbors(points_a)
print(indices)
>>> [0, 1, 1, 3]

As above, this can give me a solution where a point in point_b is used twice. I would like to instead find the solution where each point is used once while minimizing the total distance across all pairs. In the above case:

[0, 1, 2, 3]

I figure a start would be to use NearestNeighbors or similar to find nearest neighbor candidates:

from scipy.spatial import KDTree
max_search_r = 3
from sklearn.neighbors import NearestNeighbors
neigh = NearestNeighbors(n_neighbors=1)
neigh.fit(points_b)
distances, indices = neigh.radius_neighbors(points_a, max_search_radius)
print(distances)
print(indices)
>>>[[0, 2,23], [1.41, 1], [2.82, 1, 2], [0, 1]]
>>>[[0, 1], [0, 1], [0, 1, 2], [0, 1]]

This shrinks down the overall search parameters but I am unclear how I can then compute the global optimum. I stumbled across this post: Find optimal unique neighbour pairs based on closest distance but the solution is for only a single set of points and I am unclear how I could translate the method to my case.

Any advice would be greatly appreciated!

Update

Hey all. With everyone's advice I found a somewhat working solution:

import numpy as np
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import csr_matrix, csgraph

def colocalize_points(points_a: np.ndarray, points_b: np.ndarray, r: int):
    """ Find pairs that minimize global distance. Filters out anything outside radius `r` """
    neigh = NearestNeighbors(n_neighbors=1)
    neigh.fit(points_b)
    distances, b_indices = neigh.radius_neighbors(points_a, radius=r)
    # flatten and get indices for A. This will also drop points in A with no matches in range
    d_flat = np.hstack(distances) + 1
    b_flat = np.hstack(b_indices)
    a_flat = np.array([i for i, neighbors in enumerate(distances) for n in neighbors])
    # filter out A points that cannot be matched
    sm = csr_matrix((d_flat, (a_flat, b_flat)))
    a_matchable = csgraph.maximum_bipartite_matching(sm, perm_type='column')
    sm_filtered = sm[a_matchable != -1]
    # now run the distance minimizing matching
    row_match, col_match = csgraph.min_weight_full_bipartite_matching(sm_filtered)
    return row_match, col_match

Only issue I have is that by filtering the matrix with maximum_bipartite_matching I cannot be sure I truly have the best result since it just returns the first match. For example, if I have 2 points in A [[2,2][3,3]] whose only candidate match is [3,3], maximum_bipartite_matching will keep whichever appears first. So if [2,2] appears first in the matrix, [3,3] will be dropped despite being a better match.

Update 1

To address comments below, here is my reasoning why maximum_bipartite_matching does not give me the desired solution. Consider points:

points_a = np.array([(1, 1), (2, 2), (3, 3)])
points_b = np.array([(1, 1), (2, 2), (3, 5), (2, 3)])

The optimal a,b point pairing that minimizes distance will be:

[(1, 1): (1, 1),
(2, 2): (2, 2),
(3, 3): (2, 3)]

However if I run the following:

neigh = NearestNeighbors(n_neighbors=1)
neigh.fit(points_b)
distances, b_indices = neigh.radius_neighbors(points_a, radius=3)
# flatten and get indices for A. This will also drop points in A with no matches in range
d_flat = np.hstack(distances) + 1
b_flat = np.hstack(b_indices)
a_flat = np.array([i for i, neighbors in enumerate(distances) for n in neighbors])
# filter out A points that cannot be matched
sm = csr_matrix((d_flat, (a_flat, b_flat)))
a_matchable = csgraph.maximum_bipartite_matching(sm, perm_type='column')
print([(points_a[a], points_b[i]) for i, a in enumerate(a_matchable)])

I get solution:

[(1, 1): (1, 1),
(2, 2): (2, 2),
(3, 3): (3, 5)]

Swapping the last two points in points_b will give me the expected solution. This indicates to me that the algorithm is not taking the distance (weight) into account and instead just tries to maximize the number of connections. I could very well have made a mistake though so please let me know.

Rent Charter Buses Company
READ ALSO
Send raw hex and strings over sockets

Send raw hex and strings over sockets

I'm trying to send data over sockets with Python to trigger IDS rules

126
Why is my csv file not printing after I import it?

Why is my csv file not printing after I import it?

and still not receiving anything? It does not even come up with an error

77
How to determine if a pandas column type can be reduced from int64 to int32 or from float64 to float32? [duplicate]

How to determine if a pandas column type can be reduced from int64 to int32 or from float64 to float32? [duplicate]

I have a dataframe which is huge(8 gb)I am trying to find if i will loose any information if i downsize the columns from int64 to int32 or from float64 to float32

130
Folium popup does not appear python

Folium popup does not appear python

I have a geodataframe with Multistring GeometryI wanted to plot an interactive map, on which when I click on each line, the name and id would appear! I can see that mouse form would change but nothing will appear

84