Fast Python outer difference of list

112
April 01, 2022, at 11:30 AM

I want to compute the difference between every element in a Python list of equally long lists and put it into a Numpy array.

When I say difference between two lists I mean the number of differences between corresponding elements between the two lists. Here's an example difference function using a list comprehension:

def list_difference(list_a, list_b):
    len_lists = len(list_a)
    assert len_lists == len(list_b), "Lists must be the same length."
    return sum([list_a[i] != list_b[i] for i in range(len_lists)])

Then I call the difference function on every pair in my list of lists and put it into a numpy array. You could call it the outer difference of the list of lists, just like the outer product. I do it in naive for loops:

import numpy as np
import time
sequences = [
    ["A", "A", "A", "B", "C"],
    ["B", "A", "B", "A", "B"],
    ["B", "A", "C", "C", "B"],
    ["B", "A", "C", "C", "C"],
]
start = time.time()
n_seq = len(sequences)
dists = np.zeros((n_seq, n_seq))
for row in range(n_seq):
    for col in range(n_seq):
        if row >= col:
            continue
        dists[row, col] = list_difference(sequences[row], sequences[col])
dists += dists.T
print(dists)
print(f"Time: {time.time() - start} seconds")

The result is

[[0. 4. 4. 3.]
 [4. 0. 2. 3.]
 [4. 2. 0. 1.]
 [3. 3. 1. 0.]]
Time: 0.0003669261932373047 seconds

This example is quick enough on my computer (best time of three above). However, running this on 64 lists (sequences) each with length 1008 it takes 293.572190 seconds, which is a while. Is there a faster way to do this?

Attempts:

1 I tried putting the inner for loop in a list comprehension:

dists = np.zeros((n_seq, n_seq))
for row in range(n_seq):
    dist_row = [list_difference(sequences[row], sequences[col]) for col in range(n_seq) if row >= col]
    dists[row, n_seq-row-1:] = dist_row
dists += dists.T

But it actually made it slower, taking 0.000523 seconds (0.70 times faster). On my larger dataset it takes 319.2168769 seconds (0.92 times faster).

2 I wondered if doing the for loops in native Python and then copying to Numpy at the end would help.

_dists = [ [0]*n_seq for i in range(n_seq)]
for row in range(n_seq):
    for col in range(n_seq):
        if col <= row:
            continue
        _dists[row][col] = list_difference(sequences[row], sequences[col])
dists = np.array(_dists)
dists += dists.T

This takes 0.0001862 seconds, which is about twice as fast as the original code. On my larger dataset the speedup isn't as significant at 229.945951 seconds (1.28 times faster), but still something.

3 Just a thought that there's probably a way to do the two outer for loops directly in Numpy.

Answer 1

A much cleaner, simpler, and faster way to do it would be to use numpy broadcasting:

sequences = np.array(sequences)
dists = (sequences[:, None] != sequences).sum(axis=2)

Output:

>>> dists
array([[0, 4, 4, 3],
       [4, 0, 2, 3],
       [4, 2, 0, 1],
       [3, 3, 1, 0]])
Rent Charter Buses Company
READ ALSO
OpenCV VideoWriter Error &quot;dimensions too large for MPEG-4&quot;

OpenCV VideoWriter Error "dimensions too large for MPEG-4"

I have some frames (dimensions: 8192x2160) that are generated from concatenating two 4096x2160 frames side by sideWhen writing these frames to a video file using OpenCV VideoWriter, I get this error: dimensions too large for MPEG-4

169
What happens to selenium code simulating browser actiions when run in non GUI unix box? Will chrome windows open?

What happens to selenium code simulating browser actiions when run in non GUI unix box? Will chrome windows open?

I built my first selenium on a mac machine, so with help of chromedriver, I see a new tab opened by my script

131
Echarts line chart symbol formatter

Echarts line chart symbol formatter

above code print like this: https://istack

117
OSError: [WinError 740] The requested operation requires elevation (pytesseract.image_to_string)

OSError: [WinError 740] The requested operation requires elevation (pytesseract.image_to_string)

I am having a simple code that has an image called "1png" and I want to convert it from Image to Text using pytesseract

137