Is using “not in” faster than using “in” in Python3?

68
January 01, 2020, at 12:30 PM

Let's say we're solving a simple word count problem. There's a list and we're trying to find the word count of each word occurring in the list. Which pattern is faster here?

book_title =  ['great', 'expectations','the', 'adventures', 'of', 'sherlock','holmes','the','great','gasby','hamlet','adventures','of','huckleberry','fin']
word_count_dict = {}
pattern 1
for word in book_title:
    if word in word_count_dict:
        word_count_dict[word] += 1
    else:
        word_count_dict[word] = 1
pattern 2
for word in book_title:
    if word not in word_counter:
        word_counter[word] = 1
    else:
        word_counter[word] += 1
Answer 1

Six of one, half a dozen of the other. They should be roughly equivalent to each other - in computational terms, a not operation is nearly negligible (literally the cheapest possible operation), and the in operation in a hashtable like a dictionary runs in constant time (either the hash is there, or it's not). If we were dealing with a list, it would run in linear time, but still the between in and not in. See also computational complexity of python data structures.

So basically, use whichever makes your code more understandable.

That said, have you considered using a collections.Counter, a data structure specifically designed for this purpose?

import collections
book_title = ['great', 'expectations','the', 'adventures', 'of', 'sherlock','holmes','the','great','gasby','hamlet','adventures','of','huckleberry','fin']
word_counts = collections.Counter(book_title)
print(word_counts)
# Counter({'great': 2, 'the': 2, 'adventures': 2, 'of': 2, 'expectations': 1, 'sherlock': 1, 'holmes': 1, 'gasby': 1, 'hamlet': 1, 'huckleberry': 1, 'fin': 1})

You can typecast a collections.Counter to a dict if you need to, and in fact collections.Counter is a subclass of dict. It even has an .update() method specifically designed to work with other counters - if you add another book title, just feed it into a Counter and then .update() the original with that.

Answer 2

Using dis, you can look at the bytecode generated for each of your methods.

The only difference I could see running your code through the disassembler was right at in and not in, where the difference in bytecode was:

COMPARE_OP 7 (not in) or COMPARE_OP 6 (in)

And afterwards POP_JUMP_IF_FALSE (i.e. continue onto the next instruction for this condition)

All in all, both of the methods seems to have the same amount of instructions, regardless of the comparison returning true or false, and therefore most probably executes equally fast.

There might be, however, some underlying optimizations closer to CPU instructions which would cause one or the other methods to be faster, but I would consider that territory out of scope for this question. If this would be the case, then I believe a simple measure of execution time over a larger list would prove which one is faster.

Execution speed of both instructions, beneath Python bytecode, might differ between Python versions, build, OS or architecture. You'd probably be able to make a small change in the Python source code to make one or the other instruction execute faster.

Answer 3

They have approximately the same cost. Think of not in operator as being in operator applied first and then a logical not applied to that result(which is almost negligible).

For confirmation, here's a little experiment you can perform to measure execution time

from time import time
book_title =  ['hi']*100000+['there']*10000
word_count_dict1 = {}
word_count_dict2 = {}
start = time()
for word in book_title:
    if word in word_count_dict1:
        word_count_dict1[word] += 1
    else:
        word_count_dict1[word] = 1
print(time()-start)
start = time()
for word in book_title:
    if word not in word_count_dict2:
        word_count_dict2[word] = 1
    else:
        word_count_dict2[word] += 1
print(time()-start)

Output (may vary for you)

0.021044015884399414
0.02713179588317871
Answer 4

Basically, they have the same cost. From Python 3 reference

The operator not in is defined to have the inverse truth value of in.

Answer 5

You can check the timing of operation for both pattern and compare yourself.

import timeit
def pattern1():
  title = ['great', 'expectations','the', 'adventures', 'of', 'sherlock','holmes','the','great','gasby','hamlet','adventures','of','huckleberry','fin']
  counts = {}
  for word in title:
    if word in counts:
      counts[word] += 1
    else:
      counts[word] = 1
def pattern2():
  title = ['great', 'expectations','the', 'adventures', 'of', 'sherlock','holmes','the','great','gasby','hamlet','adventures','of','huckleberry','fin']
  counts = {}
  for word in title:
    if word not in counts:
      counts[word] = 1
    else:
      counts[word] += 1
sample1 = [timeit.timeit(pattern1, number=10000) for _ in range(10)]
sample2 = [timeit.timeit(pattern2, number=10000) for _ in range(10)]
print(sum(sample1) / len(sample1))
# 0.01713230140048836
print(sum(sample2) / len(sample2))
# 0.017954919600015273

As we can see, the difference is negligible.

READ ALSO
How do I downgrade my version of python from 3.7.5 to 3.6.5 on ubuntu

How do I downgrade my version of python from 3.7.5 to 3.6.5 on ubuntu

So currently, I have ubuntu 19And it comes by default with python 3

2149
Failing when inserting generated column of date or time type from a datetime/timstamp source

Failing when inserting generated column of date or time type from a datetime/timstamp source

A generated column is a column containing the result of an expression or function based on another fieldI want a simple date and time field derived from a datetime field, but to insert records I must adopt a workaround

101
What is the best way to load the bootstrap card with contents dynamically?

What is the best way to load the bootstrap card with contents dynamically?

I am a beginner in web developmentI need to create a page with bootstrap card dynamically by fetching values from database

97