Select rows from a DataFrame based on values in a column in pandas

4964
March 27, 2017, at 4:48 PM

How to select rows from a DataFrame based on values in some column in pandas?
In SQL I would use:

select * from table where colume_name = some_value. 

I tried to look at pandas documentation but did not immediately find the answer.

Answer 1

To select rows whose column value equals a scalar, some_value, use ==:

df.loc[df['column_name'] == some_value]

To select rows whose column value is in an iterable, some_values, use isin:

df.loc[df['column_name'].isin(some_values)]

Combine multiple conditions with &:

df.loc[(df['column_name'] == some_value) & df['other_column'].isin(some_values)]

To select rows whose column value does not equal some_value, use !=:

df.loc[df['column_name'] != some_value]

isin returns a boolean Series, so to select rows whose value is not in some_values, negate the boolean Series using ~:

df.loc[~df['column_name'].isin(some_values)]

For example,

import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
print(df)
#      A      B  C   D
# 0  foo    one  0   0
# 1  bar    one  1   2
# 2  foo    two  2   4
# 3  bar  three  3   6
# 4  foo    two  4   8
# 5  bar    two  5  10
# 6  foo    one  6  12
# 7  foo  three  7  14
print(df.loc[df['A'] == 'foo'])

yields

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

If you have multiple values you want to include, put them in a list (or more generally, any iterable) and use isin:

print(df.loc[df['B'].isin(['one','three'])])

yields

     A      B  C   D
0  foo    one  0   0
1  bar    one  1   2
3  bar  three  3   6
6  foo    one  6  12
7  foo  three  7  14

Note, however, that if you wish to do this many times, it is more efficient to make an index first, and then use df.loc:

df = df.set_index(['B'])
print(df.loc['one'])

yields

       A  C   D
B              
one  foo  0   0
one  bar  1   2
one  foo  6  12

or, to include multiple values from the index use df.index.isin:

df.loc[df.index.isin(['one','two'])]

yields

       A  C   D
B              
one  foo  0   0
one  bar  1   2
two  foo  2   4
two  foo  4   8
two  bar  5  10
one  foo  6  12
Answer 2

tl;dr

The pandas equivalent to

select * from table where column_name = some_value

is

table[table.column_name == some_value]

Multiple conditions:

table((table.column_name == some_value) | (table.column_name2 == some_value2))

or

table.query('column_name == some_value | column_name2 == some_value2')

Code example

import pandas as pd
# Create data set
d = {'foo':[100, 111, 222], 
     'bar':[333, 444, 555]}
df = pd.DataFrame(d)
# Full dataframe:
df
# Shows:
#    bar   foo 
# 0  333   100
# 1  444   111
# 2  555   222
# Output only the row(s) in df where foo is 222:
df[df.foo == 222]
# Shows:
#    bar  foo
# 2  555  222

In the above code it is the line df[df.foo == 222] that gives the rows based on the column value, 222 in this case.

Multiple conditions are also possible:

df[(df.foo == 222) | (df.bar == 444)]
#    bar  foo
# 1  444  111
# 2  555  222

But at that point I would recommend using the query function, since it's less verbose and yields the same result:

df.query('foo == 222 | bar == 444')
Answer 3

I find the syntax of the previous answers to be redundant and difficult to remember. Pandas introduced the query() method in v0.13 and I much prefer it. For your question, you could do df.query('col == val')

Reproduced from http://pandas.pydata.org/pandas-docs/version/0.17.0/indexing.html#indexing-query

In [167]: n = 10
In [168]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))
In [169]: df
Out[169]: 
          a         b         c
0  0.687704  0.582314  0.281645
1  0.250846  0.610021  0.420121
2  0.624328  0.401816  0.932146
3  0.011763  0.022921  0.244186
4  0.590198  0.325680  0.890392
5  0.598892  0.296424  0.007312
6  0.634625  0.803069  0.123872
7  0.924168  0.325076  0.303746
8  0.116822  0.364564  0.454607
9  0.986142  0.751953  0.561512
# pure python
In [170]: df[(df.a < df.b) & (df.b < df.c)]
Out[170]: 
          a         b         c
3  0.011763  0.022921  0.244186
8  0.116822  0.364564  0.454607
# query
In [171]: df.query('(a < b) & (b < c)')
Out[171]: 
          a         b         c
3  0.011763  0.022921  0.244186
8  0.116822  0.364564  0.454607

You can also access variables in the environment by prepending an @.

exclude = ('red', 'orange')
df.query('color not in @exclude')
Answer 4

Here is a simple example

from pandas import DataFrame
# Create data set
d = {'Revenue':[100,111,222], 
     'Cost':[333,444,555]}
df = DataFrame(d)

# mask = Return True when the value in column "Revenue" is equal to 111
mask = df['Revenue'] == 111
print mask
# Result:
# 0    False
# 1     True
# 2    False
# Name: Revenue, dtype: bool

# Select * FROM df WHERE Revenue = 111
df[mask]
# Result:
#    Cost    Revenue
# 1  444     111
Answer 5

I just tried editing this, but I wasn't logged in, so I'm not sure where my edit went. I was trying to incorporate multiple selection. So I think a better answer is:

For a single value, the most straightforward (human readable) is probably:

df.loc[df['column_name'] == some_value]

For lists of values you can also use:

df.loc[df['column_name'].isin(some_values)]

For example,

import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
               'B': 'one one two three two two one three'.split(),
               'C': np.arange(8), 'D': np.arange(8) * 2})
print(df)
#      A      B  C   D
# 0  foo    one  0   0
# 1  bar    one  1   2
# 2  foo    two  2   4
# 3  bar  three  3   6
# 4  foo    two  4   8
# 5  bar    two  5  10
# 6  foo    one  6  12
# 7  foo  three  7  14
print(df.loc[df['A'] == 'foo'])

yields

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

If you have multiple criteria you want to select against, you can put them in a list and use 'isin':

print(df.loc[df['B'].isin(['one','three'])])

yields

      A      B  C   D
0  foo    one  0   0
1  bar    one  1   2
3  bar  three  3   6
6  foo    one  6  12
7  foo  three  7  14

Note, however, that if you wish to do this many times, it is more efficient to make A the index first, and then use df.loc:

df = df.set_index(['A'])
print(df.loc['foo'])

yields

  A      B  C   D
foo    one  0   0
foo    two  2   4
foo    two  4   8
foo    one  6  12
foo  three  7  14
Answer 6
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
df[df['A']=='foo']
OUTPUT:
   A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14
Answer 7

If you came here looking to select rows from a dataframe by including those whose column's value is NOT any of a list of values, here's how to flip around unutbu's answer for a list of values above:

df.loc[~df['column_name'].isin(some_values)]

(To not include a single value, of course, you just use the regular not equals operator, !=.)

Example:

import pandas as pd
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split()})
print(df)

gives us

     A      B
0  foo    one
1  bar    one
2  foo    two
3  bar  three
4  foo    two
5  bar    two
6  foo    one
7  foo  three    

To subset to just those rows that AREN'T one or three in column B:

df.loc[~df['B'].isin(['one', 'three'])]

yields

     A    B
2  foo  two
4  foo  two
5  bar  two
Answer 8

To append to this famous question (though a bit too late): You can also do df.groupby('column_name').get_group('column_desired_value').reset_index() to make a new data frame with specified column having a particular value. E.g.

import pandas as pd
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split()})
print("Original dataframe:")
print(df)
b_is_two_dataframe = pd.DataFrame(df.groupby('B').get_group('two').reset_index()).drop('index', axis = 1) 
#NOTE: the final drop is to remove the extra index column returned by groupby object
print('Sub dataframe where B is two:')
print(b_is_two_dataframe)

Run this gives:

Original dataframe:
     A      B
0  foo    one
1  bar    one
2  foo    two
3  bar  three
4  foo    two
5  bar    two
6  foo    one
7  foo  three
Sub dataframe where B is two:
     A    B
0  foo  two
1  foo  two
2  bar  two
READ ALSO
pygame font.render() cannot update when number of digits changes

pygame font.render() cannot update when number of digits changes

I have a basic health countIt ticks down with each tick of the game, it starts at 100 but when it ticks from 100 (3 digits) it changes to 999

283
Grouping elements in a list of lists

Grouping elements in a list of lists

I have a few records which reads as follows:

353
Cython: error adding complex[double] to double

Cython: error adding complex[double] to double

According to libcpp/complexpxd adding T to complex[T] is supported:

341
How to iterate PyQt4 list with images

How to iterate PyQt4 list with images

I have created an global list named

490