Python & Pandas: How to query if a list-type column contains something?

I have a dataframe that contains info about movies. It has a column called genre, which contains a list of the genres each movie belongs to. For example:

    df['genre']
    ## returns
    0    ['comedy', 'sci-fi']
    1    ['action', 'romance', 'comedy']
    2    ['documentary']
    3    ['crime', 'horror']
    ...

I want to know how I can query the df so that it returns the movies that belong to a certain genre.

For example, something like df['genre'].contains('comedy') that returns rows 0 and 1.

I know for a list, I can do things like

'comedy' in ['comedy', 'sci-fi']

but in pandas I didn't find anything similar. The only thing I know of is df['genre'].str.contains(), but it doesn't work for a list-type column.

4 Answers

You can use apply to create a mask and then filter with boolean indexing:

    mask = df.genre.apply(lambda x: 'comedy' in x)
    df1 = df[mask]
    print (df1)
                           genre
    0           [comedy, sci-fi]
    1  [action, romance, comedy]
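
One thing to watch for with the lambda above: if any row holds a missing value (NaN) instead of a list, `'comedy' in x` raises a TypeError. Below is a minimal sketch of a guarded version. The sample frame is modeled on the question, and the extra NaN row and the helper name `has_genre` are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

# Sample data modeled on the question, plus one hypothetical row
# with a missing value to show the guard in action.
df = pd.DataFrame({
    'genre': [
        ['comedy', 'sci-fi'],
        ['action', 'romance', 'comedy'],
        ['documentary'],
        ['crime', 'horror'],
        np.nan,
    ]
})


def has_genre(value, genre):
    # Anything that is not a list/tuple/set (e.g. NaN) counts as "no match".
    return isinstance(value, (list, tuple, set)) and genre in value


mask = df['genre'].apply(lambda x: has_genre(x, 'comedy'))
df1 = df[mask]
print(df1)
```

The mask is an ordinary boolean Series, so `df[~mask]` would select the rows that do not contain the genre.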

using sets

    df.genre.map(set(['comedy']).issubset)

    0     True
    1     True
    2    False
    3    False
    dtype: bool

    df.genre[df.genre.map(set(['comedy']).issubset)]

    0             [comedy, sci-fi]
    1    [action, romance, comedy]
    dtype: object

presented in a way I like better

    comedy = set(['comedy'])
    iscomedy = comedy.issubset
    df[df.genre.map(iscomedy)]

more efficient

    comedy = set(['comedy'])
    iscomedy = comedy.issubset
    df[[iscomedy(l) for l in df.genre.values.tolist()]]
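
The set approach also extends naturally to querying several genres at once. The following is a sketch under assumptions: the sample frame mirrors the question, and the query set `wanted` and the names `has_all` / `has_any` are hypothetical.

```python
import pandas as pd

# Sample data modeled on the question.
df = pd.DataFrame({
    'genre': [
        ['comedy', 'sci-fi'],
        ['action', 'romance', 'comedy'],
        ['documentary'],
        ['crime', 'horror'],
    ]
})

# Hypothetical query: movies tagged with these genres.
wanted = {'comedy', 'romance'}

# True where the row's list contains every wanted genre.
has_all = df['genre'].map(wanted.issubset)

# True where the row's list shares at least one genre with the query.
has_any = df['genre'].map(lambda genres: not wanted.isdisjoint(genres))

print(df[has_all])
print(df[has_any])
```

Both `set.issubset` and `set.isdisjoint` accept any iterable, so the lists in the column do not need to be converted to sets first.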

using str in two passes
slow! and not perfectly accurate!

df[df.genre.str.join(' ').str.contains('comedy')]
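
To illustrate why this is "not perfectly accurate": joining the list into one string turns the membership test into a substring search, so a genre that merely contains the search term also matches. A small sketch, using a made-up 'tragicomedy' row as the assumption:

```python
import pandas as pd

# Hypothetical data: 'tragicomedy' contains the substring 'comedy'.
df = pd.DataFrame({
    'genre': [
        ['comedy', 'sci-fi'],
        ['tragicomedy'],
        ['documentary'],
    ]
})

# Substring search on the joined string: row 1 is a false positive.
substring_match = df['genre'].str.join(' ').str.contains('comedy')

# Exact list membership for comparison.
exact_match = df['genre'].apply(lambda genres: 'comedy' in genres)

print(pd.DataFrame({'substring': substring_match, 'exact': exact_match}))
```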

According to the source code, you can use .str.contains(..., regex=False).

That was my initial thought which unfortunately doesn't work as it returns True even for partial string matches.
– Nickil Maveli
Jan 7 '17 at 8:29

A complete example:

    import pandas as pd

    data = pd.DataFrame([[['foo', 'bar']], [['bar', 'baz']]], columns=['list_column'])
    print(data)
      list_column
    0  [foo, bar]
    1  [bar, baz]

    filtered_data = data.loc[
        lambda df: df.list_column.apply(
            lambda l: 'foo' in l
        )
    ]
    print(filtered_data)
      list_column
    0  [foo, bar]
