filter a DataFrame using complete word only

filter a DataFrame using complete word only

Problem Description:

I have a large DataFrame (named: complete) of data (only two columns). I want to filter it using complete words only, instead of substrings. Ex:

complete dataframe:

commentsentiment
fast running0.9
heavily raining0.5
in the house0.1
coming in0.0
rubbing it-0.5

if I set a substring to filter my table:

substring = 'in'
comp = complete[complete.apply(lambda row: row.astype(str).str.contains(substring, case=False).any(), axis=1)]

output comp:

commentsentiment
fast running0.9
heavily raining0.5
in the house0.1
coming in0.0
rubbing it-0.5

It returns the same DF since all words do have "in" as a substring.

My desired output:

commentsentiment
in the house0.1
coming in0.0

Filter it only if the substring is found as a word, not as a substring.

¿How can this be done?

Solution – 1

Use word boundaries (b) in your pattern:

substring = 'in'
comp = complete[complete['comment'].str.contains(fr'b{substring}b', case=False)]

If you have many columns to check, aggregate after (this is much more efficient than apply with axis=1):

comp = complete[complete[list_of_cols].apply(lambda s: s.str.contains(fr'b{substring}b', case=False)).any(axis=1)]

Solution – 2

Another possible solution, which uses lookaround and w means word character and !w means no word character. See demonstration to understand the logic.

sub = 'in'
df.loc[df.comment.str.contains(fr'(?<!w){sub}(?!w)'), :]

Output:

        comment  sentiment
2  in the house        0.1
3     coming in        0.0
Rate this post
We use cookies in order to give you the best possible experience on our website. By continuing to use this site, you agree to our use of cookies.
Accept
Reject