How to efficiently find a string that is contained in another string from a different group in pandas

How to efficiently find a string that is contained in another string from a different group in pandas

Problem Description:

given the following table sample:

identifiermatched_string
occupationmanager
occupationmanager
skillmanager
departmentmarketing manager
skillmarketing

I would like to find the cases where a matched_string is contained in another ‘matched_string’ and have a different identifier.
In the end I would like to get a dataframe that lists these matches.

I have this code so far but I feel it is super inefficient:

#df5 is the dataframe I want to search through

df4 = pd.DataFrame()

#find overlaps from matched_strings -> where matched string is contained in another matched string and has a different identifier
for index, row in df5.iterrows():
    for index2, row2 in df5.iterrows():
        if row["matched_string"] in row2["matched_string"] and row["identifier"] != row2["identifier"]:
            df4 = df4.append(row1)
            df4 = df4.append(row2)      

Would someone have an idea on how to make this more efficient?

Solution – 1

Wouldn’t an aggregation as set per matched_string work for you?

df.groupby('matched_string')['identifier'].agg(set)

Output:

matched_string
manager              {occupation, skill}
marketing                        {skill}
marketing manager           {department}
Name: identifier, dtype: object

If you want to use it to filter your DataFrame and keep only the rows with multiple identifiers per matched_string:

s = df.groupby('matched_string')['identifier'].agg(set)

idx = s[s.str.len().gt(1)].index

out = df.loc[df['matched_string'].isin(idx)]

Output:

   identifier matched_string
0  occupation        manager
1  occupation        manager
2       skill        manager
Rate this post
We use cookies in order to give you the best possible experience on our website. By continuing to use this site, you agree to our use of cookies.
Accept
Reject