Merge dataframes based on substrings

Merge dataframes based on substrings

Problem Description:

I want to merge/join two large dataframes while the ‘id’ column the dataframe on the right is assumed to be substrings of the left ‘id’ column.

For illustration purposes:

import pandas as pd
import numpy as np


This is df1:

          id  numbers  add_info
0        abc      1.0    3123.0
1     adcfek      2.0       NaN
2  acefeasdq      NaN  312441.0

And this is df2:

    matching  needed_info  other_info
0        adc            1          22
1        fek            2          33
2  acefeasdq            3          11
3      abcef            4          44
4       acce            5          55
5        dcf            6          66

And this is the desired output:

          id  numbers  add_info  needed_info  other_info
0        abc      1.0    3123.0          NaN         NaN
1     adcfek      2.0       NaN          2.0        33.0
2     adcfek      2.0       NaN          6.0        66.0
3  acefeasdq      NaN  312441.0          3.0        11.0

So as described, I only want to merge the additional columns only when the ‘matching’ column is a substring of the ‘id’ column. If it is the other way around, e.g. ‘abc’ is a substring of ‘adcef’, nothing should happen.

In my data, a lot of the matches between df1 and df2 are actually exact, like the ‘acefeasdq’ row. But there are cases where ‘id’s contain multiple ‘matching’s. For the moment, it is okish to ignore these cases but I’d like to learn how I can tackle this issue. And additionally, is it possible to mark out the rows that are merged based on substrings and the rows that are merged exactly?

Solution – 1

If your processing can handle CROSS JOIN (problematic with large datasets), then you could cross join and then delete/filter only those you want.

map= cross.apply(lambda x: str(x['matching']) in str(x['id']), axis=1) #create map of booleans
final = cross[map] #get only those where condition was met

Solution – 2

You can use pd.merge(how='cross') to create a dataframe containing all combinations of the rows. And then filter the dataframe using a boolean series:

df = pd.merge(df1, df2, how="cross")
include_row = df.apply(lambda row: row.matching in, axis=1)
filtered = df.loc[include_row]


Rate this post
We use cookies in order to give you the best possible experience on our website. By continuing to use this site, you agree to our use of cookies.