Why does the assign method in pandas method chaining behave differently when it is applied after a groupby?

Problem Description:

I am trying to chain some methods in pandas, but it seems that the order of the methods matters. Let me explain this with the mpg data.

In the two options below, I have changed the position of the assign method. In option 1 it comes before the groupby and works as expected. In option 2 it comes after the groupby and produces the wrong output. In R/tidyverse I could simply call ungroup() and use mutate() either before or after the group by, and it would still produce the same output.

import pandas as pd
import seaborn as sns
df = sns.load_dataset("mpg")

Option 1

(
 df
 .assign(origin=df.origin.map({'europe':'Europe'}).fillna(df.origin))
 .query("origin=='Europe' & model_year==80")
 .groupby(['origin','cylinders'],dropna=False)
 .mpg
 .sum()
 .reset_index()
)

Option 2

(
 df
 .query("origin=='europe' & model_year==80")
 .groupby(['origin','cylinders'],dropna=False)
 .mpg
 .sum()
 .reset_index()
 .assign(origin=df.origin.map({'europe':'Europe'}).fillna(df.origin))
)

The whole thing can also be done quite neatly without method chaining in pandas, but I am trying to see if I can make method chaining work for myself.

How can I ensure that the assign method produces the same output in both options, regardless of where it sits in the chain?

Solution – 1

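The key thing here is actually the .reset_index(). In option 2, the Series passed to assign is built from the original df, and assign aligns it with the transformed data on index labels, not on the values in the origin column. After .reset_index() the transformed data has the index 0 and 1, and in the original data the first two rows (index 0 and 1) have "usa" as their origin, so those values get applied to the transformed data.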

To illustrate, we can join (on the index):

tra = (
    df
    .query("origin=='europe' & model_year==80")
    .groupby(['origin', 'cylinders'], dropna=False)
    ['mpg'].sum()
    .reset_index()
)
tra.join(df['origin'], rsuffix='_2')

   origin  cylinders    mpg origin_2
0  europe          4  299.2      usa
1  europe          5   36.4      usa
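
The same effect can be reproduced without the mpg dataset. Here is a minimal sketch, assuming a small made-up frame (the names original, summary and misaligned and all values are invented for illustration). It shows that a Series passed to assign is matched by index label, so row 0 of the aggregated frame receives row 0 of the original frame:

```python
import pandas as pd

# Made-up stand-in for the original data; only the index labels matter here.
original = pd.DataFrame(
    {
        "origin": ["usa", "usa", "europe", "europe"],
        "mpg": [18.0, 15.0, 30.0, 36.0],
    }
)

# Filter and aggregate, then reset the index so it starts again at 0.
summary = (
    original
    .query("origin == 'europe'")
    .groupby("origin")["mpg"]
    .sum()
    .reset_index()
)

# The Series comes from `original`, so alignment uses its index labels:
# summary's row 0 is paired with original's row 0, which is "usa".
misaligned = summary.assign(origin=original["origin"])
print(misaligned)
```

Rows of original whose index labels do not appear in summary are simply dropped during the alignment, which is why no error is raised.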

To fix it, you could pass a lambda to assign, so that it works on the transformed data at that point in the chain rather than on the original df (as sammywemmy wrote in a comment):

tra.assign(origin=lambda df_:
    df_['origin'].map({'europe':'Europe'}).fillna(df_['origin'])
)

   origin  cylinders    mpg
0  Europe          4  299.2
1  Europe          5   36.4
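
To check that the callable form makes the position of assign irrelevant, here is a rough sketch on a small made-up frame (the helper rename_origin and all values are hypothetical, not taken from the mpg data). Because the function receives whatever frame exists at that step of the chain, it can be placed before or after the groupby:

```python
import pandas as pd

# Made-up data with the same columns the question uses.
df = pd.DataFrame(
    {
        "origin": ["usa", "usa", "europe", "europe", "europe"],
        "cylinders": [8, 8, 4, 4, 5],
        "model_year": [70, 70, 80, 80, 80],
        "mpg": [18.0, 15.0, 30.0, 40.0, 36.4],
    }
)

def rename_origin(df_):
    # Hypothetical helper: df_ is the frame at this point in the chain.
    return df_["origin"].map({"europe": "Europe"}).fillna(df_["origin"])

# assign before the groupby (as in option 1)
before = (
    df
    .assign(origin=rename_origin)
    .query("origin == 'Europe' and model_year == 80")
    .groupby(["origin", "cylinders"], dropna=False)["mpg"]
    .sum()
    .reset_index()
)

# assign after the groupby (as in option 2)
after = (
    df
    .query("origin == 'europe' and model_year == 80")
    .groupby(["origin", "cylinders"], dropna=False)["mpg"]
    .sum()
    .reset_index()
    .assign(origin=rename_origin)
)

print(before)
print(after)
```

Note that the query string still has to match the spelling of origin at that step ('Europe' after the rename, 'europe' before it).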