python: Create a multiindex pandas DF based on condition of column names

python: Create a multiindex pandas DF based on condition of column names

Problem Description:

I have a dataframe as follows:

arrays = [np.array(["berlin", "berlin", "paris", "paris", "rome", "rome", "seville", "seville"]),
          np.array(["one", "two", "one", "two", "one", "two", "one", "two"])]
df = pd.DataFrame(np.random.randn(8, 4), index=arrays, columns = ['mike','ana','manu','analia'])

it has a multiindex in the rows.
I would like to transform that DF into another one that would also have a multindex in the columns.

the function could be generalized as this:

def sortit(colname):
    if colname.startswith('m'):
        return 'm'
    elif colname.startswith('m'): 
        return 'a'

The expected output is the following:

arrays = [np.array(["berlin", "berlin", "paris", "paris", "rome", "rome", "seville", "seville"]),
          np.array(["one", "two", "one", "two", "one", "two", "one", "two"])]
tuples_i = list(zip(*arrays))
index_rows = pd.MultiIndex.from_tuples(tuples_i, names=["city", "number"])
arrays2 = [np.array(["m","m", "a","a"]),
          np.array(['mike','manu','ana','analia'])]
tuples_c = list(zip(*arrays2))
print(tuples)
index_columns = pd.MultiIndex.from_tuples(tuples_c, names=["department", "name"])
df = pd.DataFrame(np.random.randn(8, 4), index=index_rows, columns = index_columns)

df

two important notes.

My starting point is the dataframe with multindex in the rows and no multiindex in the cols. I can not change that.

The method that illustrates here to which multiindex belong every name (column name) is a straight forward one, as example, the reality is that that function is way more complicated and time consuming, that’s why I would like to create once the multilevel col index to make queries later much faster.

Solution – 1

You can craft a new MultiIndex with MultiIndex.from_arrays:

idx = pd.MultiIndex.from_arrays([df.columns.str.extract('(.)', expand=False),
                                 df.columns],
                                names=['department', 'name'])

df.columns = idx

print(df.sort_index(level=0, axis=1))

Output:

department          a                   m          
name              ana    analia      manu      mike
berlin  one  0.465270 -0.549246  0.931020  0.027496
        two -2.156006 -2.053703  0.162281  0.741966
paris   one  0.084072  1.729949  1.366554  0.402933
        two  1.157244  1.762093 -1.808943 -1.737110
rome    one -0.009257 -0.457297 -0.479836 -2.483149
        two -0.593379 -0.012763 -1.491018 -0.439712
seville one -1.118433  0.029189 -0.805858 -0.342481
        two -0.389120 -0.390189 -1.260496 -0.010572

Solution – 2

Code

you can make multi index easily from making tuple

(df.set_axis(df.columns.map(lambda x: (x[0], x)), axis=1)
 .rename_axis(['department', 'name'], axis=1))

output:

department  m       a       m       a
name        mike    ana     manu    analia
berlin  one 0.6     -0.0    2.9     1.3
        two 1.3     0.4     0.0     -3.0
paris   one -0.5    -0.8    0.4     0.0
        two -0.6    -1.0    0.5     0.3
rome    one -1.5    0.2     -0.0    1.4
        two -1.5    -1.9    0.0     -0.0
seville one -1.3    1.3     0.7     0.5
        two -0.2    -0.2    -0.7    0.4
Rate this post
We use cookies in order to give you the best possible experience on our website. By continuing to use this site, you agree to our use of cookies.
Accept
Reject