Transform Pandas DataFrame of categorical variables to MultiIndex with count and proportion

December 29, 2017, at 03:35 AM

I have a Pandas DataFrame containing several categorical variables. For example:

import pandas as pd
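# The 'year' values are reconstructed to match the counts shown in the answers; their order is an assumption
d = {'grade':['A','B','C','A','B'],
     'year':['2013','2013','2013','2012','2012']}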
df = pd.DataFrame(d)

I would like to transform this to a MultiIndex DataFrame with the following properties:

  • First level index is the variable name (e.g. 'grade')
  • Second level index is the levels within the variable (e.g. 'A', 'B', 'C')
  • One column contains 'n', a count of the number of times the level appears
  • A second column contains 'proportion', the proportion represented by this level.

For example:

Could anyone suggest a method for creating this MultiIndex DataFrame?

Answer 1

You can try this:

              n  pct
year  2012  2.0  0.4
      2013  3.0  0.6
grade A     2.0  0.4
      B     2.0  0.4
      C     1.0  0.2
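
A minimal sketch of one way to build a table of this shape, assuming the question's df also has a 'year' column (the values used here are illustrative, chosen only to match the counts above). It relies on apply with pd.Series.value_counts and is not necessarily the exact approach this answer intended.

```python
import pandas as pd

# Assumed sample data: the 'year' values are illustrative, picked to give the counts 2 and 3.
df = pd.DataFrame({
    'grade': ['A', 'B', 'C', 'A', 'B'],
    'year': ['2013', '2013', '2013', '2012', '2012'],
})

# Count levels column by column; levels absent from a column become NaN,
# which is why the counts end up as floats.
counts = df.apply(pd.Series.value_counts)

# Move the column names into the outer index level and drop the NaN rows.
out = counts.unstack().dropna().to_frame('n')

# Divide each count by the total of its own variable.
out['pct'] = out['n'] / out.groupby(level=0)['n'].transform('sum')
print(out)
```

The NaN padding from apply is what turns the counts into floats; cast 'n' back with astype(int) if integer counts are preferred.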

Answer 2

Another way to do this is to use melt and groupby:

df_out = df.melt().groupby(['variable','value']).size().to_frame(name='n')
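# Series.sum(level=...) was removed in pandas 2.0, so group on the index level instead
df_out['proportion'] = df_out['n'].div(df_out['n'].groupby(level=0).sum(), level=0)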


                n  proportion
variable value               
grade    A      2         0.4
         B      2         0.4
         C      1         0.2
year     2012   2         0.4
         2013   3         0.6
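
To make the melt step less opaque, here is a small sketch of the intermediate long-format frame it produces. The 'year' values are again an assumption; 'variable' and 'value' are the default column names that melt assigns.

```python
import pandas as pd

# Assumed sample data; only 'grade' is given in full in the question.
df = pd.DataFrame({
    'grade': ['A', 'B', 'C', 'A', 'B'],
    'year': ['2013', '2013', '2013', '2012', '2012'],
})

# melt() stacks every column into two columns:
# 'variable' holds the original column name, 'value' holds the cell content.
long_df = df.melt()
print(long_df.head(3))

# Counting rows per (variable, value) pair gives the 'n' column.
n = long_df.groupby(['variable', 'value']).size()
print(n)
```

Because every original column ends up in the same two columns, a single groupby covers all variables at once, with no per-column loop.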

And, if you really want to get crazy and do it in a one-liner:

(df.melt().groupby(['variable','value']).size().to_frame(name='n')
  .pipe(lambda x: x.assign(proportion = x['n']/x.groupby(level=0)['n'].transform('sum'))))

Updated solution using @Wen's pct calculation:

(df.melt().groupby(['variable','value']).size().to_frame(name='n')
  .pipe(lambda x: x.assign(proportion = x['n'].div(x['n'].groupby(level=0).sum(), level=0))))

Answer 3

Step-by-step method:

# size() counts the rows per level, however many other columns df has
df1 = df.groupby("grade").size().to_frame('n')
df2 = df.groupby("year").size().to_frame('n')
df1['proportion'] = df1['n'] / df1['n'].sum()
df2['proportion'] = df2['n'] / df2['n'].sum()
df_new = pd.concat([df1, df2], keys=['grade', 'year'], names=['variable'])
  • When using concat, the keys argument supplies the labels for the outermost index level, and names= assigns a name to that new level.
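
As a small illustration of the keys and names arguments, here is a sketch with two hypothetical one-column frames (the frame names and values are made up for the example).

```python
import pandas as pd

# Two hypothetical frames indexed by the levels of each variable.
grades = pd.DataFrame({'n': [2, 2, 1]}, index=['A', 'B', 'C'])
years = pd.DataFrame({'n': [2, 3]}, index=['2012', '2013'])

# keys labels each input frame in a new outermost index level;
# names gives names to the resulting index levels.
combined = pd.concat([grades, years], keys=['grade', 'year'], names=['variable', 'level'])
print(combined)
```

Passing two names labels both index levels; passing only one, as in the answer above, names the outer level and leaves the inner one unnamed.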

Answer 4

The DataFrame can be created by building a frame for each variable in a loop and concatenating the results, but this seems inefficient. For example:

d_end = []
for c in df.columns:
    temp_df = pd.DataFrame(df[c].value_counts().rename('n'))
    temp_df['proportion'] = temp_df['n'] / temp_df['n'].sum()
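    temp_df['variable'] = c
    temp_df['level'] = temp_df.index
    # move the variable name and its levels into a two-level index
    temp_df = temp_df.set_index(['variable', 'level'])
    d_end.append(temp_df)
df_end = pd.concat(d_end,axis=0)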

I'm hoping someone can suggest a better way, avoiding the loop.

