Transform Pandas DataFrame of categorical variables to MultiIndex with count and proportion

December 29, 2017, at 03:35 AM

I have a Pandas DataFrame containing several categorical variables. For example:

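import pandas as pd
# The 'year' entry was cut off in the source; the values below are assumed,
# chosen to match the counts shown in the answers (2012: 2, 2013: 3).
d = {'grade':['A','B','C','A','B'],
     'year':['2013','2013','2013','2012','2012']}
df = pd.DataFrame(d)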

I would like to transform this to a MultiIndex DataFrame with the following properties:

  • First level index is the variable name (e.g. 'grade')
  • Second level index is the levels within the variable (e.g. 'A', 'B', 'C')
  • One column contains 'n', a count of the number of times the level appears
  • A second column contains 'proportion', the proportion represented by this level.

For example:

Could anyone suggest a method for creating this MultiIndex DataFrame?

Answer 1

You can try this:

              n  pct
year  2012  2.0  0.4
      2013  3.0  0.6
grade A     2.0  0.4
      B     2.0  0.4
      C     1.0  0.2
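
The code for this answer is not shown, only its output. As a rough sketch of one way to get a table of this shape (not necessarily what the answerer wrote), value_counts can be applied to every column and the result reshaped. The 'year' values are assumed, since the question's data is cut off; the float counts come from the NaN entries that appear before reshaping.

```python
import pandas as pd

df = pd.DataFrame({
    'grade': ['A', 'B', 'C', 'A', 'B'],
    'year': ['2013', '2013', '2013', '2012', '2012'],  # assumed values
})

# One column of counts per variable; levels that do not occur in a
# variable are NaN, which turns the counts into floats.
counts = df.apply(pd.Series.value_counts)

# Move the column labels into an outer index level and drop the NaN rows.
out = counts.unstack().dropna().to_frame('n')

# Share of each level within its own variable (outer index level).
out['pct'] = out['n'] / out.groupby(level=0)['n'].transform('sum')

print(out)
```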

Answer 2

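Another way to do this is to use melt and groupby:

# One row per (variable, value) pair, then count the rows in each group
df_out = df.melt().groupby(['variable','value']).size().to_frame(name='n')
# Divide each count by its variable's total; sum(level=0) was removed in pandas 2.0
df_out['proportion'] = df_out['n'].div(df_out['n'].groupby(level=0).sum(),level=0)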


                n  proportion
variable value               
grade    A      2         0.4
         B      2         0.4
         C      1         0.2
year     2012   2         0.4
         2013   3         0.6
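
For reference, here is a self-contained sketch of the melt and groupby approach that can be run as is. It assumes the 'year' values (the question's data is cut off) and uses groupby(level=...).transform('sum') for the per-variable totals, which is one of several equivalent ways to do the division.

```python
import pandas as pd

df = pd.DataFrame({
    'grade': ['A', 'B', 'C', 'A', 'B'],
    'year': ['2013', '2013', '2013', '2012', '2012'],  # assumed values
})

# Long format: one row per (variable, value) pair.
long_df = df.melt()

# Count how often each level occurs within each variable.
df_out = long_df.groupby(['variable', 'value']).size().to_frame(name='n')

# Total count per variable, repeated on every row of that variable.
totals = df_out.groupby(level='variable')['n'].transform('sum')
df_out['proportion'] = df_out['n'] / totals

print(df_out)
```

Within each variable the proportions add up to 1, which is a quick sanity check on the result.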

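And, if you really want to get crazy and do it in a one-liner:

# The opening line was cut off; it is restored here from the two-step version above
(df.melt().groupby(['variable','value']).size().to_frame(name='n')
  .pipe(lambda x: x.assign(proportion = x['n']/x.groupby(level=0)['n'].transform('sum'))))

Updated solution using @Wen's pct calculation:

(df.melt().groupby(['variable','value']).size().to_frame(name='n')
  .pipe(lambda x: x.assign(proportion = x['n'].div(x['n'].groupby(level=0).sum(),level=0))))

Answer 3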

Step-by-step method:

# Count the rows in each group; the one remaining column holds the counts
df1 = df.groupby("grade").count()
df2 = df.groupby("year").count()
df1.columns = ['n']
df2.columns = ['n']
# Proportion of each level within its own variable
df1['proportion'] = df1['n'] / df1['n'].sum()
df2['proportion'] = df2['n'] / df2['n'].sum()
df_new = pd.concat([df1, df2], keys=['grade', 'year'], names=['variable'])

  • When using concat, keys supplies the labels of the outermost index level, and names= gives that new level a name.
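
The step-by-step method names each column by hand. A possible generalization to any number of columns is sketched below; summarize is a hypothetical helper introduced here for illustration, and the 'year' values are assumed. Passing a dict to concat uses its keys as the outer index level.

```python
import pandas as pd

df = pd.DataFrame({
    'grade': ['A', 'B', 'C', 'A', 'B'],
    'year': ['2013', '2013', '2013', '2012', '2012'],  # assumed values
})

def summarize(col):
    """Count and proportion of each level in one column (hypothetical helper)."""
    n = col.value_counts().sort_index()
    return pd.DataFrame({'n': n, 'proportion': n / n.sum()})

# The dict keys (column names) become the outermost index level.
df_new = pd.concat(
    {c: summarize(df[c]) for c in df.columns},
    names=['variable', 'value'],
)

print(df_new)
```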

Answer 4

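The DataFrame can be created by stacking each variable in a loop, but this seems inefficient, e.g.:

d_end = []
for c in df.columns:
    # Count each level of column c
    temp_df = pd.DataFrame(df[c].value_counts().rename('n'))
    temp_df['proportion'] = temp_df['n'] / temp_df['n'].sum()
    temp_df['variable'] = c
    # Make 'variable' the first index level and collect the piece
    # (this step was missing in the source, so d_end stayed empty)
    d_end.append(temp_df.set_index('variable', append=True).swaplevel())
df_end = pd.concat(d_end,axis=0)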

I'm hoping someone can suggest a better way, avoiding the loop.
