Grouping by date and number of unique users for multiple variables

November 28, 2019, at 5:00 PM

I have a dataframe containing tweets. It has a datetime column, a unique user_id column, and then columns indicating whether the tweet belongs to a thematic category. In the end I'd like to visualize the result with a line graph.

The data looks as follows:

                 datetime             user_id  Meta  News & Media  Environment  ...
0     2019-05-08 07:16:02            21741359   NaN           NaN          1.0
1     2019-05-08 07:15:23          2785265103   NaN           NaN          1.0
2     2019-05-08 07:14:11           606785697   NaN           1.0          NaN
3     2019-05-08 07:13:42  718989200616529921   1.0           NaN          NaN
4     2019-05-08 07:13:27  939207240728350720   1.0           NaN          1.0
...                   ...                 ...   ...           ...          ...
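
For anyone following along, a small frame with this shape can be built as below. This is only a minimal sketch: the values mirror the sample above, and the issues dict is just a stand-in for whatever mapping the original code passes to list(issues.keys()).

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

tweets_df = pd.DataFrame({
    'datetime': pd.to_datetime(['2019-05-08 07:16:02', '2019-05-08 07:15:23',
                                '2019-05-08 07:14:11', '2019-05-08 07:13:42',
                                '2019-05-08 07:13:27']),
    'user_id': [21741359, 2785265103, 606785697,
                718989200616529921, 939207240728350720],
    'Meta':         [np.nan, np.nan, np.nan, 1.0, 1.0],
    'News & Media': [np.nan, np.nan, 1.0, np.nan, np.nan],
    'Environment':  [1.0, 1.0, np.nan, np.nan, 1.0],
})

# stand-in for the issues mapping; only its keys (the theme column names) matter here
issues = {'Meta': {}, 'News & Media': {}, 'Environment': {}}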

So far I've managed to produce a plot by simply summing each theme per day, using the following code:

monthly_trends = tweets_df.groupby(pd.Grouper(key='datetime', freq='D'))[list(issues.keys())].sum().fillna(0)

which gives me:

             Meta  News & Media   Environment  ...
datetime                                                                
2019-05-07  586.0          25.0          30.0      
2019-05-08  505.0          16.0          70.0      
2019-05-09  450.0          12.0          50.0     
2019-05-10  339.0           8.0          90.0               
2019-05-11  254.0           5.0          10.0    

I plot this with:

monthly_trends.plot(kind='line', figsize=(20,10), linewidth=5, fontsize=20)
plt.xlabel('Date', fontsize=20)
plt.title('Issue activity during the election period', size = 30)
plt.show()

This gives me a nice graph. But since a single user may just be spamming one theme, I'd instead like a count of unique users per theme per day. I've tried adding additional groupbys but only got errors.

Answer 1

Stack all issues, group by issue and day, and count the unique user ids:

df.columns.names = ['issue']                                 # name the columns level so it survives the stack
df_users = (df.set_index(['datetime', 'user_id'])[list(issues)]
              .stack()                                       # long format: one row per (tweet, issue)
              .reset_index()
              .groupby([pd.Grouper(key='datetime', freq='D'), 'issue'])
              .apply(lambda x: len(x.user_id.unique()))      # unique users per day and issue
              .rename('n_unique_users')
              .reset_index())
print(df_users)
    datetime         issue  n_unique_users
0 2019-05-08   Environment               3
1 2019-05-08          Meta               2
2 2019-05-08  News & Media               1
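
As a small aside, the lambda can be swapped for a plain nunique on the grouped user_id column, which does the same thing a bit more directly (same setup as above, just a different aggregation):

df_users = (df.set_index(['datetime', 'user_id'])[list(issues)]
              .stack()
              .reset_index()
              .groupby([pd.Grouper(key='datetime', freq='D'), 'issue'])['user_id']
              .nunique()                    # unique users per (day, issue) pair
              .rename('n_unique_users')
              .reset_index())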

Then you can reshape as required for plotting:

df_users.pivot_table(index='datetime', columns='issue', values='n_unique_users', aggfunc='sum')
issue       Environment  Meta  News & Media
datetime                                   
2019-05-08            3     2             1
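
To close the loop with the plotting code from the question, the wide table can then be fed to DataFrame.plot in the same way (a minimal sketch, assuming matplotlib.pyplot is imported as plt):

daily_users = df_users.pivot_table(index='datetime', columns='issue',
                                   values='n_unique_users', aggfunc='sum')
daily_users.plot(kind='line', figsize=(20, 10), linewidth=5, fontsize=20)
plt.xlabel('Date', fontsize=20)
plt.title('Unique users per issue per day', size=30)
plt.show()
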
Answer 2

For pandas' DataFrame.plot across multiple series you need data in wide format with separate columns. However, the unique user_id calculation needs the data in long format for the aggregation. Therefore, consider melt, then groupby, and finally pivot back to wide for plotting.

### RESHAPE LONG AND AGGREGATE
long_df = (tweets_df.melt(id_vars=['datetime', 'user_id'],
                          value_name='Count', var_name='Issue')
                    .query("Count >= 1")      # keep only rows where the tweet carries the issue
                    .groupby([pd.Grouper(key='datetime', freq='D'), 'Issue'])['user_id']
                    .nunique()                # unique users per day and issue
                    .reset_index()
           )

### RESHAPE WIDE AND PLOT
(long_df.pivot(index='datetime', columns='Issue', values='user_id')
        .plot(kind='line', title='Unique Users by Day and Tweet Issue')
)

plt.show()
plt.clf()
plt.close()