How to compress pandas dataframe

April 12, 2022, at 1:50 PM

Below are a few entries from my dataframe. Each of my dataframes has millions of rows.

import pandas as pd
data = [{'stamp':'12/31/2020 9:35:42 AM', 'value': 21.99, 'trigger': True}, 
        {'stamp':'12/31/2020 10:35:42 AM', 'value': 22.443, 'trigger': False}, 
        {'stamp':'12/31/2020 11:35:42 AM', 'value': 19.00, 'trigger': False}, 
        {'stamp':'12/31/2020 9:45:42 AM', 'value': 45.02, 'trigger': False}, 
        {'stamp':'12/31/2020 9:55:42 AM', 'value': 48, 'trigger': False}, 
        {'stamp':'12/31/2020 11:35:42 AM', 'value': 48.99, 'trigger': False}]
df = pd.DataFrame(data)

Here are two ways I can save it:

df.to_parquet('df.parquet', compression = 'gzip')
df.to_csv('df.csv')

I don't see much improvement from to_parquet compared to to_csv. I want to minimize the file size on disk. Is there any way to do that?

Answer 1

Parquet compresses a column well when that column has, for example, many long runs of the same value. (See wiki for more.) In your example data only trigger shows any sign of that, but the improvement may not be large because that column was not the one taking up most of the space in the first place.
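To make the point about runs concrete, here is a toy run-length encoder applied to the trigger and stamp values from the question. It is only an illustration of why repeated values compress well; the helper name run_length_encode is made up for this sketch, and Parquet's real encodings are more elaborate than this.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


# Column values copied from the question's sample data.
trigger = [True, False, False, False, False, False]
stamp = [
    '12/31/2020 9:35:42 AM',
    '12/31/2020 10:35:42 AM',
    '12/31/2020 11:35:42 AM',
    '12/31/2020 9:45:42 AM',
    '12/31/2020 9:55:42 AM',
    '12/31/2020 11:35:42 AM',
]

# trigger collapses into two runs; stamp has no consecutive repeats,
# so it needs one pair per row and gains nothing from this scheme.
trigger_runs = run_length_encode(trigger)
stamp_runs = run_length_encode(stamp)
print(len(trigger_runs), len(stamp_runs))
```

The stamp column stays at one entry per row, which is consistent with it dominating the file size whatever the container format.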

Saving an integer is cheaper than saving a long string, so consider converting stamp from str to an integer timestamp value, like this:

import numpy as np
df['stamp'] = pd.to_datetime(df['stamp']).values.astype(np.int64) // 10**9

We divide by 10**9 because your stamp appears to be precise only to the second, rather than to the nanosecond, which is the default resolution.

You will then need to convert it back to a readable datetime the next time you read the saved data:

df['stamp'] = pd.to_datetime(df['stamp'] * 10**9)
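A minimal round-trip sketch of the same idea, using three rows from the question. It is a variation rather than the exact code above: it assumes the stamps are naive (no timezone), gets whole seconds by dividing the offset from the Unix epoch by a one-second Timedelta, which does not depend on the internal resolution of the datetime column, and decodes with unit='s'. The variable names are chosen for illustration only.

```python
import pandas as pd

# Three sample rows copied from the question.
df = pd.DataFrame({
    'stamp': ['12/31/2020 9:35:42 AM',
              '12/31/2020 10:35:42 AM',
              '12/31/2020 11:35:42 AM'],
    'value': [21.99, 22.443, 19.00],
    'trigger': [True, False, False],
})

# Parse the strings with an explicit format (assumed from the sample rows).
original = pd.to_datetime(df['stamp'], format='%m/%d/%Y %I:%M:%S %p')

# Encode: whole seconds since the Unix epoch, stored as int64.
epoch_seconds = (original - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)

# Decode: interpret the integers as seconds since the epoch.
restored = pd.to_datetime(epoch_seconds, unit='s')

# The stamps are second-precise, so nothing is lost in the round trip.
round_trip_ok = bool((restored == original).all())

# In-memory footprint of the column data only (index excluded).
string_bytes = df['stamp'].memory_usage(index=False, deep=True)
int_bytes = epoch_seconds.memory_usage(index=False, deep=True)
print(round_trip_ok, string_bytes, int_bytes)
```

The in-memory numbers are only a proxy for what ends up on disk, but they show the direction of the saving: each stamp becomes a fixed 8-byte integer instead of a string of 20 or more characters.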