Removing multiple headers in a single excel sheet

86
April 08, 2022, at 08:10 AM

Similar to Remove header row in Excel using pandas

I wish to remove headers from an excel sheet.

Using the same example but edited:

Company ABC
Account Name 1
Account No.1
Description    group     pair          amount    ...  result
value1         value1    value1        value1    ...  value1
value2         value2    value2        value2    ...  value2
totals                               sum values
Account Name 2
Account No.2
Description    group     pair          amount    ...  result
value3         value3    value3        value3    ...  value3
value4         value4    value4        value4    ...  value4
totals                               sum values
Sales
00234
Description    group     pair          amount    ...  result
value5         value5    value5        value5    ...  value5
value6         value6    value6        value6    ...  value6
totals                               sum values
Inventory
00012345
Description    group     pair          amount    ...  result
value7         value7    value7        value7    ...  value7
value8         value8    value8        value8    ...  value8
value9         value9    value9        value9    ...  value9
totals                               sum values

I wish to concatenate it to become like

cabinet_name   group     pair          amount    ...  result
value1         value1    value1        value1    ...  value1
value2         value2    value2        value2    ...  value2
value3         value3    value3        value3    ...  value3
value4         value4    value4        value4    ...  value4
value5         value5    value5        value5    ...  value5
value6         value6    value6        value6    ...  value6
value7         value7    value7        value7    ...  value7
value8         value8    value8        value8    ...  value8
value9         value9    value9        value9    ...  value9

I've managed to remove the top header by skipping rows e.g.(skiprows = 4). However, This still leaves the other headers and totals which append like so:

cabinet_name   group     pair          amount    ...  result
value1         value1    value1        value1    ...  value1
value2         value2    value2        value2    ...  value2
totals                               sum values
Account Name 2
Account No.2
cabinet_name   group     pair          amount    ...  result
value3         value3    value3        value3    ...  value3
value4         value4    value4        value4    ...  value4
totals                               sum values
Sales
00234
Description    group     pair          amount    ...  result
value5         value5    value5        value5    ...  value5
value6         value6    value6        value6    ...  value6
totals                               sum values
Inventory
00012345
Description    group     pair          amount    ...  result
value7         value7    value7        value7    ...  value7
value8         value8    value8        value8    ...  value8
value9         value9    value9        value9    ...  value9
totals                               sum values

Would be extremely grateful if someone could enlighten me on how to clean this sheet with pandas as all I've seen online are working on only 1 table in a single excel sheet.

Let me know if I've missed anything out, I would gladly edit this question.

I thought this might be useful,

My usual process for cleaning the files in excel, is to firstly delete the first 4 rows, only leaving

cabinet_name   group     pair          amount    ...  result
value1         value1    value1        value1    ...  value1
value2         value2    value2        value2    ...  value2
totals         *blank*   *blank*       sum values
Account Name 2
Account No.2
cabinet_name   group     pair          amount    ...  result
value3         value3    value3        value3    ...  value3
value4         value4    value4        value4    ...  value4
totals         *blank*   *blank*       sum values
Account Name 3
Account No.3
cabinet_name   group     pair          amount    ...  result
value5         value5    value5        value5    ...  value5
value6         value6    value6        value6    ...  value6
totals         *blank*   *blank*       sum values

Then I will filter group or pair to look for blank values in said column and deleting it.

This is the result of

print(df_total.head(8).to_dict())
import datetime
from numpy import nan
{'Date': {0: nan, 1: datetime.datetime(2021, 1, 1, 0, 0), 2: datetime.datetime(2021, 1, 1, 0, 0), 3: datetime.datetime(2021, 1, 29, 0, 0), 4: datetime.datetime(2021, 1, 31, 0, 0), 5: 
'Totals', 6: 'Net difference', 7: nan}, 
'Journal number': {0: nan, 1: 'AX009473', 2: 'AX009473', 3: 'AX003312', 4: 'AX009641', 5: nan, 6: nan, 7: nan}, 
'Voucher': {0: nan, 1: 'TSPN-2021-3', 2: 'TSPN-2021-3', 3: 'GBJ-2021-1', 4: 'VIT-2021-1', 5: nan, 6: nan, 7: nan}, 
'Posting type': {0: nan, 1: nan, 2: nan, 3: 'Ledger journal', 4: 'Ledger journal', 5: nan, 6: nan, 7: nan}, 
'Ledger account': {0: nan, 1: '00388211', 2: '00388211', 3: '00388211', 4: '00388211', 5: nan, 6: nan, 7: nan}, 
'Description': {0: nan, 1: nan, 2: nan, 3: 'DISBERSMENT FOR PETROL', 4: 'TAXI FAIR', 5: nan, 6: nan, 7: nan}, 
'Unnamed: 6': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan}, 
'Unnamed: 7': {0: nan, 1: 'SGD', 2: 'USD', 3: 'SGD', 4: 'SGD', 5: nan, 6: nan, 7: nan}, 
'Amount in transaction currency': {0: 'Debit', 1: 13.55, 2: 0, 3: 0, 4: 5, 5: nan, 6: nan, 7: nan}, 'Unnamed: 9': {0: 'Credit', 1: 0, 2: 25, 3: 52, 4: 0, 5: nan, 6: nan, 7: nan}, 
'Amount in accounting currency': {0: 'Debit', 1: 13.55, 2: 0, 3: 0, 4: 5, 5: 18.55, 6: nan, 7: nan}, 'Unnamed: 11': {0: 'Credit', 1: 0, 2: 33.42, 3: 52, 4: 0, 5: 85.42, 6: 66.87, 7: nan}, 
'Amount in reporting currency': {0: 'Debit', 1: 13.55, 2: 0, 3: 0, 4: 5, 5: 18.55, 6: nan, 7: nan}, 'Unnamed: 13': {0: 'Credit', 1: 0, 2: 33.42, 3: 52, 4: 0, 5: 85.42, 6: nan, 7: nan}, 
'Unnamed: 14': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan}, 'Unnamed: 15': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan}, 'Unnamed: 16': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan, 6: nan, 7: nan}}

Created a specimen excel of what I have

VS

An excel of transformed data

Answer 1

So, we could rework the column names a bit by adding the first row to it (this corrects the wanted columns whose names start with "Unnamed"). Then filter out the column names that start with "Unnamed" (the remaining unwanted ones), then using "Date" column, create a mask and filter the DataFrame:

df = df.rename(columns={**{f'Unnamed: {i}': j for i,j in zip((9,11,13), 
                                                             ('Amount in transaction currency',
                                                              'Amount in accounting currency',
                                                              'Amount in reporting currency'))}, 
                        **{'Unnamed: 7': 'Currency'}})
df.columns = [f'{col}_{first}' if first==first else col for col, first in zip(df.columns, df.loc[0])]
df = df[df.columns[~df.columns.str.startswith('Unnamed')]]
date_filter = df['Date'].apply(isinstance, args=(datetime.datetime,))
df = df[date_filter]

The above code with an explicit loop to construct the list to modify column names:

cols = {'Unnamed: 7': 'Currency', 'Unnamed: 9': 'Amount in transaction currency', 
        'Unnamed: 11': 'Amount in accounting currency', 'Unnamed: 13': 'Amount in reporting currency'}
df = df.rename(columns=cols)
another_cols = []
for col, first in zip(df.columns, df.loc[0]):
    if first==first:
        another_cols.append(f'{col}_{first}')
    else:
        another_cols.append(col)
df.columns = another_cols
df = df[df.columns[~df.columns.str.startswith('Unnamed')]]
date_filter = df['Date'].apply(isinstance, args=(datetime.datetime,))
df = df[date_filter]

Output:

                  Date Journal number      Voucher    Posting type  \
1  2021-01-01 00:00:00       AX009473  TSPN-2021-3             NaN   
2  2021-01-01 00:00:00       AX009473  TSPN-2021-3             NaN   
3  2021-01-29 00:00:00       AX003312   GBJ-2021-1  Ledger journal   
4  2021-01-31 00:00:00       AX009641   VIT-2021-1  Ledger journal   
  Ledger account             Description Currency  \
1       00388211                     NaN      SGD   
2       00388211                     NaN      USD   
3       00388211  DISBERSMENT FOR PETROL      SGD   
4       00388211               TAXI FAIR      SGD   
  Amount in transaction currency_Debit Amount in transaction currency_Credit  \
1                                13.55                                     0   
2                                    0                                    25   
3                                    0                                    52   
4                                    5                                     0   
  Amount in accounting currency_Debit Amount in accounting currency_Credit  \
1                               13.55                                    0   
2                                   0                                33.42   
3                                   0                                   52   
4                                   5                                    0   
  Amount in reporting currency_Debit Amount in reporting currency_Credit  
1                              13.55                                   0  
2                                  0                               33.42  
3                                  0                                  52  
4                                  5                                   0  
Answer 2

Presumably, the first 3 rows and the rows with "totals" and "sum values" have empty cells, so dropna should eliminate those rows. Then drop_duplicates with keep=False parameter should drop the the duplicate column names:

out = df.replace('', np.nan).replace(' ', np.nan).dropna().drop_duplicates()

I just can't test it because your read_clipboard complains about the format of your data.

Answer 3

You could try to remove the unwanted rows like this.

df= df[~df['first_column_name'].str.startswith(('Company name','Account Name','Account No.','cabinet_name'))]
Rent Charter Buses Company
READ ALSO
How to bulk rename JSON files?

How to bulk rename JSON files?

I am trying to rename and replace a certain text with the hexadecimal for thousands of JSON files

70
How to add legend to plotly graph_objects plot that describes the colors, not traces?

How to add legend to plotly graph_objects plot that describes the colors, not traces?

I would like to have a legend on the right side of my plot that has 10 different color points & (representing the 10 different clusters in the graph below)I've provided two lists before, one that contains the color of each point, and the other list contains...

101
How to replace body text in MIME message?

How to replace body text in MIME message?

I am trying to send an automated email to various users with the lake that they are subscribed to changing:

76