Replacing numerical IDs contained as lists of strings and lists of ints in a Pandas DataFrame

April 12, 2022, at 3:00 PM

To anonymize data, I need to replace the original IDs with a different set of new IDs, while identical original IDs still match across all fields after replacement. The challenge is doing so elegantly across 4 different presentations of IDs in this Pandas DataFrame.

I have real-world data where numerical IDs are found in 4 possible formats:

  1. Within a list of strings '["38", "15", "42"]'
  2. Within a list of numbers [14, 42, 94]
  3. As integers 42
  4. As floats 1.0

Here is a generalized small DataFrame with all 4 data types.

import pandas as pd

df = pd.DataFrame(
    [['["38", "15", "42"]', [14, 42, 94], 42, 1.0],
     ['["8", "28"]', [1, 4], 8, 94.0],
     ['["12"]', [12], 12, 12.0]],
    columns=['CommentsID', 'AgentID', 'CaseID', 'TicketID'])
df
| CommentsID            | AgentID         | CaseID  | TicketID |
| --------------------- | --------------- | ------- | -------- |
| ['38', '15', '42']    | [14, 42, 94]    | 42      | 1.0      |
| ['8', '28']           | [1, 4]          | 8       | 94.0     |
| ['12']                | [12]            | 12      | 12.0     |

To keep the generalized example simple, I'm just adding 100 to generate the 'new IDs'. In the actual problem, however, the new IDs are randomly generated, so simply adding 100 throughout is not a solution.

orig_ids = list(range(100))
new_ids = [x + 100 for x in orig_ids]
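Since the real replacement IDs are arbitrary, the pairing can be captured once in a lookup dict (a sketch; `id_map` is my own name, not part of the question):

```python
orig_ids = list(range(100))
new_ids = [x + 100 for x in orig_ids]

# One O(1) dict lookup per ID instead of a linear list.index() scan
id_map = dict(zip(orig_ids, new_ids))
id_map[42]  # -> 142
```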

I would like to find the most efficient way to replace all of the original IDs in the DataFrame with the new IDs across these four data types.

The best solution I have so far splits the work into three parts:

  1. Tackle the float and int versions with a replace() function (this does not affect the lists, even with regex=True):
df = df.replace(orig_ids, new_ids)
  2. For the lists of integers, use a far-from-Pythonic doubled for-loop to match indices on the ID lists:
def newIDnumbers(datacol):
    newlist = []
    for i in range(len(datacol)):
        numlist = [orig_ids.index(x) for x in datacol[i]]
        newlistrow = []
        for idx in range(len(numlist)):
            newlistrow.append(new_ids[numlist[idx]])
        newlist.append(newlistrow)
    return newlist
df.AgentID = newIDnumbers(df.AgentID)
df
  3. For the lists of strings, build string lists of original IDs and new IDs, then use another far-from-Pythonic doubled for-loop to match indices on the ID lists:
str_orig_ids = [str(x) for x in orig_ids]
str_new_ids = [str(x) for x in new_ids]
def newIDstrings(datacol):
    newlist = []
    for i in range(len(datacol)):
        numlist = [str_orig_ids.index(x) for x in datacol.str.findall(r'"(\d*)"')[i]]
        newlistrow = []
        for idx in range(len(numlist)):
            newlistrow.append(str_new_ids[numlist[idx]])
        newlist.append(newlistrow)
    return newlist
df.CommentsID = [str(x) for x in newIDstrings(df.CommentsID)]
df       

Does anyone have a more elegant and less computationally-heavy way to achieve this output?

df
| CommentsID            | AgentID         | CaseID  | TicketID |
| --------------------- | --------------- | ------- | -------- |
| ['138', '115', '142'] | [114, 142, 194] | 142     | 100.0    |
| ['108', '128']        | [101, 104]      | 108     | 194.0    |
| ['112']               | [112]           | 112     | 112.0    |
Answer 1

Here is one way using reshaping and factorize to anonymize the IDs:

from ast import literal_eval
def df_factorize(df):
    idx = df.index
    s = df.reset_index(drop=True).stack()  # stack to ensure consistent
    s[:] = s.factorize()[0]                # factors among all columns
    return s.unstack().set_index(idx)
df2 = (df
   .assign(CommentsID=df['CommentsID'].apply(literal_eval))   # parse str to list
   .explode(['CommentsID', 'AgentID'])                        # lists to rows
   .astype({'CommentsID': int}) # change str to int
   .pipe(df_factorize)                                        # anonymize
   .groupby(level=0)                                          # below to reshape
   .agg({'CommentsID': lambda s: str(list(s.astype(str))),    # to original form
         'AgentID': list,
         'CaseID': 'first',
         'TicketID': 'first',
        })
   .astype(df.dtypes)                                         # original dtypes
)

output:

        CommentsID    AgentID  CaseID  TicketID
0  ['0', '4', '2']  [1, 2, 5]       2       3.0
1       ['6', '7']     [3, 8]       6       5.0
2            ['9']        [9]       9       9.0
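For reference, `pd.factorize` assigns consecutive integer codes in order of first appearance, which is why identical original IDs receive identical codes once all columns are stacked into a single Series (a minimal standalone illustration):

```python
import pandas as pd

codes, uniques = pd.factorize(pd.Series([42, 8, 42, 12]))
# codes:   [0, 1, 0, 2]  -- the repeated 42 keeps its first code
# uniques: [42, 8, 12]
```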

In case you want "real" anonymization, you could use a uuid:

from uuid import uuid4
def df_uuid(df):
    idx = df.index
    s = df.reset_index(drop=True).stack()
    f = s.unique()
    s = s.map(dict(zip(f, [str(uuid4()) for _ in range(len(f))])))
    return s.unstack().set_index(idx)

output:

                                                                                                                 CommentsID                                                                                                             AgentID                                CaseID                              TicketID
0  ['e9fa80ed-58e2-4a96-a8da-0bc0c3a87e14', 'c87ca55a-e0a4-4618-ab5a-2eaf68d5a70f', '7cba72a8-3f2a-42e7-a781-b8ed242ea2ac']  [65f95f56-877e-4a9f-be88-4302fe1f77ea, 7cba72a8-3f2a-42e7-a781-b8ed242ea2ac, 3cf48cd7-e1b0-4c7f-bbeb-41836e309511]  7cba72a8-3f2a-42e7-a781-b8ed242ea2ac  549f1768-39b5-4e74-a8dd-4edbf3bb591f
1                                          ['6e54d7d6-ae2f-4319-950e-7629128d518a', '8fef2e1c-99d1-4460-b8a4-f4dbbf213c5f']                                        [549f1768-39b5-4e74-a8dd-4edbf3bb591f, 98f82da0-fa9c-447a-943b-b11875736128]  6e54d7d6-ae2f-4319-950e-7629128d518a  3cf48cd7-e1b0-4c7f-bbeb-41836e309511
2                                                                                  ['3396c9db-1244-4b2d-bb76-14bb0221778f']                                                                              [3396c9db-1244-4b2d-bb76-14bb0221778f]  3396c9db-1244-4b2d-bb76-14bb0221778f  3396c9db-1244-4b2d-bb76-14bb0221778f
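`df_uuid` operates on scalar cells, so the full table needs the same `literal_eval`/`explode` reshaping used with `df_factorize` before piping it in. On a flat frame it can be exercised directly; the two-column frame below is my own toy example:

```python
import pandas as pd
from uuid import uuid4

def df_uuid(df):
    idx = df.index
    s = df.reset_index(drop=True).stack()
    f = s.unique()
    s = s.map(dict(zip(f, [str(uuid4()) for _ in range(len(f))])))
    return s.unstack().set_index(idx)

small = pd.DataFrame({'CaseID': [42, 8], 'TicketID': [1.0, 42.0]})
out = df_uuid(small)
# The same original ID maps to the same UUID wherever it appears:
assert out.loc[0, 'CaseID'] == out.loc[1, 'TicketID']
```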
Answer 2

If adding 100 is sufficient,

import json
df['CaseID'] = df['CaseID'] + 100
df['TicketID'] = df['TicketID'] + 100
df['AgentID'] = df['AgentID'].apply(lambda x: list(map(lambda y: y+100, x)))
df['CommentsID'] = df['CommentsID'].apply(lambda x: json.dumps(list(map(lambda y: str(int(y)+100), json.loads(x)))))

For the AgentID and CommentsID Series, we can use apply to transform the values element by element.

For AgentID this is easier because it is a column of lists, so we only need to map each integer in the list through the addition.

For CommentsID we need an additional step at the beginning to convert the str-encoded list into a Python list with json.loads, and another step at the end to convert the Python list back to a str with json.dumps.
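The round-trip on a single cell, assuming the question's format, looks like this:

```python
import json

s = '["38", "15", "42"]'
lst = json.loads(s)                                  # -> ['38', '15', '42']
out = json.dumps([str(int(x) + 100) for x in lst])   # -> '["138", "115", "142"]'
```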
