Extracting a pattern from a Python string using a regular expression

February 16, 2018, at 11:15 PM

I have a string from one of the log files as below.

pf_string = "2018-02-01 00:54:49,285 [210.67.123.00]  [ABC,CDE,sfv4_ABC.,dbPool5,11689563,fp2871,en_US]  UNKNOWN-UNKNOWN EVENT-UNKNOWN-UNKNOWN-pc4bcf46t-20180201005446-663570 2994 770 3199 168 26 [Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; ABC-IE11; rv:11.0) like Gecko]     3677610951-0 PERFORMANCE PM_REVIEW FORM_DETAIL [[95211KB 480ms 460ms 20ms 212KB 0KB 118KB 57KB 0 0 ]] 74 139 - - - -   "

Now I want to extract a pattern like below:

Module_id -> PERFORMANCE 
Page Name -> PM_REVIEW 
Page Qualifier -> FORM_DETAIL

Here is the regular expression I tried:

perfLogPatternPage = re.compile('(?P<module_id>\w+)\s(?P<page_name>\w+)\s(?P<page_qualifier>\w+)\s\[\[')

print perfLogPatternPage.match(pf_string).group('module_id')
print perfLogPatternPage.match(pf_string).group('page_name')
print perfLogPatternPage.match(pf_string).group('page_qualifier')

But this doesn't work and doesn't give the right result.

Can someone suggest what's wrong?

Answer 1

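The problem is that re.match() only matches at the start of the string, while the fields you want are in the middle of it. It is enough to call re.search() once, since it scans the whole string for the first match: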

import re
pf_string = "2018-02-01 00:54:49,285 [210.67.123.00]  [ABC,CDE,sfv4_ABC.,dbPool5,11689563,fp2871,en_US]  UNKNOWN-UNKNOWN EVENT-UNKNOWN-UNKNOWN-pc4bcf46t-20180201005446-663570 2994 770 3199 168 26 [Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; ABC-IE11; rv:11.0) like Gecko]     3677610951-0 PERFORMANCE PM_REVIEW FORM_DETAIL [[95211KB 480ms 460ms 20ms 212KB 0KB 118KB 57KB 0 0 ]] 74 139 - - - -   "
m = re.search(r'(?P<module_id>\w+)\s+(?P<page_name>\w+)\s+(?P<page_qualifier>\w+)\s(?=\[\[.)', pf_string)
module_id, page_name, page_qualifier = m.groups()
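To make the difference concrete, here is a minimal sketch that runs the pattern from the question through both match() and search(). The sample line is a shortened stand-in for the original log entry (an assumption made for brevity), and the variable names are only illustrative.

```python
import re

# Shortened stand-in for the log line in the question (assumption, for brevity).
sample = ("2018-02-01 00:54:49,285 [210.67.123.00] 3677610951-0 "
          "PERFORMANCE PM_REVIEW FORM_DETAIL [[95211KB 480ms ]] 74 139")

# Same pattern as in the question, written as a raw string.
pattern = re.compile(
    r'(?P<module_id>\w+)\s(?P<page_name>\w+)\s(?P<page_qualifier>\w+)\s\[\['
)

# match() only tries the pattern at position 0, where the timestamp is.
anchored = pattern.match(sample)

# search() tries every position until the pattern fits.
found = pattern.search(sample)

print(anchored)
print(found.groupdict())
```

Here match() returns None because the line starts with the timestamp, so calling .group() on the result fails, whereas search() locates the three fields just before the "[[".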
Answer 2

You can do:

\d+-\d+\s+(?P<module_id>[A-Z_]+)\s+(?P<page_name>[A-Z_]+)\s+(?P<page_qualifier>[A-Z_]+)
  • \d+-\d+\s+ matches one or more digits, followed by -, followed by one or more digits, then one or more spaces

  • Each named capturing group matches one or more uppercase letters or underscores

  • The \s+ in between the captured groups matches one or more spaces

Example:

In [12]: rcomp = re.compile(r'\d+-\d+\s+(?P<module_id>[A-Z_]+)\s+(?P<page_name>[A-Z_]+)\s+(?P<page_qualifier>[A-Z_]+)')
In [13]: out = rcomp.search(pf_string)
In [14]: out.group('module_id')
Out[14]: 'PERFORMANCE'
In [15]: out.group('page_name')
Out[15]: 'PM_REVIEW'
In [16]: out.group('page_qualifier')
Out[16]: 'FORM_DETAIL'
Answer 3

Your regex requires a few corrections:

  • Start from the start of the string (^).
  • "Consume" three times:
    • A sequence of chars other than [.
    • [ char.
    • A sequence of chars other than ].
    • ] char.
  • "Consume" a sequence of spaces (actually whitespace characters), a sequence of digits or - chars and another sequence of spaces.
  • Then put your 3 named capturing groups, separated with a sequence of spaces.

So the whole regex can look like below:

^(?:[^\[]+\[[^\]]+\]){3}\s+[-\d]+\s+(?P<module_id>\w+)\s+(?P<page_name>\w+)\s+(?P<page_qualifier>\w+)

For a working example see https://regex101.com/r/e048Q3/1
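For completeness, this is roughly how the regex above could be used from Python. It is only a sketch: the name anchored_pattern is illustrative, and pf_string is the log line from the question.

```python
import re

pf_string = "2018-02-01 00:54:49,285 [210.67.123.00]  [ABC,CDE,sfv4_ABC.,dbPool5,11689563,fp2871,en_US]  UNKNOWN-UNKNOWN EVENT-UNKNOWN-UNKNOWN-pc4bcf46t-20180201005446-663570 2994 770 3199 168 26 [Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; ABC-IE11; rv:11.0) like Gecko]     3677610951-0 PERFORMANCE PM_REVIEW FORM_DETAIL [[95211KB 480ms 460ms 20ms 212KB 0KB 118KB 57KB 0 0 ]] 74 139 - - - -"

# The pattern starts with ^, so match() and search() behave the same here.
anchored_pattern = re.compile(
    r'^(?:[^\[]+\[[^\]]+\]){3}'      # skip the three leading [...] blocks
    r'\s+[-\d]+\s+'                  # skip the "3677610951-0" token
    r'(?P<module_id>\w+)\s+'
    r'(?P<page_name>\w+)\s+'
    r'(?P<page_qualifier>\w+)'
)

m = anchored_pattern.match(pf_string)
print(m.group('module_id'), m.group('page_name'), m.group('page_qualifier'))
```

Because the pattern is anchored at the start of the line, it depends on the line having exactly three bracketed blocks before the fields; lines with a different layout would simply not match.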

Answer 4

You can try this:

import re
pf_string = "2018-02-01 00:54:49,285 [210.67.123.00]  [ABC,CDE,sfv4_ABC.,dbPool5,11689563,fp2871,en_US]  UNKNOWN-UNKNOWN EVENT-UNKNOWN-UNKNOWN-pc4bcf46t-20180201005446-663570 2994 770 3199 168 26 [Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; ABC-IE11; rv:11.0) like Gecko]     3677610951-0 PERFORMANCE PM_REVIEW FORM_DETAIL [[95211KB 480ms 460ms 20ms 212KB 0KB 118KB 57KB 0 0 ]] 74 139 - - - -"
results = dict(zip(['Module_id', 'Page Name', 'Page Qualifier'], re.findall(r'(?<=\-\d)[a-zA-Z\s_]+(?=\[\[\d)', pf_string)[0].split()))

Output:

{'Module_id': 'PERFORMANCE', 'Page Qualifier': 'FORM_DETAIL', 'Page Name': 'PM_REVIEW'}
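When processing a whole log file, some lines may not contain these fields at all, and indexing [0] or calling .group() on a failed match raises an exception. A possible way to guard against that is sketched below; parse_page_fields is a hypothetical helper name, the pattern is the one from Answer 2, and the sample line is shortened for brevity.

```python
import re

# Pattern borrowed from Answer 2; PAGE_PATTERN is an illustrative name.
PAGE_PATTERN = re.compile(
    r'\d+-\d+\s+(?P<module_id>[A-Z_]+)\s+(?P<page_name>[A-Z_]+)\s+(?P<page_qualifier>[A-Z_]+)'
)

def parse_page_fields(line):
    """Return the three fields as a dict, or None if the line does not match."""
    m = PAGE_PATTERN.search(line)
    return m.groupdict() if m else None

# Shortened stand-in for the log line in the question (assumption).
sample = ("2018-02-01 00:54:49,285 [210.67.123.00] 3677610951-0 "
          "PERFORMANCE PM_REVIEW FORM_DETAIL [[95211KB 480ms ]] 74 139")

print(parse_page_fields(sample))
print(parse_page_fields("a line without page fields"))
```

The caller can then skip lines for which the helper returns None instead of wrapping every lookup in a try/except.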