How to add whitespace after string.punctuation in Python?

December 15, 2016, at 12:44 PM

I want to clean my review data. Here's my code:

import re
import string

def processData(data):
    data = data.lower()  # lowercase
    data = re.sub(r'<[^>]*>', ' ', data)  # replace HTML tags with a space
    data = re.sub(r'#([^\s]+)', r'\1', data)  # replace #word with word
    remove = string.punctuation
    remove = remove.replace("'", "")  # keep apostrophes
    p = r"[{}]".format(re.escape(remove))  # character class of punctuation to strip
    data = re.sub(p, "", data)  # punctuation is deleted outright
    data = re.sub(r'\s+', ' ', data)  # collapse runs of whitespace
    pp = re.compile(r"(.)\1{1,}", re.DOTALL)  # a character repeated two or more times
    data = pp.sub(r"\1\1", data)  # cap repetitions at two
    return data

This code almost works, but there is still a problem. For the sentence "she work in public-service",

I get "she work in publicservice".

The problem is that no whitespace is left where the punctuation was removed.

I want the sentence to become "she work in public service".

Can you help me with my code?

Answer 1

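I think you want this: substitute a space for each punctuation character instead of an empty string.

>>> import re
>>> import string
>>> st = 'she works in public-service'
>>> re.sub(r'[{}]'.format(re.escape(string.punctuation)), ' ', st)
'she works in public service'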
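
Folding that substitution back into the function from the question might look like the sketch below. It is not the asker's final code: the name process_data and the module-level constants are made up for illustration, and the trailing strip() is an addition. It keeps the apostrophe exception from the question and runs the whitespace step after the punctuation step so that the inserted spaces are collapsed.

```python
import re
import string

# Hypothetical names, chosen for this sketch only.
# Every punctuation character except the apostrophe, as in the question.
_PUNCT = string.punctuation.replace("'", "")
# re.escape keeps characters such as ], \ and ^ literal inside the class.
_PUNCT_RE = re.compile("[{}]".format(re.escape(_PUNCT)))


def process_data(data):
    data = data.lower()
    data = re.sub(r"<[^>]*>", " ", data)   # replace HTML tags with a space
    data = re.sub(r"#(\S+)", r"\1", data)  # "#word" becomes "word"
    data = _PUNCT_RE.sub(" ", data)        # punctuation becomes a space, not ""
    # Collapse whitespace after the step above; strip() is an addition
    # that drops the space left by leading or trailing punctuation.
    data = re.sub(r"\s+", " ", data).strip()
    # Cap any run of a repeated character at two occurrences.
    data = re.sub(r"(.)\1+", r"\1\1", data, flags=re.DOTALL)
    return data


print(process_data("she work in public-service"))
```

Replacing with a space rather than an empty string means that punctuation next to an existing space produces two spaces, which is why the whitespace step has to come afterwards.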