I want to clean my review data. Here's my code:
import re
import string

def processData(data):
    data = data.lower()  # casefold
    data = re.sub(r'<[^>]*>', ' ', data)  # remove any html
    data = re.sub(r'#([^\s]+)', r'\1', data)  # replace #word with word
    remove = string.punctuation
    remove = remove.replace("'", "")  # don't remove '
    p = r"[{}]".format(remove)  # create the pattern
    data = re.sub(p, "", data)  # delete punctuation
    data = re.sub(r'\s+', ' ', data)  # remove additional whitespace
    pp = re.compile(r"(.)\1{1,}", re.DOTALL)  # pattern for removing repetitions
    data = pp.sub(r"\1\1", data)  # keep at most two of any repeated character
    return data
This code almost works, but there is still a problem. For the sentence "she work in public-service",
I get "she work in publicservice".
The problem is that no whitespace is left where the punctuation was removed.
I want my sentence to look like this: "she work in public service".
Can you help me with my code?
I think you want this:
>>> import re
>>> import string
>>> st = 'she works in public-service'
>>> re.sub(r'[{}]'.format(re.escape(string.punctuation)), ' ', st)
'she works in public service'
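Applied to the function from the question, the same idea might look like the sketch below. It is only one possible arrangement, not a definitive fix: punctuation is replaced with a space instead of an empty string, and the whitespace step that follows collapses any extra spaces this creates. The `re.escape` call and the final `.strip()` are my own additions (assumptions, not part of the original code); the apostrophe is still kept, as in the question.

```python
import re
import string

# Punctuation to replace, keeping the apostrophe as in the question's code.
# re.escape is an added safety measure so characters such as ']' or '\\'
# cannot break the character class.
PUNCT = string.punctuation.replace("'", "")
PUNCT_RE = re.compile("[{}]".format(re.escape(PUNCT)))
REPEAT_RE = re.compile(r"(.)\1+", re.DOTALL)

def processData(data):
    data = data.lower()                         # casefold
    data = re.sub(r"<[^>]*>", " ", data)        # replace HTML tags with a space
    data = re.sub(r"#(\S+)", r"\1", data)       # "#word" -> "word"
    data = PUNCT_RE.sub(" ", data)              # punctuation -> space, not ""
    data = re.sub(r"\s+", " ", data)            # collapse runs of whitespace
    data = REPEAT_RE.sub(r"\1\1", data)         # keep at most two repeated chars
    return data.strip()                         # assumption: trim the ends too

print(processData("she work in public-service"))
```

With this change the hyphen becomes a space, so "public-service" comes out as two words, while a word like "don't" is left alone because the apostrophe is excluded from the pattern.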