Regular expression in bash or sed

March 14, 2019, at 8:10 PM

I have a PHP regular expression that cleans a string read from a file:

return  preg_replace('/[^A-Za-z0-9  \n \)\(\,\%\\@\!?\#\&\;\'\"\-\+.\/"]/','', $string);

I'm using Ubuntu and want to clean the file's content using bash or sed. How can I do this? Thanks!

Answer 1

Remove non-ASCII characters

You appear to simply want to strip out non-ASCII characters (though you're missing each of $*:<=>[]^_`{|}~ and I don't know if that's intentional). There are several ways to do this, including a command written for this express purpose.

  • strings FILENAME
  • tr -cd '[\t\r\n -~]' < FILENAME
  • sed 's/[^\t\r\n -~]//g' FILENAME

The strings utility does this automatically and is great for quickly checking the contents of a binary file with safe output for the terminal. You may dislike the way it separates blocks of text with line breaks.

The other two commands take a list of characters (including ranges by character code) and remove everything that is not on it. In tr (short for "translate"), the -c option takes the complement of the list and -d deletes matches rather than translating them. In sed (short for "stream editor"), I'm running an s/// substitution on an inverted character set like the one you used in your PHP code and replacing each match (the /g flag matches globally) with an empty string.
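As a quick check of the tr and sed commands above, here is a minimal sketch run on a made-up sample string rather than a file. The sample text and the variable names are my own, and I've added LC_ALL=C (not part of the original commands) on the assumption that you want byte-wise matching regardless of your locale. The \t, \r and \n escapes inside the sed bracket expression rely on GNU sed, which is what Ubuntu ships.

```shell
# Hypothetical sample: "café menu" with é encoded as two UTF-8 bytes.
sample=$(printf 'caf\303\251 menu')

# tr: -c complements the listed set, -d deletes everything in that complement.
tr_out=$(printf '%s\n' "$sample" | LC_ALL=C tr -cd '[\t\r\n -~]')

# sed: substitute every character outside the bracket expression with nothing.
sed_out=$(printf '%s\n' "$sample" | LC_ALL=C sed 's/[^\t\r\n -~]//g')

printf '%s\n' "$tr_out" "$sed_out"
```

Both commands should print the sample with the é dropped and everything else untouched.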

The character set (okay, technically that's not the right term for tr usage, e.g. you can't negate it like [^…], but that's why we use tr -c) calls out a few white space characters (tab, carriage return, line feed) and then specifies the range of characters from space ( ) to tilde (~), covered by the codes U+0020 to U+007E.

You may run across [!-~] as well. That's shorthand for all visible ASCII characters. Whitespace is not among them, which is why I had to name those characters explicitly, though the space character (U+0020) immediately precedes the exclamation mark (!, U+0021), so I could just lump it into the range.
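To see the difference between the two ranges, here is a small sketch using tr on a made-up string that contains a space and a tab. The input and variable names are hypothetical, and LC_ALL=C is my addition to keep the ranges byte-based.

```shell
# Hypothetical input: letters separated by a space and a tab.
line=$(printf 'a b\tc')

# Keep only ! through ~ : the space and the tab are both deleted.
graph_only=$(printf '%s' "$line" | LC_ALL=C tr -cd '!-~')

# Keep space through ~ : the space survives, the tab is still deleted.
with_space=$(printf '%s' "$line" | LC_ALL=C tr -cd ' -~')

printf '[%s]\n[%s]\n' "$graph_only" "$with_space"
```

Starting the range at the space keeps ordinary spaces, but tabs and line breaks still have to be listed separately.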

Remove just your listed characters

This requires preserving the list, though I can collapse it by taking advantage of any contiguous character codes:

sed 's/[^\t\r\n -#%-)+-9;?-Z\\a-z]//g' FILENAME

Explanation of above regex. Compare it to your regex or to the more comprehensive non-ASCII regex from the previous section (I added Latin-1 Supplemental to that last link's test set so you can see that it actually matches something).
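If you'd rather try it locally, the following sketch feeds a made-up line through the same expression. The input string and variable names are hypothetical, LC_ALL=C is my addition for byte-wise matching, and GNU sed is assumed for the \t, \r and \n escapes.

```shell
# Hypothetical input with characters that are not on the list: _ = < > $
input='user_name=<bob>@example.com $5'

# Delete every character that is not on the collapsed list.
cleaned=$(printf '%s\n' "$input" | LC_ALL=C sed 's/[^\t\r\n -#%-)+-9;?-Z\\a-z]//g')

printf '%s\n' "$cleaned"
```

The @ and the dot survive because they sit inside the ?-Z and +-9 ranges, while the underscore, equals sign, angle brackets and dollar sign are stripped.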

In place

If you want to save to the same file, you can run sed -i COMMAND FILENAME using either of the s/// commands listed above.
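Since an in-place edit overwrites the file, you may want a backup the first time. Here is a minimal sketch on a throwaway file. The temporary path from mktemp, the sample text and the .bak suffix are my own choices for illustration, and LC_ALL=C is again an addition for byte-wise matching.

```shell
# Throwaway file with one non-ASCII character (ï as two UTF-8 bytes).
tmpfile=$(mktemp)
printf 'na\303\257ve text\n' > "$tmpfile"

# -i edits in place; the attached .bak suffix keeps a copy of the original.
LC_ALL=C sed -i.bak 's/[^\t\r\n -~]//g' "$tmpfile"

result=$(cat "$tmpfile")
printf '%s\n' "$result"

# Clean up both the edited file and its backup.
rm -f "$tmpfile" "$tmpfile.bak"
```

Once you trust the expression, you can drop the suffix and use plain -i.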
