RegEx for removing non ASCII characters from both ends

This expression is not anchored on the left side, and it may run faster if all of your desired characters are similar to those in the example from your question:

([a-z0-9;.-]+)(.*)

Here, we're assuming that you just want to filter out the special characters on the left and right sides of your input strings.

You can add other characters and boundaries to the expression, or simplify it into a faster expression if you wish.


RegEx Descriptive Graph

This graph shows how the expression would work:

[Figure: RegEx descriptive graph]

If you wish to add a boundary on the right side, such as the end-of-line anchor $, you can simply append it:

([a-z0-9;.-]+)(.*)$

You can even list your special characters on both the left and the right of the capturing group.
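As a rough sketch of listing the special characters on both sides, here is one way it might look in Python. The two character lists, the helper name extract_middle, and the simplified sample input are assumptions for illustration; the middle group is made lazy here so that a trailing ; is treated as unwanted rather than as part of the result.

```python
import re

# Assumed character lists for illustration; adjust them to your own input.
# The middle group is lazy so a trailing ';' goes to the right-hand class.
BOTH_SIDES = re.compile(r"^[!@#$]*([a-z0-9;.-]+?)[)(*&^;]*$")


def extract_middle(text):
    """Return the desired middle part, or None if nothing matches (hypothetical helper)."""
    match = BOTH_SIDES.match(text)
    return match.group(1) if match else None


print(extract_middle("!@#$abc-123-4;5.def)(*&^;"))  # abc-123-4;5.def
```

If the input contains characters that appear in neither list, the match fails and the helper returns None, so the lists need to cover everything you expect to see.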

JavaScript Test

const regex = /([a-z0-9;.-]+)(.*)$/gm;
const str = `!@#\$abc-123-4;5.def)(*&^;\\n`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}

Performance Test

This JavaScript snippet benchmarks a variant of that expression, with the special characters listed on the left, using a simple loop.

const repeat = 1000000;
const string = '!@#$abc-123-4;5.def)(*&^;\\n';
const regex = /([!@#$)(*&^;]+)([a-z0-9;.-]+)(.*)$/gm;
let match;

const start = Date.now();

// Run the replacement exactly `repeat` times
for (let i = 0; i < repeat; i++) {
	// Keep only the second capturing group (the desired middle part)
	match = string.replace(regex, "$2");
}

const end = Date.now() - start;
console.log("YAAAY! \"" + match + "\" is a match");
console.log(end / 1000 + " seconds is the runtime of " + repeat + " times benchmark test.");

Python Test

import re

regex = r"([a-z0-9;.-]+)(.*)$"
test_str = "!@#$abc-123-4;5.def)(*&^;\\n"
print(re.findall(regex, test_str))

Output

[('abc-123-4;5.def', ')(*&^;\\n')]

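You can accomplish this by using the caret ^ character at the beginning of a character set to negate its contents. [^a-zA-Z0-9] will match any character that isn't an ASCII letter or digit. Anchoring one run to the start and another to the end of the string matches only the unwanted characters at both ends: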

^[^a-zA-Z0-9]+|[^a-zA-Z0-9]+$
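As a minimal sketch of how this expression could be applied in Python, re.sub can replace both matched runs with an empty string. The helper name trim_non_alnum and the sample input are assumptions made for illustration.

```python
import re

# A leading run OR a trailing run of characters that are not ASCII letters/digits.
NON_ALNUM_ENDS = re.compile(r"^[^a-zA-Z0-9]+|[^a-zA-Z0-9]+$")


def trim_non_alnum(text):
    """Strip non-alphanumeric characters from both ends (hypothetical helper)."""
    return NON_ALNUM_ENDS.sub("", text)


print(trim_non_alnum("!@#$abc-123-4;5.def)(*&^;"))  # abc-123-4;5.def
```

Characters in the middle of the string, such as the - and ; above, are left alone because neither alternative can match there.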

To trim non-word characters (uppercase \W) from the start and end, note that the underscore counts as a word character ([A-Za-z0-9_]), so \W alone will not remove it. To trim underscores as well, put the _ into a character class together with \W.

^[\W_]+|[\W_]+$
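To illustrate why the underscore has to be added explicitly, here is a small Python comparison of the two variants. The sample string and variable names are made up for this sketch.

```python
import re

WITH_UNDERSCORE = re.compile(r"^[\W_]+|[\W_]+$")
WITHOUT_UNDERSCORE = re.compile(r"^\W+|\W+$")

sample = "__--hello_world--__"

# Underscores and hyphens at both ends are removed; the inner '_' is kept.
trimmed = WITH_UNDERSCORE.sub("", sample)

# '_' is a word character, so \W alone cannot match at either end here.
untouched = WITHOUT_UNDERSCORE.sub("", sample)

print(trimmed)    # hello_world
print(untouched)  # __--hello_world--__
```

Keep in mind that in Python 3, \W is Unicode-aware for str patterns, so accented letters count as word characters unless you pass re.ASCII.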

See demo at regex101. This is very similar to @CAustin's answer and @sln's comment.

To get the inverse and match everything from the first to the last alphanumeric character:

[^\W_](?:.*[^\W_])?

Or with alternation (the |[^\W_] branch covers strings that contain just one alphanumeric character):

[^\W_].*[^\W_]|[^\W_]

Use both with re.DOTALL for multiline strings. In regex flavors without such a flag, try [\s\S]* instead of .*
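A short Python sketch of the inverse approach, using re.search with re.DOTALL so that the match can span line breaks. The helper name extract_core and the sample inputs are assumptions for illustration.

```python
import re

# First alphanumeric, then optionally anything up to the last alphanumeric.
CORE = re.compile(r"[^\W_](?:.*[^\W_])?", re.DOTALL)


def extract_core(text):
    """Return the span from the first to the last alphanumeric, or None (hypothetical helper)."""
    match = CORE.search(text)
    return match.group(0) if match else None


print(repr(extract_core("--abc\n_def--")))  # 'abc\n_def'
print(repr(extract_core("**x**")))          # 'x'
print(repr(extract_core("!!!")))            # None
```

The optional group is what lets a string with a single alphanumeric character still match, which is the same job the extra alternation branch does in the second pattern.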
