Regex replace text but exclude when text is between specific tag

This should do the trick:

(?!<a[^>]*>)(Test)(?![^<]*</a>)

Try it yourself on regexr.

Follow-up: As Adam explains in his answer, the first part has no effect and can be dropped entirely:

(Test)(?![^<]*</a>)
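A quick way to check that the two patterns behave the same is to run both over the same input. The following is a minimal sketch using Python's re module; the sample string and the replacement 1234 are assumptions made up for illustration.

```python
import re

# Hypothetical sample input; the replacement "1234" is also just for illustration.
html = 'Lorem Test dolor <a href="#">Test</a> eirmod Test'

with_prefix = r'(?!<a[^>]*>)(Test)(?![^<]*</a>)'
without_prefix = r'(Test)(?![^<]*</a>)'

# The leading lookahead is evaluated at the position where "Test" starts,
# so the next character is always "T" and the lookahead can never fail.
result_a = re.sub(with_prefix, '1234', html)
result_b = re.sub(without_prefix, '1234', html)

print(result_a)             # Lorem 1234 dolor <a href="#">Test</a> eirmod 1234
print(result_a == result_b) # True
```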

Answer

Use

(Test)(?!(.(?!<a))*</a>)

Explanation

Let me remind you of the meaning of some symbols:

1) ?! is a negative lookahead. For example, r(?!d) matches every r that is not directly followed by a d.

2) A negative lookahead therefore needs something in front of it to be useful. On its own, (?!d) consumes no characters and only matches empty positions.

3) The ? can be used as a lazy match. For example .+E would select from

123EEE

the whole string 123EEE. However, .+?E selects as few "any character" matches (.+) as needed, so it would only select 123E.
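The three points above can be tried out directly. This is a small sketch in Python's re module; the sample strings for 1) and 2) are made up for illustration, while 123EEE is the string from the text.

```python
import re

# 1) "r" not directly followed by "d" is replaced by "R".
marked = re.sub(r'r(?!d)', 'R', 'rd red bird rr')
print(marked)   # rd Red bird RR

# 2) A bare lookahead consumes nothing: it only matches empty positions
#    (here: before "a" and at the end of the string, but not before "d").
empty = re.findall(r'(?!d)', 'ad')
print(empty)    # ['', '']

# 3) Greedy vs. lazy quantifier.
greedy = re.search(r'.+E', '123EEE').group()
lazy = re.search(r'.+?E', '123EEE').group()
print(greedy)   # 123EEE
print(lazy)     # 123E
```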

Answer:

The answer by protist suggests (?!<a[^>]*?>)(Test)(?![^<]*?</a>). Let me first explain how to make this shorter.

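As mentioned in 2), the lookahead in front of the match has no effect here: it is checked at the position where Test begins, so the next character is always T and never <. The following is therefore equivalent to protist's answer: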

(Test)(?![^<]*?</a>)

Also, since [^<] cannot match <, the lazy ? is superfluous, so this is also equivalent to

(Test)(?![^<]*</a>)

This matches every Test except those that are followed by </a> with no < in between. This is why a Test that appears before or after any <a ...> .. </a> is replaced.

However, note that

Lorem Test dolor <a href="http://Test.com/url">Test <strong>dolor</strong></a> eirmod

would be changed to

Lorem 1234 dolor <a href="http://1234.com/url">1234 <strong>dolor</strong></a> eirmod 
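To reproduce this, here is a short sketch in Python's re module, using the input from above and 1234 as the replacement:

```python
import re

source = ('Lorem Test dolor <a href="http://Test.com/url">'
          'Test <strong>dolor</strong></a> eirmod')

# [^<]* cannot cross any "<", so the lookahead only detects </a> when no
# other tag sits between Test and </a>. Here <strong> (or the rest of the
# <a ...> tag) gets in the way, so every Test is replaced.
replaced = re.sub(r'(Test)(?![^<]*</a>)', '1234', source)
print(replaced)
```

Both the Test inside the href attribute and the Test in the link text are replaced, which is not what was intended.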

In order to catch that you could change your regex to

(Test)(?!(.(?!<a))*</a>)

which does the following:

Select every word Test that is not followed by a string ***</a> where each character in *** is not followed by <a.

Note that the dot . is important (see 2)).

Note that a lazy match like (Test)(?!(.(?!<a))*?</a>) makes no difference, because nested links (something like <a href="#">..<a href="#">...</a>..</a>) are not allowed in HTML4 or HTML5.
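Here is a sketch of this regex in Python's re module, run on the same input as before. The last part is my own addition and not from the explanation above: it shows an edge case (Test immediately followed by <a, with nothing in between) and a possible variant that checks for <a before consuming each character. Treat that variant as a suggestion under these assumptions, not as a definitive fix.

```python
import re

source = ('Lorem Test dolor <a href="http://Test.com/url">'
          'Test <strong>dolor</strong></a> eirmod')

# (.(?!<a))* walks forward over characters that are not directly followed
# by "<a". If it can reach "</a>" that way, this Test is treated as being
# inside a link and is skipped.
pattern = r'(Test)(?!(.(?!<a))*</a>)'
result = re.sub(pattern, '1234', source)
print(result)

# Edge case: with Test directly in front of "<a", the first "." consumes
# the "<" itself, so the guard is bypassed and Test is left untouched.
edge = 'Test<a href="#">x</a>'
edge_original = re.sub(pattern, '1234', edge)

# Suggested variant (an assumption, not part of the answer): test for "<a"
# before consuming each character instead of after.
variant = r'(Test)(?!((?!<a).)*</a>)'
edge_variant = re.sub(variant, '1234', edge)
print(edge_original, edge_variant)
```

Also keep in mind that . does not match a newline by default, so a link whose content spans several lines would need a flag such as re.DOTALL.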

protist said

Also, using regexes on raw HTML is not recommended.

I agree with that. One problem is that the result goes wrong if a tag is not opened or not closed. For example, (Test)(?!(.(?!<a))*</a>) would change

Lorem Test dolor Test <strong>dolor</strong></a> eirmod Test dolores sea Test takimata

to

Lorem Test dolor Test <strong>dolor</strong></a> eirmod 1234 dolores sea 1234 takimata 

(?!<a[^>]*?>)(Test)(?![^<]*?</a>)

Same as zb226's answer, but optimized with a lazy match.

Also, using regexes on raw HTML is not recommended.
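For comparison, here is a rough sketch of the parser-based route, using html.parser from Python's standard library. The class name ReplaceOutsideLinks and the helper replace_outside_links are hypothetical names chosen for this example. It rebuilds the markup piece by piece and only touches text that is not inside an <a> element. It is not a complete HTML rewriter (for instance, it does not treat script or style content specially).

```python
from html.parser import HTMLParser


class ReplaceOutsideLinks(HTMLParser):
    """Rebuilds the markup, replacing a word only in text outside <a>...</a>."""

    def __init__(self, word, replacement):
        # Keep entities as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.word = word
        self.replacement = replacement
        self.depth = 0   # number of currently open <a> elements
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.depth += 1
        self.out.append(self.get_starttag_text())  # original tag text

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == 'a' and self.depth > 0:
            self.depth -= 1
        self.out.append(f'</{tag}>')

    def handle_data(self, data):
        if self.depth == 0:
            data = data.replace(self.word, self.replacement)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f'&{name};')

    def handle_charref(self, name):
        self.out.append(f'&#{name};')

    def handle_comment(self, data):
        self.out.append(f'<!--{data}-->')

    def handle_decl(self, decl):
        self.out.append(f'<!{decl}>')


def replace_outside_links(html, word, replacement):
    parser = ReplaceOutsideLinks(word, replacement)
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)


sample = ('Lorem Test dolor <a href="http://Test.com/url">'
          'Test <strong>dolor</strong></a> eirmod Test')
print(replace_outside_links(sample, 'Test', '1234'))
```

Because attribute values are never passed to handle_data, the Test inside the href stays untouched without any lookahead tricks.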

Tags:

Regex