How to remove a tag and its contents using a regular expression?

You do not want to use regular expressions for this. A much better solution would be to load your contents into a DOMDocument and work on it using the DOM tree and standard DOM methods:

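$document = new DOMDocument();
$document->loadXML('<root/>');
// A fragment lets you parse markup that has no single root element.
$fragment = $document->createDocumentFragment();
$fragment->appendXML($myTextWithTags);
$document->documentElement->appendChild($fragment);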

$MY_TAGs = $document->getElementsByTagName('MY_TAG');
foreach($MY_TAGs as $MY_TAG)
{
    $xmlContent = $document->saveXML($MY_TAG);
    /* work on $xmlContent here */

    /* as a further example: */
    $ems = $MY_TAG->getElementsByTagName('em');
    foreach($ems as $em)
    {
        $emphasizedText = $em->nodeValue;
        /* do your operations here */
    }
}
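
The same DOM approach can also do the removal itself. The following is a minimal sketch, assuming the text is well-formed XML; the sample string and the `$result` variable are illustrative and not part of the question. Because getElementsByTagName() returns a live list, the nodes are copied into an array before any of them is removed.

```php
<?php
// Hypothetical sample input standing in for $myTextWithTags.
$myTextWithTags = 'Keep <MY_TAG id="a">drop <em>this</em></MY_TAG> and this.';

$document = new DOMDocument();
$document->loadXML('<root/>');
$fragment = $document->createDocumentFragment();
$fragment->appendXML($myTextWithTags);
$document->documentElement->appendChild($fragment);

// Copy the live node list first; removing nodes while iterating it skips elements.
$nodes = iterator_to_array($document->getElementsByTagName('MY_TAG'));
foreach ($nodes as $node) {
    if ($node->parentNode !== null) {
        $node->parentNode->removeChild($node);
    }
}

// Serialize the remaining children of <root> back into a string.
$result = '';
foreach ($document->documentElement->childNodes as $child) {
    $result .= $document->saveXML($child);
}
```

Unlike the regular expression variants, this also copes with nested MY_TAG elements, since the whole subtree is removed along with its parent node.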

For removal I ended up just using this:

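$str = preg_replace('~<MY_TAG(.*?)</MY_TAG>~si', "", $str);

Using ~ instead of / as the delimiter solved the errors thrown because of the forward slash in the end tag, which seemed to be an issue even with escaping. Leaving > out of the opening tag allows for attributes or other characters and still matches the tag and all of its contents. Note that it also matches any tag whose name merely starts with MY_TAG, such as <MY_TAGS>.

This only works where nesting is not a concern.

The si modifiers mean s = the dot also matches line break characters, i = case insensitive. Do not combine the U (ungreedy) modifier with .*?: U inverts the greediness of every quantifier, so .*? would become greedy and the match would run from the first opening tag to the last closing tag.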
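
How the U modifier interacts with a lazy quantifier is easy to check on a small input. The sketch below uses a made-up sample string with two tags; the variable names are illustrative, and a \b is added after the tag name so that longer tag names are not matched by accident.

```php
<?php
// Hypothetical input: two separate tags, one with an attribute and a line break.
$str = "a<MY_TAG x=\"1\">one\n</MY_TAG>b<my_tag>two</my_tag>c";

// Lazy .*? with the s and i modifiers: each tag is removed on its own.
$lazy = preg_replace('~<MY_TAG\b.*?</MY_TAG>~si', '', $str);

// Adding U inverts greediness, so .*? turns greedy and also swallows the "b"
// between the two tags.
$flipped = preg_replace('~<MY_TAG\b.*?</MY_TAG>~Usi', '', $str);
```

Here `$lazy` is "abc", while `$flipped` is "ac" because the single greedy match spans both tags.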


If MY_TAG cannot be nested, try this to get the matches:

preg_match_all('/<MY_TAG>(.*?)<\/MY_TAG>/s', $str, $matches);

To remove them, use preg_replace with the same pattern and an empty replacement string instead.
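
As a rough illustration of both calls, assuming a sample string with two attribute-free, non-nested tags (the input and the `$stripped` name are made up for this sketch):

```php
<?php
// Hypothetical input with two non-nested tags, one spanning two lines.
$str = "x<MY_TAG>one</MY_TAG>y<MY_TAG>two\nlines</MY_TAG>z";

// $matches[0] holds the full matches, $matches[1] the captured tag contents.
preg_match_all('/<MY_TAG>(.*?)<\/MY_TAG>/s', $str, $matches);

// Same pattern with an empty replacement removes each tag and its contents.
$stripped = preg_replace('/<MY_TAG>.*?<\/MY_TAG>/s', '', $str);
```

After this runs, `$matches[1]` contains "one" and "two\nlines", and `$stripped` is "xyz".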


Although the only fully correct way to do this is not to use regular expressions, you can get what you want if you accept it won't handle all special cases:

preg_match('/<em\b[^>]*>.*?<\/em>/is', $str, $match);
// Use this only if you aren't worried about nested tags.
// It handles tags with attributes; the s modifier lets the contents span several lines.

And, to remove the tag together with its contents:

$str = preg_replace('/<MY_TAG\b[^>]*>.*?<\/MY_TAG>/is', "", $str);
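
To make the nesting limitation concrete, here is a small sketch that wraps the pattern in a helper. The `removeTag` function and the sample inputs are hypothetical, not part of any library; preg_quote() is used so the tag name cannot inject pattern syntax.

```php
<?php
// Hypothetical helper: removes <tag ...>...</tag> pairs, ignoring nesting.
function removeTag(string $text, string $tag): string
{
    $name = preg_quote($tag, '~');
    // \b keeps <em from matching <embed>; s lets the contents span lines.
    return preg_replace("~<{$name}\\b[^>]*>.*?</{$name}>~is", '', $text);
}

// Flat input with an attribute: removed cleanly.
$flat = removeTag('a<em class="x">b</em>c', 'em');

// Nested input: the lazy match stops at the first closing tag,
// leaving the tail of the outer element behind.
$nested = removeTag('a<MY_TAG>b<MY_TAG>c</MY_TAG>d</MY_TAG>e', 'MY_TAG');
```

The flat case yields "ac", but the nested case yields "ad</MY_TAG>e", which is the kind of special case a DOM parser handles and a regular expression does not.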
