XPath: exclude an element and all its children when a parent attribute contains a value

You can apply the Kaysian method to obtain the intersection of two node-sets. The two sets are:

A: The elements which descend from //div[contains(@class, 'post-content')], excluding the current element (since you don't want the root div):

//*[ancestor::div[contains(@class, 'post-content')]]

B: The elements that neither match //*[contains(@class, 'image-container')] nor descend from such an element. The ancestor-or-self axis includes the current element, since you want to exclude the entire tree, including the div and span themselves:

//*[not(ancestor-or-self::*[contains(@class, 'image-container')])] 

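The intersection of those two sets is the solution to your problem. The formula of the Kaysian method is: A [ count(. | B) = count(B) ]. A node from A is kept only if adding it to B leaves the size of B unchanged, which means it was already in B. Applying that to your problem, the result you need is: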

//*[ancestor::div[contains(@class, 'post-content')]]
   [ count(. | //*[not(ancestor-or-self::*[contains(@class, 'image-container')])])
     = 
     count(//*[not(ancestor-or-self::*[contains(@class, 'image-container')])]) ]

This will select the following elements from your example code:

/div/p
/div/p/moredepth
/div/p/moredepth/...
/div/p/moredepth/.../p
/div/p/moredepth/.../li

excluding the span and the div that match the unwanted class, and their descendants.

You can then add extra steps to the expression to select exactly the text or nodes you need.
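To see why the count test amounts to an intersection, here is a minimal Python sketch. It does not evaluate the XPath expression itself: the standard library's ElementTree supports only a subset of XPath (no ancestor axis, no count()), so sets A and B are rebuilt by hand and the same size comparison is applied. The sample document is an assumption, reconstructed from the fragment quoted in this thread with moredepth placed directly under p; the helper names (ancestors, path, has_class) are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (assumed nesting, not the asker's full markup).
DOC = """
<div class="post-content">
  <p>
    <moredepth>
      <span class="image-container float_right">
        <div class="some_element">image1</div>
        <p>do not need this</p>
      </span>
      <div class="image-container float_right">image2</div>
      <p>text1</p>
      <li>text2</li>
    </moredepth>
  </p>
</div>
"""

root = ET.fromstring(DOC)
# ElementTree has no parent pointers, so build a child -> parent map.
parent = {child: p for p in root.iter() for child in p}


def ancestors(elem):
    """Yield the ancestors of elem, nearest first (the ancestor axis)."""
    while elem in parent:
        elem = parent[elem]
        yield elem


def path(elem):
    """Return a simple /tag/tag/... path, for display only."""
    tags = [elem.tag] + [a.tag for a in ancestors(elem)]
    return "/" + "/".join(reversed(tags))


def has_class(elem, token):
    """Substring test on @class, like contains(@class, token)."""
    return token in elem.get("class", "")


everything = list(root.iter())  # document order

# A: elements with a div ancestor whose class contains 'post-content'.
A = [e for e in everything
     if any(a.tag == "div" and has_class(a, "post-content")
            for a in ancestors(e))]

# B: elements with no ancestor-or-self whose class contains 'image-container'.
B = {e for e in everything
     if not any(has_class(n, "image-container") for n in [e, *ancestors(e)])}

# Kaysian test: e belongs to B exactly when adding it does not grow B.
selected = [e for e in A if len({e} | B) == len(B)]

for e in selected:
    print(path(e))
```

For this assumed document the sketch prints /div/p, /div/p/moredepth, /div/p/moredepth/p and /div/p/moredepth/li. With a full XPath 1.0 engine you would evaluate the expression directly instead of rebuilding the sets.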


XPath does not allow manipulating a fragment of XML once it is returned to you by a path expression. So, you cannot select moredepth:

//moredepth

without getting as a result all of this element node, including all descendant nodes that you'd like to exclude:

<moredepth>
<span class="image-container float_right">
<div class="some_element">
image1
</div>
<p>do not need this</p>
</span>
<div class="image-container float_right">
image2
</div>
<p>text1</p>
<li>text2</li>
</moredepth>

What you can do is select only those child elements of moredepth whose class does not contain image-container:

//div[contains(@class, 'post-content')]/p/moredepth/*[not(contains(@class,'image-container'))]

which will yield (individual results separated by -------):

<p>text1</p>
-----------------------
<li>text2</li>
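The difference between the two selections can be illustrated with a short Python sketch. It uses the standard library's ElementTree, whose XPath subset has no contains() or not(), so the class filter is written as a plain Python substring test rather than as the predicate above. The variable names are illustrative, and the input is the moredepth fragment quoted earlier in this answer.

```python
import xml.etree.ElementTree as ET

MOREDEPTH = """<moredepth>
<span class="image-container float_right">
<div class="some_element">
image1
</div>
<p>do not need this</p>
</span>
<div class="image-container float_right">
image2
</div>
<p>text1</p>
<li>text2</li>
</moredepth>"""

node = ET.fromstring(MOREDEPTH)

# Selecting the element itself returns the whole subtree,
# so its text still includes the unwanted parts.
all_text = [t.strip() for t in node.itertext() if t.strip()]

# Keep only the direct children whose class attribute does not
# contain 'image-container' (mirrors *[not(contains(@class, ...))]).
kept = [child for child in node
        if "image-container" not in child.get("class", "")]
kept_text = [child.text for child in kept]

print(all_text)
print(kept_text)
```

Here all_text still contains image1, do not need this and image2, while kept_text holds only text1 and text2, which matches the two results listed above.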