  1. In a DOM, how to find the right-most element that is left of a given element and matches criteria using lxml or xpath
    <p>I'm working on a function that determines whether the content of a given HTML element - el - in an lxml ElementTree is the leading content of a line in a rendered HTML page. To do this, I'm trying to find the right-most block level element that is left of el, and then determine if there is content between these two.</p>
    <p>I figure this can happen via a traversal in the reverse order of a DFS, with the reverse traversal starting at el. But I've also been trying to find out whether a simpler method exists using lxml or xpath. So far I've found ways to find elements that are ancestors or left siblings of a given element with some criteria, but I haven't spotted anything that works on the entire tree right (or left) of a specific node.</p>
    <p>Does anybody know of a simpler way to do this search using lxml or xpath?</p>
    <p><strong>Example</strong></p>
<pre><code>&lt;html&gt;
  &lt;body class="first"&gt;
    root
    &lt;!-- A span that does not have its own content, but does have several levels of children --&gt;
    &lt;span&gt;
      &lt;a&gt;
        &lt;b&gt;
          &lt;h1 class="first"&gt;
            A block level that is the descendant of several non block levels
          &lt;/h1&gt;
        &lt;/b&gt;
      &lt;/a&gt;
      &lt;span class="first" id="tricky"&gt;
        A non-block level that has no block levels among its ancestors, but a block level element among its left cousins
      &lt;/span&gt;
      &lt;span&gt;
        A non-block level that has no block levels among its ancestors, and content between itself and its nearest left-cousin block level
      &lt;/span&gt;
    &lt;/span&gt;
    &lt;div class="first"&gt;
      a block level
    &lt;/div&gt;
    &lt;div&gt;
      &lt;span class="first"&gt;first content in a non block level in a block level&lt;/span&gt;
      &lt;span&gt;following content in a non block level in a block level&lt;/span&gt;
    &lt;/div&gt;
    &lt;div&gt;
      &lt;i&gt; &lt;/i&gt;&lt;b class="first"&gt;a non block level that contains the first content within a block level, but follows an empty non-block level&lt;/b&gt;
    &lt;/div&gt;
  &lt;/body&gt;
&lt;/html&gt;
</code></pre>
    <p>In the above, I've added a "first" class to any element that, when rendered, would appear to present the leading content of a line. Of particular interest is the element with id "tricky", because that element will present the first content of a line even though none of its ancestors nor its siblings are block level elements. "tricky" will be on a new line because a descendant of one of its siblings (the h1) is a block level, and there is no other content following that h1.</p>
    <p><strong>Follow Up</strong> At this point I have written a function in Python that does a type of backwards traversal. It's a bit complicated, but it seems to work:</p>
<pre><code>block_level = {'blockquote', 'br', 'dd', 'div', 'dl', 'dt', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
               'hr', 'li', 'ol', 'p', 'pre', 'td', 'ul'}

# Returns true if the content of the provided element is the leading content of a line
# This function runs on HTML elements before any translation occurs
# Here 'content' refers to non-whitespace characters
def is_first_in_line_html(self, el):
    # This element contains no content, so it can't be the leading content of a line.
    if el.text is None or el.text.strip() == '':
        return False
    # This element has content and is a block level, so its content is the leading content of a line.
    if el.tag in block_level:
        return True
    # This element has content, is not a block level, and is the body element.
    # Definitely leading content of a line.
    if el.tag == 'body':
        return True

    # Final case - is there content between the present element and the nearest
    # block level element to the left of the present element.
    def traverse_children(element, bound_text):
        # Visit children right to left; None means "no block level found yet".
        children = element.iterchildren(reversed=True)
        for child in children:
            if child.tail is not None:
                bound_text = child.tail + bound_text
                if bound_text.strip() != '':
                    return False
            if child.tag in block_level:
                return bound_text.strip() == ''
            rst_children = traverse_children(child, bound_text)
            if rst_children is not None:
                return rst_children
            if child.text is not None:
                bound_text = child.text + bound_text
                if bound_text.strip() != '':
                    return False
        return None

    def traverse_left_sibs_and_ancestors(element, bound_text):
        left_sibs = element.itersiblings(preceding=True)
        for sib in left_sibs:
            if sib.tail is not None:
                bound_text = sib.tail + bound_text
                if bound_text.strip() != '':
                    return False
            if sib.tag in block_level:
                return bound_text.strip() == ''
            rst_children = traverse_children(sib, bound_text)
            if rst_children is not None:
                return rst_children
            if sib.text is not None:
                bound_text = sib.text + bound_text
                if bound_text.strip() != '':
                    return False
        parent = element.getparent()
        # Text before the parent's first child is parent.text;
        # parent.tail comes after the parent's end tag, i.e. to the right.
        if parent.text is not None:
            bound_text = parent.text + bound_text
        if parent.tag == 'body':
            return bound_text.strip() == ''
        if parent.tag in block_level:
            return bound_text.strip() == ''
        # Pass the accumulated text along when moving up a level.
        return traverse_left_sibs_and_ancestors(parent, bound_text)

    return traverse_left_sibs_and_ancestors(el, '')
</code></pre>
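The reverse-DFS idea described in the question can also be sketched by flattening the tree into a document-order list of events and scanning that list backwards from el. This is only a minimal sketch under stated assumptions, not a drop-in replacement for the function above: it uses the standard library's xml.etree.ElementTree (which exposes the same .text/.tail model as lxml) on a small well-formed snippet, it reuses the question's block-level tag set, and the names linearize and starts_line are hypothetical.

```python
import xml.etree.ElementTree as ET

# Same tag set as in the question.
BLOCK_LEVEL = {'blockquote', 'br', 'dd', 'div', 'dl', 'dt', 'h1', 'h2', 'h3',
               'h4', 'h5', 'h6', 'hr', 'li', 'ol', 'p', 'pre', 'td', 'ul'}


def linearize(element):
    """Yield ('open', el), ('text', str) and ('close', el) in document order."""
    yield ('open', element)
    if element.text:
        yield ('text', element.text)
    for child in element:
        yield from linearize(child)
        # A child's tail is the text between its end tag and the next sibling.
        if child.tail:
            yield ('text', child.tail)
    yield ('close', element)


def starts_line(root, el):
    """Hypothetical helper: True if el's own text is the first content on a line."""
    if not (el.text and el.text.strip()):
        return False                  # no content of its own
    if el.tag in BLOCK_LEVEL or el.tag == 'body':
        return True                   # block levels always start a line
    events = list(linearize(root))
    index = next(i for i, (kind, node) in enumerate(events)
                 if kind == 'open' and node is el)
    # Walk backwards from el's start tag.
    for kind, node in reversed(events[:index]):
        if kind == 'text':
            if node.strip():
                return False          # content sits between el and any block boundary
        elif node.tag in BLOCK_LEVEL or node.tag == 'body':
            return True               # a block level opens or closes first
    return True                       # reached the start of the document


# A reduced version of the question's example.
doc = ET.fromstring(
    '<body>root<span><a><b><h1>heading</h1></b></a>'
    '<span id="tricky">tricky</span> <span id="later">later</span></span>'
    '<div><i> </i><b id="lead">lead</b><span id="rest">rest</span></div></body>'
)
by_id = {e.get('id'): e for e in doc.iter() if e.get('id')}
for name, element in by_id.items():
    print(name, starts_line(doc, element))
```

Treating both the start and the end tag of a block level as a line boundary is what lets the "tricky" case fall out without separate sibling, child, and ancestor traversals; the cost is that the whole tree is flattened on every call, so for repeated queries the event list would be worth caching.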
 
