<p>I suggest basing your function on DOMDocument and DOMXPath rather than on regular expressions. Even though regular expressions are quite powerful, you run into problems like the one you describe, which are not (always) easy to solve robustly with them.</p>
<p>The general saying is: Don't parse HTML with regular expressions.</p>
<p>It's a good rule to keep in mind. As with any rule, it does not always apply, but it's worth making up your own mind about it.</p>
<p>XPath allows you to find the search terms within text content only, ignoring all XML elements.</p>
<p>Then you only need to wrap those texts in a <code>&lt;span&gt;</code> and you're done.</p>
<p><strong>Edit:</strong> Finally some code ;)</p>
<p>First, it uses <code>xpath</code> to locate elements that contain the search text. My query looks like this (it could probably be written better, I'm not an XPath expert):</p>
<pre><code>'//*[contains(., "'.$search.'")]/*[FALSE = contains(., "'.$search.'")]/..'
</code></pre>
<p><code>$search</code> holds the text to search for and must <em>not</em> contain any <code>"</code> (double quote) character, as that would break the query (see <a href="https://stackoverflow.com/q/188834/367456">Cleaning/sanitizing xpath attributes</a> for a workaround if you need quotes).</p>
<p>This query returns all parent elements whose text nodes, put together, form a string that contains your search term.</p>
<p>Because such a list is not easy to process further as-is, I created a <code>TextRange</code> class that represents a list of <code>DOMText</code> nodes.
It is useful for performing string operations on a list of text nodes as if they were one string.</p>
<p>This is the base skeleton of the routine:</p>
<pre><code>$str = '...'; # some XML
$search = 'text that span';

printf("Searching for: (%d) '%s'\n", strlen($search), $search);

$doc = new DOMDocument;
$doc-&gt;loadXML($str);
$xp = new DOMXPath($doc);

$anchor = $doc-&gt;getElementsByTagName('body')-&gt;item(0);
if (!$anchor) {
    throw new Exception('Anchor element not found.');
}

// search elements that contain the search-text
$r = $xp-&gt;query('//*[contains(., "'.$search.'")]/*[FALSE = contains(., "'.$search.'")]/..', $anchor);
if (!$r) {
    throw new Exception('XPath failed.');
}

// process search results
foreach ($r as $i =&gt; $node) {
    $textNodes = $xp-&gt;query('.//child::text()', $node);

    // extract $search textnode ranges, create fitting nodes if necessary
    $range = new TextRange($textNodes);
    $ranges = array();
    while (FALSE !== $start = strpos($range, $search)) {
        $base = $range-&gt;split($start);
        $range = $base-&gt;split(strlen($search));
        $ranges[] = $base;
    }

    // wrap every matching text node in a highlighting span
    foreach ($ranges as $range) {
        foreach ($range-&gt;getNodes() as $textNode) {
            $span = $doc-&gt;createElement('span');
            $span-&gt;setAttribute('class', 'search_hightlight');
            // replaceChild() returns the replaced node, which is then moved into the span
            $textNode = $textNode-&gt;parentNode-&gt;replaceChild($span, $textNode);
            $span-&gt;appendChild($textNode);
        }
    }
}
</code></pre>
<p>For my example XML:</p>
<pre><code>&lt;html&gt;
&lt;body&gt;
This is some &lt;span&gt;text&lt;/span&gt; that span across a page to search in.
and more text that span&lt;/body&gt;
&lt;/html&gt;
</code></pre>
<p>It produces the following result:</p>
<pre><code>&lt;html&gt;
&lt;body&gt;
This is some &lt;span&gt;&lt;span class="search_hightlight"&gt;text&lt;/span&gt;&lt;/span&gt;&lt;span class="search_hightlight"&gt; that span&lt;/span&gt; across a page to search in.
and more &lt;span class="search_hightlight"&gt;text that span&lt;/span&gt;&lt;/body&gt;
&lt;/html&gt;
</code></pre>
<p>This shows that the approach even finds text that is distributed across multiple tags. That is not easily possible with regular expressions.</p>
<p>You can find the full code here: <a href="http://codepad.viper-7.com/U4bxbe" rel="nofollow noreferrer">http://codepad.viper-7.com/U4bxbe</a> (including the <code>TextRange</code> class, which I left out of the example in this answer).</p>
<p>It does not work properly on the viper codepad because that site uses an older LIBXML version. It works fine with my LIBXML version 20707. I created a related question about this issue: <a href="https://stackoverflow.com/q/8195733/367456">XPath query result order</a>.</p>
<p><strong>A note of warning:</strong> This example uses a binary (byte-based) string search (<code>strpos</code>) and passes the resulting offsets to the <a href="http://php.net/manual/en/domtext.splittext.php" rel="nofollow noreferrer"><code>DOMText::splitText</code></a> function to split text nodes. That can lead to wrong offsets, because that function needs the UTF-8 character offset. The correct method is to use <code>mb_strpos</code> to obtain the <code>UTF-8</code> based value.</p>
<p>The example works anyway because its data uses only <code>US-ASCII</code> characters, for which byte offsets and <code>UTF-8</code> character offsets are identical.</p>
<p>In a real-life situation, the <code>$search</code> string should be UTF-8 encoded and <code>mb_strpos</code> should be used instead of <code>strpos</code>:</p>
<pre><code> while(FALSE !== $start = mb_strpos($range, $search, 0, 'UTF-8')) </code></pre>
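<p>To make the offset problem concrete, here is a minimal sketch that compares the byte offset from <code>strpos</code> with the character offset from <code>mb_strpos</code> and then splits a text node at the character offset. The sample strings and the <code>root</code> element are made up for illustration, and the sketch assumes the script file is saved as UTF-8 and that the <code>mbstring</code> and DOM extensions are available.</p>

```php
<?php
// Hypothetical sample data containing multi-byte UTF-8 characters.
$haystack = 'héllo wörld';
$needle   = 'wörld';

// strpos() counts bytes; "é" occupies two bytes in UTF-8.
$byteOffset = strpos($haystack, $needle);

// mb_strpos() counts characters when told the encoding is UTF-8.
$charOffset = mb_strpos($haystack, $needle, 0, 'UTF-8');

// Build a tiny document with a single text node to split.
$doc  = new DOMDocument('1.0', 'UTF-8');
$root = $doc->appendChild($doc->createElement('root'));
$text = $root->appendChild($doc->createTextNode($haystack));

// Split at the character offset so the new node starts at the match.
$tail = $text->splitText($charOffset);

printf("byte offset: %d, character offset: %d\n", $byteOffset, $charOffset);
printf("head: '%s', tail: '%s'\n", $text->nodeValue, $tail->nodeValue);
```

<p>With this sample data the two offsets differ by one (7 bytes versus 6 characters), so splitting at the byte offset would cut one character too late. The length handed to the second split in the routine above is probably affected in the same way, so <code>mb_strlen($search, 'UTF-8')</code> would be the matching replacement for <code>strlen($search)</code>; that is an assumption, since the <code>TextRange</code> class is not shown here.</p>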
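<p>The XPath query in this answer breaks when <code>$search</code> contains a double quote. One common workaround, sketched below, is to build the XPath string literal with <code>concat()</code>. The helper name <code>xpathLiteral</code>, the sample XML and the simplified <code>//p[contains(...)]</code> query are hypothetical and only serve the illustration; this is not necessarily the approach taken in the linked question.</p>

```php
<?php
/**
 * Hypothetical helper: turn an arbitrary string into an XPath 1.0 string literal.
 * XPath 1.0 has no escape character, so mixed quotes are joined via concat().
 */
function xpathLiteral(string $value): string
{
    // No double quote inside: wrap in double quotes.
    if (strpos($value, '"') === false) {
        return '"' . $value . '"';
    }
    // No single quote inside: wrap in single quotes.
    if (strpos($value, "'") === false) {
        return "'" . $value . "'";
    }
    // Both quote types present: split on double quotes and rejoin with concat().
    $parts = array_map(
        function ($part) {
            return '"' . $part . '"';
        },
        explode('"', $value)
    );
    return 'concat(' . implode(", '\"', ", $parts) . ')';
}

// Made-up sample document whose text contains both quote types.
$doc = new DOMDocument;
$doc->loadXML('<body><p>He said "it\'s fine" today</p></body>');
$xp = new DOMXPath($doc);

$search = 'said "it\'s';
$nodes  = $xp->query('//p[contains(., ' . xpathLiteral($search) . ')]');

printf("%d match(es)\n", $nodes->length);
```

<p>The same helper could replace the hand-written <code>"'.$search.'"</code> fragments in the query above, so that the search text no longer has to be free of quote characters.</p>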
 
