Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    text
    copied!<p>It's a slightly tricky one, but XML::Twig is designed (and I use it heavily) to this kind of processing. So there is a specific method, called <code>mark</code>, that takes a regexp and tags the matches.</p> <p>In this case the regexp will likely be quite big. I used Regexp::Assempble to build it, so it gets optimized. Then an other problem is that <code>mark</code> doesn't let you use the text of the match to set an attribute (I might work on this in the next version of the module, that would be useful), so I has to mark first, then go back and set the <code>href</code> attribute in a second pass (in any case the second pass is needed to "un-link" words that have already been linked).</p> <p>One last word: I nearly gave up on writing the solution, because your example data has a few typos. There is nothing worse than getting the code right, just to see the test still fail because you use 'dictionary' in the code and 'definitions' in the data, or 'furtykurtle', 'furtikurty' and 'furtijurty' where it should all be the same word. So please, before posting, make sure your data is right. Thankfully I was writing the code as a test.</p> <pre><code>#!/usr/bin/perl use strict; use warnings; use XML::Twig; use Regexp::Assemble; use Test::More tests =&gt; 1; use autodie qw(open); my %dictionary = ( frobnitz =&gt; 'definitions.html#frobnitz', crulps =&gt; 'definitions.html#crulps', furtikurty =&gt; 'definitions.html#furtikurty', ); my $match_defs= Regexp::Assemble-&gt;new() -&gt;add( keys %dictionary) -&gt;anchor_word -&gt;as_string; # I am not familiar enough with Regexp::Assemble to know a cleaner # way to get get the capturing braces in the regexp $match_defs= qr/($match_defs)/; my $in = data_para(); my $expected = data_para(); my $out; open( my $out_fh, '&gt;', \$out); XML::Twig-&gt;new( twig_roots =&gt; { 'description' =&gt; sub { tag_defs( @_, $out_fh, $match_defs, \%dictionary); } }, twig_print_outside_roots =&gt; $out_fh, ) -&gt;parse( $in); is( $out, $expected, 'base test'); exit; sub tag_defs { my( $t, $description, $out_fh, $match_defs, $dictionary)= @_; my @a= $description-&gt;mark( $match_defs, 'a' ); # word =&gt; 1 when already used in this description # this might need to have a different scope if you need to tag # only the first time the word appears in a section or whatever my $tagged_in_description; foreach my $a (@a) { my $word= $a-&gt;text; warn "checking a: ", $a-&gt;sprint, "\n"; if( $tagged_in_description-&gt;{$word}) { $a-&gt;erase; } # we did not need to tag it after all else { $a-&gt;set_att( href =&gt; $dictionary-&gt;{$word}); } $tagged_in_description-&gt;{$word}++; } $t-&gt;flush( $out_fh); } sub def_href { my( $word)= @_; return $dictionary{word}; } sub data_para { local $/="\n\n"; my $para= &lt;DATA&gt;; return $para; } __DATA__ &lt;article&gt; &lt;author&gt;Smith&lt;/author&gt; &lt;date&gt;2011-10-10&lt;/date&gt; &lt;description&gt;Article about &lt;b&gt;frobnitz&lt;/b&gt;, crulps and furtikurty's. Mainly frobnitz&lt;/description&gt; &lt;/article&gt; &lt;article&gt; &lt;author&gt;Smith&lt;/author&gt; &lt;date&gt;2011-10-10&lt;/date&gt; &lt;description&gt;Article about &lt;b&gt;&lt;a href="definitions.html#frobnitz"&gt;frobnitz&lt;/a&gt;&lt;/b&gt;, &lt;a href="definitions.html#crulps"&gt;crulps&lt;/a&gt; and &lt;a href="definitions.html#furtikurty"&gt;furtikurty&lt;/a&gt;'s. Mainly frobnitz&lt;/description&gt; &lt;/article&gt; </code></pre>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload