StackOverflow2013

Extracting links inside <div>'s with HTML::TokeParser & URI
<p>I'm an old newbie in Perl, and I'm trying to write a subroutine using HTML::TokeParser and URI.</p>
<p>I need to extract ALL valid links enclosed within one div with the id "zone-extract".</p>
<p>This is my code:</p>
<pre><code>#More perl above here... use strict and other subs
use HTML::TokeParser;
use URI;

sub extract_links_from_response {
    my $response = $_[0];

    my $base = URI->new( $response->base )->canonical;
    # "canonical" returns it in the one "official" tidy form

    my $stream   = HTML::TokeParser->new( $response->content_ref );
    my $page_url = URI->new( $response->request->uri );

    print "Extracting links from: $page_url\n";

    my($tag, $link_url);

    while ( my $div = $stream->get_tag('div') ) {

        my $id = $div->get_attr('id');

        next unless defined($id) and $id eq 'zone-extract';

        while( $tag = $stream->get_tag('a') ) {
            next unless defined($link_url = $tag->[1]{'href'});
            next if $link_url =~ m/\s/;     # If it's got whitespace, it's a bad URL.
            next unless length $link_url;   # sanity check!

            $link_url = URI->new_abs($link_url, $base)->canonical;
            next unless $link_url->scheme eq 'http';   # sanity

            $link_url->fragment(undef);     # chop off any "#foo" part

            print $link_url unless $link_url->eq($page_url);   # Don't note links to itself!
        }
    }
    return;
}
</code></pre>
<p>As you can see, I have two loops: the first uses get_tag 'div' and looks for id = 'zone-extract'; the second looks inside that div and retrieves all links (or that was my intention).</p>
<p><b>The inner loop works: running standalone, it extracts all links correctly. But I think there are some issues in the first loop, which looks for my desired div <i>'zone-extract'</i>.
I'm using this post as a reference: <a href="https://stackoverflow.com/questions/1692362/how-can-i-find-the-contents-of-a-div-using-perls-html-modules-if-i-know-a-tag">How can I find the contents of a div using Perl's HTML modules, if I know a tag inside of it?</a></b></p>
<p>But all I have at the moment is this error:</p>
<pre><code>Can't call method "get_attr" on unblessed reference
</code></pre>
<p>Any ideas? My HTML (note URL_TO_EXTRACT_1 &amp; 2):</p>
<pre><code>&lt;more html above here&gt;
&lt;div class="span-48 last"&gt;
  &lt;div class="span-37"&gt;
    &lt;div id="zone-extract" class="..."&gt;
      &lt;h2 class="genres"&gt;&lt;img alt="extracting" class="png"&gt;&lt;/h2&gt;
      &lt;li&gt;&lt;a title="Extr 2" href="**URL_TO_EXTRACT_1**"&gt;2&lt;/a&gt;&lt;/li&gt;
      &lt;li&gt;&lt;a title="Con 1" class="sel" href="**URL_TO_EXTRACT_2**"&gt;1&lt;/a&gt;&lt;/li&gt;
      &lt;li class="first"&gt;Pàg&lt;/li&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;
&lt;more stuff from here&gt;
</code></pre>
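For reference, the error message is consistent with how the HTML::TokeParser documentation describes get_tag: for a start tag it returns a plain array reference of the form [$tag, $attr, $attrseq, $text], not an object, so attributes are read from the hash in slot 1 (as the inner loop already does with $tag->[1]{'href'}). The following is only a minimal sketch under that assumption. The helper name links_in_div, the sample HTML, and the depth counter that stops at the matching closing div are all hypothetical illustrations, not part of the original code.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample document, loosely modelled on the HTML in the question.
my $html = <<'HTML';
<div class="span-37">
  <div id="zone-extract">
    <li><a href="/a">2</a></li>
    <li><a href="/b">1</a></li>
  </div>
  <a href="/outside">x</a>
</div>
HTML

# Hypothetical helper: collect raw href values found inside the div
# whose id equals $wanted_id.
sub links_in_div {
    my ($html_ref, $wanted_id) = @_;
    my $stream = HTML::TokeParser->new($html_ref)
        or die "Cannot create parser";
    my @links;

    while ( my $div = $stream->get_tag('div') ) {
        # get_tag returns an array ref; the attribute hash is element 1.
        my $id = $div->[1]{id};
        next unless defined $id and $id eq $wanted_id;

        # Track nested divs so the scan stops at the matching </div>.
        my $depth = 1;
        while ( my $tag = $stream->get_tag('div', '/div', 'a') ) {
            if ( $tag->[0] eq 'div' ) {
                $depth++;
            }
            elsif ( $tag->[0] eq '/div' ) {
                last if --$depth == 0;
            }
            else {
                my $href = $tag->[1]{href};
                push @links, $href if defined $href and length $href;
            }
        }
        last;    # only the first matching div is wanted
    }
    return @links;
}

print "$_\n" for links_in_div(\$html, 'zone-extract');
```

The URI handling from the original subroutine (new_abs, scheme check, fragment removal) could then be applied to each returned href; it is left out here to keep the sketch focused on the div lookup.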
