StackOverflow2013

LWP::Simple - how to implement a loop into it [with live demo]
<p>Good evening, dear community!</p>
<p>I want to process multiple web pages, much as a web spider/crawler would. I have some pieces working, but now I need better spider logic. See the target URL <a href="http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50" rel="nofollow">http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50</a></p>
<p>This page has more than 6000 results. How do I get all of them? I use the module LWP::Simple, and I need a way to fetch all 6150 records.</p>
<p><strong>Attempt:</strong> Here are the first 5 page URLs:</p>
<pre><code>http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=0
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=50
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=100
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=150
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=200
</code></pre>
<p>We can see that the "s" parameter in the URL starts at 0 for page 1, then increases by 50 for each page thereafter. We can use this information to create a loop:</p>
<pre><code>my $i_first = "0";
my $i_last = "6100";
my $i_interval = "50";

for (my $i = $i_first; $i <= $i_last; $i += $i_interval) {
    my $pageurl = "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i";
    #process pageurl
}
</code></pre>
<p>tadmc (a very supportive user) has created a great script that outputs CSV-formatted results. I have built the loop into that code. (Note: I guess something has gone wrong. See the musings below, with the code snippets and the error messages.)</p>
<pre><code>#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;
use Text::CSV;

my $i_first = "0";
my $i_last = "6100";
my $i_interval = "50";

for (my $i = $i_first; $i <= $i_last; $i += $i_interval) {
    my $pageurl = "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i";
    #process pageurl
}

my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
$html =~ tr/\r//d;         # strip the carriage returns
$html =~ s/&nbsp;/ /g;     # expand the spaces

my $te = new HTML::TableExtract();
$te->parse($html);

my @cols = qw(
    rownum
    number
    name
    phone
    type
    website
);

my @fields = qw(
    rownum
    number
    name
    street
    postal
    town
    phone
    fax
    type
    website
);

my $csv = Text::CSV->new({ binary => 1 });

foreach my $ts ($te->table_states) {
    foreach my $row ($ts->rows) {

        trim leading/trailing whitespace from base fields
        s/^\s+//, s/\s+$// for @$row;

        load the fields into the hash using a "hash slice"
        my %h;
        @h{@cols} = @$row;

        derive some fields from base fields, again using a hash slice
        @h{qw/name street postal town/} = split /\n+/, $h{name};
        @h{qw/phone fax/} = split /\n+/, $h{phone};

        trim leading/trailing whitespace from derived fields
        s/^\s+//, s/\s+$// for @h{qw/name street postal town/};

        $csv->combine(@h{@fields});
        print $csv->string, "\n";
    }
}
</code></pre>
<p>There have been some issues. I have made a mistake, and I guess the error is here:</p>
<pre><code>for (my $i = $i_first; $i <= $i_last; $i += $i_interval) {
    my $pageurl = "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i";
    #process pageurl
}

my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
$html =~ tr/\r//d;         # strip the carriage returns
$html =~ s/&nbsp;/ /g;     # expand the spaces
</code></pre>
<p>I have written some kind of duplicate code: the loop builds each page URL but never fetches it, and the single fetch after the loop is still there. I need to leave out one part, this one here:</p>
<pre><code>my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
$html =~ tr/\r//d;         # strip the carriage returns
$html =~ s/&nbsp;/ /g;     # expand the spaces
</code></pre>
<p>See the results on the command line:</p>
<pre><code>martin@suse-linux:~> cd perl
martin@suse-linux:~/perl> perl bavaria_all_.pl
Possible unintended interpolation of %h in string at bavaria_all_.pl line 52.
Possible unintended interpolation of %h in string at bavaria_all_.pl line 52.
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 52.
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 52.
syntax error at bavaria_all_.pl line 59, near "/,"
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 59.
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 60.
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 60.
Substitution replacement not terminated at bavaria_all_.pl line 63.
martin@suse-linux:~/perl>
</code></pre>
<p>What do you think? I look forward to hearing from you.</p>
<p>By the way, see the code <strong>created by tadmc</strong>, without any improved spider logic. This runs very nicely, without any issue: it outputs nicely formatted CSV!</p>
<pre><code>#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;
use Text::CSV;

my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
$html =~ tr/\r//d;         # strip the carriage returns
$html =~ s/&nbsp;/ /g;     # expand the spaces

my $te = new HTML::TableExtract();
$te->parse($html);

my @cols = qw(
    rownum
    number
    name
    phone
    type
    website
);

my @fields = qw(
    rownum
    number
    name
    street
    postal
    town
    phone
    fax
    type
    website
);

my $csv = Text::CSV->new({ binary => 1 });

foreach my $ts ($te->table_states) {
    foreach my $row ($ts->rows) {

        # trim leading/trailing whitespace from base fields
        s/^\s+//, s/\s+$// for @$row;

        # load the fields into the hash using a "hash slice"
        my %h;
        @h{@cols} = @$row;

        # derive some fields from base fields, again using a hash slice
        @h{qw/name street postal town/} = split /\n+/, $h{name};
        @h{qw/phone fax/} = split /\n+/, $h{phone};

        # trim leading/trailing whitespace from derived fields
        s/^\s+//, s/\s+$// for @h{qw/name street postal town/};

        $csv->combine(@h{@fields});
        print $csv->string, "\n";
    }
}
</code></pre>
<p>Note: the code above runs nicely; it outputs CSV-formatted data.</p>
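<p>For illustration, here is a minimal sketch of one way the fetch could move inside the paging loop, so that every page is downloaded and processed instead of only one. The helper names <code>page_urls</code> and <code>crawl</code> are hypothetical and not part of LWP::Simple. The sketch uses only core Perl, so the fetching and parsing steps are passed in as code references.</p>

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Base URL taken from the question; the "s" parameter is the result offset.
my $base_url = 'http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50';

# Hypothetical helper: build the page URLs for offsets
# $first, $first + $step, ... up to and including $last.
sub page_urls {
    my ($first, $last, $step) = @_;
    my @urls;
    for (my $offset = $first; $offset <= $last; $offset += $step) {
        push @urls, "$base_url&s=$offset";
    }
    return @urls;
}

# Hypothetical helper: fetch each URL and hand the HTML to a callback.
# $fetch takes a URL and returns the HTML, or undef on failure.
# $process takes the HTML of one page. Returns the number of pages handled.
sub crawl {
    my ($fetch, $process, @urls) = @_;
    my $pages_ok = 0;
    for my $url (@urls) {
        my $html = $fetch->($url);
        unless (defined $html) {
            warn "could not fetch $url\n";
            next;    # skip this page and carry on with the next one
        }
        $process->($html);
        $pages_ok++;
    }
    return $pages_ok;
}

my @urls = page_urls(0, 6100, 50);
printf "%d pages, first: %s\n", scalar @urls, $urls[0];
```

<p>With <code>use LWP::Simple;</code> in place, a call such as <code>crawl(\&get, sub { my ($html) = @_; ... }, @urls)</code> would fetch the real pages. The body of the second callback is where the HTML::TableExtract and Text::CSV steps from the script above would go, applied once per page.</p>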
