Web crawler: Using Perl's MozRepl module to deal with JavaScript
<p>I am trying to save a couple of web pages with a web crawler. Usually I prefer doing this with Perl's <code>WWW::Mechanize</code> module. However, as far as I can tell, the site I am trying to crawl relies heavily on JavaScript, which seems hard to avoid. Therefore I looked into the following Perl modules:</p>
<ul> <li><a href="http://search.cpan.org/~corion/WWW-Mechanize-Firefox-0.55/lib/WWW/Mechanize/Firefox.pm" rel="nofollow">WWW::Mechanize::Firefox</a></li> <li><a href="http://search.cpan.org/~zigorou/MozRepl-0.06/lib/MozRepl.pm" rel="nofollow">MozRepl</a></li> <li><a href="http://search.cpan.org/~corion/MozRepl-RemoteObject-0.28/lib/MozRepl/RemoteObject.pm" rel="nofollow">MozRepl::RemoteObject</a></li> </ul>
<p>The Firefox <a href="https://github.com/bard/mozrepl/wiki" rel="nofollow">MozRepl extension</a> itself works perfectly. In theory, I can use the terminal to navigate the web site just as shown in the developer's tutorial. However, I do not know JavaScript and am therefore having a hard time using the modules properly.</p>
<p>Here is the source I would like to start from: <a href="http://www.morganstanley.com/eqr/disclosures/webapp/coverage" rel="nofollow">Morgan Stanley</a></p>
<p>For a couple of the firms listed beneath 'Companies - as of 10/14/2011' I would like to save their respective pages. For example, clicking on the first listed company (i.e. '1-800-Flowers.com, Inc') calls a JavaScript function with two arguments, <code>dtxt('FLWS.O','2011-10-14')</code>, which produces the desired new page. This is the page I would like to save locally.</p>
<p>With Perl's <code>MozRepl</code> module I thought about something like this:</p>
<pre><code>use strict;
use warnings;
use MozRepl;

my $repl = MozRepl-&gt;new;
$repl-&gt;setup;

# Open the coverage page in a new window
$repl-&gt;execute('window.open("http://www.morganstanley.com/eqr/disclosures/webapp/coverage")');

# Switch the REPL context to the page content
$repl-&gt;repl_enter({ source =&gt; "content" });

# Call the page's own function to load the company page
$repl-&gt;execute('dtxt("FLWS.O", "2011-10-14")');
</code></pre>
<p>Now I would like to save the resulting HTML page.</p>
<p>So again, the code I would like to write should visit the HTML page of each of a couple of firms and simply save it. (Here are three example firms: MMM.N, FLWS.O, SSRX.O)</p>
<ol> <li>Is it correct that I cannot get around the page's JavaScript functions and therefore cannot use <code>WWW::Mechanize</code>?</li> <li>Following on from question 1, are the Perl modules mentioned above a plausible approach to take?</li> <li>Finally, if the answer to the first two questions is yes, it would be really nice if you could help me with the actual coding. In the code above, the essential missing part is a 'save' command. (Maybe using Firefox's <code>saveDocument</code> function?)</li> </ol>