Are there any Apache configurations to enhance web crawling performance?
I have a PHP web crawler which, when run on localhost, often freezes after a few pages, leaving my web browser showing a loading sign and nothing more.

I have checked through my code and there could be an error in it, but after looking at it for the last few hours I am ready to explore other possibilities.

While my scraper is running, it dumps information as different processes begin and end. I also frequently call `flush();` to ensure the browser is showing the most up-to-date message (this gives the browser a console-type look).

The reason I am looking into Apache configuration now is that my program doesn't always freeze in the same place. Sometimes it freezes when it is searching for the `<a>` tags for new URLs to add to the queue; other times it freezes while downloading the XHTML data itself, at this point:

```php
private function _getXhtml()
{
    $curl = curl_init();
    if (!$curl) {
        // curl_error() needs a valid handle, so it can't be used here
        throw new Exception('Unable to init curl.');
    }

    curl_setopt($curl, CURLOPT_URL, $this->_urlCurrent);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);

    // Faking user agent
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)');

    $xhtml = curl_exec($curl);

    if (!$xhtml) {
        $xhtml = null;
        echo 'PROBLEM ' . $this->_urlCurrent . '<br />';
        //throw new Exception('Unable to read XHTML. ' . curl_error($curl));
    }

    curl_close($curl);

    return $xhtml;
}
```

Besides all of this, I am wondering whether there are any amendments to Apache's configuration file or to php.ini that I can make to improve the localhost environment for web scraping.

Any help would be appreciated.

**UPDATE:**

I believe I have narrowed it down to Zend_Dom_Query. This would explain why my application crashes at different stages (sometimes when it's grabbing an href for the crawling list, and other times when it's looking for certain things within the page to 'harvest').

Here is an example of my output.

**Here, the application is crashing on the first page, while getting a URL:**

```
string(25) "Scraping page number 0..."
string(9) "Mining..."
string(15) "Getting <a>...."
string(24) "Finished getting <a>...."
string(20) "Getting <a href>...."
string(43) "Mining page http://www.a-site.co.uk/ ..."
string(17) "New page found..."
string(18) "Page confirmed...."
string(29) "Finished Getting <a href>...."
string(20) "Getting <a href>...."
string(43) "Mining page http://www.a-site.co.uk/ ..."
string(29) "Finished Getting <a href>...."
string(20) "Getting <a href>...."
```

**And here, the application is failing while extracting an element:**

```
string(25) "Scraping page number 5..."
string(9) "Mining..."
//This bit loops for around 70 URLs
string(15) "Getting <a>...."
string(24) "Finished getting <a>...."
string(20) "Getting <a href>...."
string(48) "Mining page http://www.a-site.org ..."
string(29) "Finished Getting <a href>...."
//end loop
string(70) "Harvesting http://www.a.site.org/a-url-path/..."
string(19) "Harvesting html element..."
```
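Since the update points at Zend_Dom_Query, one thing I can do to confirm it is to parse the suspect page directly: Zend_Dom_Query uses DOMDocument under the hood, and badly malformed real-world markup can generate a flood of libxml warnings. A minimal sketch, assuming `$xhtml` is the string `_getXhtml()` returned for the failing URL:

```php
// Sketch: load the failing page with DOMDocument directly (the same
// tolerant parser Zend_Dom_Query uses) and inspect the libxml errors,
// to see whether the markup itself is what kills the run.
libxml_use_internal_errors(true);   // buffer parse errors instead of printing them

$doc = new DOMDocument();
$doc->loadHTML($xhtml);             // $xhtml: the raw page from _getXhtml()

foreach (libxml_get_errors() as $error) {
    echo trim($error->message) . ' (line ' . $error->line . ")<br />\n";
}
libxml_clear_errors();
```

Depending on the Zend Framework version, `Zend_Dom_Query` may also expose the parse errors it collected itself via `getDocumentErrors()`.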
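Separately, the fetch in `_getXhtml()` sets no timeouts, so a single slow or unresponsive server can make `curl_exec()` block indefinitely, which would look exactly like a freeze at the download stage. A sketch of the two options I could add (the 10/30 second values are placeholders, not tuned numbers):

```php
// Sketch: bound every request so curl_exec() can never block forever.
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10); // give up connecting after 10 seconds
curl_setopt($curl, CURLOPT_TIMEOUT, 30);        // abort the whole transfer after 30 seconds
```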
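As for php.ini amendments, the settings that most often bite a long-running script are `max_execution_time` and `memory_limit`. A sketch of their runtime equivalents, with illustrative values only, that could go at the top of the crawler:

```php
// Sketch: runtime equivalents of the php.ini limits that commonly
// stop long-running scripts. Values are illustrative, not tuned.
set_time_limit(0);               // lift the execution time cap (max_execution_time)
ini_set('memory_limit', '256M'); // raise the memory ceiling for large pages/queues
ignore_user_abort(true);         // keep running even if the browser disconnects
```

If one of these limits were the cause, PHP would normally print a fatal error rather than freeze silently, but that message can be invisible if the output never reaches the browser, which ties into the buffering point below.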
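Finally, one thing worth ruling out on the output side: if `output_buffering` is enabled in php.ini, `flush()` alone does not push anything to the browser, so the page can look frozen while the script is actually still working. A minimal sketch of a helper that flushes both layers (the helper name is just for illustration):

```php
// Sketch: flush PHP's own output buffer (if one is active) as well as
// the server buffer, so progress messages reach the browser immediately.
function logMessage($message)
{
    var_dump($message);
    if (ob_get_level() > 0) {
        ob_flush(); // flush PHP's output buffer first
    }
    flush();        // then ask the SAPI / Apache to send it to the client
}
```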
 
