<p>I am also having a problem with this, but I am using redis as a datastore.</p>

<p>this is my crawler:</p>

<pre><code>require "rubygems"
require "anemone"

urls = File.open("urls.csv")
opts = {discard_page_bodies: true, skip_query_strings: true, depth_limit:2000, read_timeout: 10}

File.open("results.csv", "a") do |result_file|

  while row = urls.gets
    row_ = row.strip.split(',')
    if row_[1].start_with?("http://")
      url = row_[1]
    else
      url = "http://#{row_[1]}"
    end
    Anemone.crawl(url, options = opts) do |anemone|
      anemone.storage = Anemone::Storage.Redis
      puts "crawling #{url}"
      anemone.on_every_page do |page|
        next if page.body == nil
        if page.body.downcase.include?("sometext")
          puts "found one at #{url}"
          result_file.puts "#{row_[0]},#{row_[1]}"
          next
        end # end if
      end # end on_every_page
    end # end crawl
  end # end while

  # we're done
  puts "We're done."
end # end File.open
</code></pre>

<p>I applied the patch from <a href="https://github.com/wordtracker/anemone/commit/487601ba6e4012320da4983e38ba94011097e6c9" rel="nofollow">here</a> to my core.rb file in the anemone gem:</p>

<pre><code>35       # Prevent page_queue from using excessive RAM. Can indirectly limit rate of crawling. You'll additionally want to use discard_page_bodies and/or a non-memory 'storage' option
36       :max_page_queue_size =&gt; 100,
</code></pre>

<p>...</p>

<p>(The following used to be on line 155)</p>

<pre><code>157       page_queue = SizedQueue.new(@opts[:max_page_queue_size])
</code></pre>

<p>and I have an hourly cron job doing:</p>

<pre><code>#!/usr/bin/env python
import redis
r = redis.Redis()
r.flushall()
</code></pre>

<p>to try and keep redis' memory usage down. I'm restarting a giant crawl now, so we'll see how it goes!</p>

<p>I'll report back with results...</p>
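<p>For anyone wondering why swapping in a SizedQueue should help: a rough sketch of the idea, separate from Anemone itself. The names MAX_PAGE_QUEUE_SIZE, producer and processed are made up for illustration and are not Anemone internals. A SizedQueue makes push block once the queue is full, so the producer cannot run ahead of the consumer and pile pages up in RAM.</p>

```ruby
# Illustrative only: a stand-in for a crawler's page queue, not Anemone's code.
MAX_PAGE_QUEUE_SIZE = 3 # hypothetical limit, analogous to :max_page_queue_size

page_queue = SizedQueue.new(MAX_PAGE_QUEUE_SIZE)
peak = 0
lock = Mutex.new

# Producer: simulates fetcher threads handing pages to the main loop.
producer = Thread.new do
  10.times do |i|
    page_queue.push("page-#{i}") # blocks while the queue is already full
    lock.synchronize { peak = [peak, page_queue.size].max }
  end
  page_queue.push(:done) # sentinel telling the consumer to stop
end

# Consumer: simulates slow page processing.
processed = []
while (page = page_queue.pop) != :done
  processed << page
  sleep 0.01
end
producer.join

puts "processed #{processed.size} pages, peak queue size #{peak}"
```

<p>With a plain Queue the producer would finish all its pushes immediately and the queue would grow as large as the backlog; here it should never hold more than MAX_PAGE_QUEUE_SIZE items, at the cost of slowing the producer down.</p>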