Nutch and HBase for production
    <p>I am currently using Nutch 2.2.1 and HBase 0.90.4. I expect to crawl around 300K URLs from about 10 seed URLs, and I have already generated that many with Nutch 1.6. Since I want to manipulate the data, I chose the Nutch 2.2.1 + HBase route. But I get all sorts of weird errors and the crawl doesn't seem to progress.</p> <p>The errors include:</p> <ol> <li><p><strong>zookeeper.ClientCnxn - Session for server null, unexpected error, closing socket connection and attempting reconnect.</strong> - This is the one I get most often.</p></li> <li><p><strong>bin/crawl: line 164: killed</strong> - I get this error during the fetch step, and the crawl is killed all of a sudden.</p></li> <li><p>RSS parse error</p></li> </ol> <p>I am using the all-in-one crawl command - <code>bin/crawl urls 1 http://localhost:8983/solr/ 10</code>, whose arguments are:</p> <pre><code>&lt;crawl&gt; &lt;seed-dir&gt; &lt;crawl-id&gt; &lt;solr-url&gt; &lt;number of rounds&gt; </code></pre> <p>Please suggest where I am going wrong. I have Nutch 2.2.1 <a href="http://wiki.apache.org/nutch/Nutch2Tutorial" rel="nofollow">installed</a> and HBase (standalone) installed as per the <a href="http://hbase.apache.org/book/quickstart.html" rel="nofollow">Quick start guide</a> recommended on the Nutch site. I am not sure whether the HBase 0.90.4 standalone setup from the <strong>Quick</strong> start guide is sufficient to crawl 300K URLs.</p> <hr> <p>Edit #1: RSS parse error - log information</p> <p><strong>Error tika.TikaParser - Error parsing <a href="http://www.###.###.##/###/abc.xml">http://www.###.###.##/###/abc.xml</a> org.apache.tika.exception.TikaException: RSS parse error</strong></p>