Getting "No URLs to fetch" error on Nutch, even though there are URLs to fetch
I am still getting used to Nutch. I managed to get a test crawl going over `nutch.apache.org` using `bin/nutch crawl urls -dir crawl -depth 6 -topN 10`, as well as indexing it to Solr using `bin/nutch crawl urls -solr http://<domain>:<port>/solr/core1/ -depth 4 -topN 7`.

Leaving aside that it times out on my own site, I can't seem to get it to crawl again, or to crawl any other sites (e.g. wiki.apache.org). I have deleted all of the crawl directories in the Nutch home directory, and I still get the following error stating that there are no more URLs to crawl:

```
<user>@<domain>:/usr/share/nutch$ sudo sh nutch-test.sh
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 6
solrUrl=null
topN = 10
Injector: starting at 2013-07-03 15:56:47
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 1
Injector: total number of urls injected after normalization and filtering: 0
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-07-03 15:56:50, elapsed: 00:00:03
Generator: starting at 2013-07-03 15:56:50
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 10
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl
```

My `urls/seed.txt` file contains `http://nutch.apache.org/`.

My `regex-urlfilter.txt` contains `+^http://([a-z0-9\-A-Z]*\.)*nutch.apache.org//([a-z0-9\-A-Z]*\/)*`.

I have also increased `-depth` and `-topN` to indicate that there is more to index, but it always gives the error after the first crawl. How do I reset it so that it crawls again? Is there some cache of URLs that needs to be cleared out somewhere in Nutch?

**UPDATE**: It seems the problem with our own site was that I was not using `www`; the domain does not resolve without `www`. A `ping` confirms that www.ourdomain.org does resolve.

But I have put this into the necessary files and there is still a problem. Primarily, `Injector: total number of urls rejected by filters: 1` looks like the problem across the board, but it did not appear on the first crawl. Why is the URL being rejected, and which filter is rejecting it? It should not be.
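For comparison, this is roughly the kind of `regex-urlfilter.txt` I understand the Nutch tutorial to suggest when restricting a crawl to a single host. It is only a sketch (the skip rules and the host pattern are my reading of the docs, not my actual file); note that the accept rule here has a single slash after the hostname, whereas my current rule above has `//`:

```
# regex-urlfilter.txt (illustrative sketch, not my actual config)

# skip file:, ftp:, and mailto: URLs
-^(file|ftp|mailto):

# skip URLs containing characters that usually indicate queries or sessions
-[?*!@=]

# accept everything under nutch.apache.org (single slash after the host)
+^http://([a-z0-9]*\.)*nutch\.apache\.org/
```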