Getting "No URLs to fetch" error on Nutch, even though there are URLs to fetch
I am still getting used to Nutch. I managed to get a test crawl going over `nutch.apache.org` using `bin/nutch crawl urls -dir crawl -depth 6 -topN 10`, as well as indexing it to Solr using `bin/nutch crawl urls -solr http://<domain>:<port>/solr/core1/ -depth 4 -topN 7`.

Leaving aside that it times out on my own site, I can't seem to get it to crawl again, or to crawl any other sites (e.g. wiki.apache.org). I have deleted all of the crawl directories in the Nutch home directory, and I still get the following error stating that there are no more URLs to crawl:

```
<user>@<domain>:/usr/share/nutch$ sudo sh nutch-test.sh
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 6
solrUrl=null
topN = 10
Injector: starting at 2013-07-03 15:56:47
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 1
Injector: total number of urls injected after normalization and filtering: 0
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-07-03 15:56:50, elapsed: 00:00:03
Generator: starting at 2013-07-03 15:56:50
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 10
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl
```

My `urls/seed.txt` file contains `http://nutch.apache.org/`.

My `regex-urlfilter.txt` contains `+^http://([a-z0-9\-A-Z]*\.)*nutch.apache.org//([a-z0-9\-A-Z]*\/)*`.

I have also increased `-depth` and `-topN` to indicate that there is more to index, but it always gives the error after the first crawl. How do I reset it so that it crawls again? Is there some cache of URLs that needs to be cleared out somewhere in Nutch?

**UPDATE**: It seems the problem with our own site was that I was not using `www`; the domain does not resolve without `www`. A `ping` confirms that www.ourdomain.org does resolve.

But I have put this into the necessary files and there is still a problem. Primarily, `Injector: total number of urls rejected by filters: 1` looks like the problem across the board, but it did not appear on the first crawl. Why is the URL being rejected, and which filter is rejecting it? It should not be.
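For comparison, this is roughly the kind of `regex-urlfilter.txt` I understand the Nutch tutorial to suggest when restricting a crawl to a single host. It is only a sketch (the skip rules and the host pattern are my reading of the docs, not my actual file); note that the accept rule here has a single slash after the hostname, whereas my current rule above has `//`:

```
# regex-urlfilter.txt (illustrative sketch, not my actual config)

# skip file:, ftp:, and mailto: URLs
-^(file|ftp|mailto):

# skip URLs containing characters that usually indicate queries or sessions
-[?*!@=]

# accept everything under nutch.apache.org (single slash after the host)
+^http://([a-z0-9]*\.)*nutch\.apache\.org/
```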