Why does Nutch think it has already parsed all segments when it hasn't?
<p>I'm using Nutch 1.6 to crawl some forums and index them with Solr 1.6.2. I ran a test query on Solr and was surprised that there were only a few results. I was worried that there was a problem either with Nutch's parsing of the pages or with Solr's indexing. After snooping around I found out that Nutch hasn't parsed a lot of the pages it has retrieved:</p>
<pre><code>bin/nutch readseg -list -dir crawl-mothering2/segments/
NAME            GENERATED  FETCHED  PARSED
20130228001531  23         27       9
20130228003940  1430       1434     661
20130228001829  202        206      105
20130228061337  1068       1090     475
20130228091009  1          2        0
20130228085956  34         34       25
20130228090348  44         45       34
20130228090851  7          7        6
20130228080438  364        374      192
20130228030933  1774       1795     903
20130228084205  168        169      63
</code></pre>
<p>But when I try to parse the segments, I get this:</p>
<pre><code>bin/nutch parse crawl-mothering2/segments/*
ParseSegment: starting at 2013-03-21 00:20:43
ParseSegment: segment: crawl-mothering2/segments/20130228001531
Exception in thread "main" java.io.IOException: Segment already parsed!
    at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:89)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:889)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
    at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:209)
    at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:243)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:216)
</code></pre>
<p>What gives?</p>
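To put a number on the gap between fetched and parsed pages, the segment listing in the question can be tallied with a few lines of Python. This is only a rough sketch: `parse_listing` and `summarise` are hypothetical helper names, not part of Nutch. It also assumes the four-column layout shown in the question (NAME, GENERATED, FETCHED, PARSED), which other Nutch versions may not print.

```python
# Hypothetical helper: tally the `bin/nutch readseg -list` output from the question.
# Assumption: four whitespace-separated columns (NAME GENERATED FETCHED PARSED).

LISTING = """\
NAME            GENERATED  FETCHED  PARSED
20130228001531  23         27       9
20130228003940  1430       1434     661
20130228001829  202        206      105
20130228061337  1068       1090     475
20130228091009  1          2        0
20130228085956  34         34       25
20130228090348  44         45       34
20130228090851  7          7        6
20130228080438  364        374      192
20130228030933  1774       1795     903
20130228084205  168        169      63
"""


def parse_listing(text):
    """Return (name, generated, fetched, parsed) tuples for each segment row."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the header row (its first field is not numeric).
        if len(fields) != 4 or not fields[0].isdigit():
            continue
        name, generated, fetched, parsed = fields
        rows.append((name, int(generated), int(fetched), int(parsed)))
    return rows


def summarise(rows):
    """Return total fetched and total parsed counts across all segments."""
    total_fetched = sum(row[2] for row in rows)
    total_parsed = sum(row[3] for row in rows)
    return total_fetched, total_parsed


if __name__ == "__main__":
    rows = parse_listing(LISTING)
    for name, _generated, fetched, parsed in rows:
        share = parsed / fetched if fetched else 0.0
        print(f"{name}  fetched={fetched:5d}  parsed={parsed:5d}  ({share:.0%})")
    total_fetched, total_parsed = summarise(rows)
    print(f"total: {total_parsed} of {total_fetched} fetched pages parsed")
```

The per-segment share makes it easy to see whether the shortfall is spread evenly across segments or concentrated in a few of them.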
 
