<p>The problem with using Solr for indexing is that it builds a straight text index (which may be fine if you are only crawling an internal website and don't care about 'pagerank').</p>
<p>Nutch, however, will give you a much better index because it does use pagerank.</p>
<p><strong>NutchWAX</strong></p>
<p>If, however, you are dead set on using Heritrix <em>and</em> would like pagerank-based search results, you could use <a href="http://archive-access.sourceforge.net/projects/nutchwax/" rel="nofollow noreferrer">NutchWAX</a> (Nutch Web Archive eXtensions) to index Heritrix's output (that's what the makers of Heritrix are doing).</p>
<p>NutchWAX is intended for web archives but can also be used to create a search engine for the live web (in fact that is easier, as you aren't dragging years' worth of data along during each rebuild of the index).</p>
<p><strong>Solr</strong></p>
<p>If you do want to use Heritrix+Solr to create a search website, you should probably replace the "ARCWriter" processor in Heritrix with a custom processor that submits the contents of each page to Solr.</p>
<p>The Solr end is just an XML file posted via HTTP and is dead simple.</p>
<p>The Heritrix end is a little more complicated, but the <a href="http://crawler.archive.org/articles/developer_manual/processor.html" rel="nofollow noreferrer">Developer's Manual</a> will get you started on writing a Processor for Heritrix 1.x. If you are using 3.x (as yet unstable) or the discontinued 2.x, you'll need to do a little more legwork, as the documentation isn't there yet.</p>
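<p>To illustrate the Solr end, here is a minimal Python sketch of building an <code>&lt;add&gt;</code> update message and posting it over HTTP. It is not the Heritrix processor itself (that would be written in Java), and the field names <code>id</code>, <code>url</code> and <code>content</code>, the helper names, and the update URL are all assumptions for illustration; the fields must match whatever your Solr schema defines.</p>

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_solr_add_xml(doc_id, url, content):
    """Build a Solr <add> update message for a single crawled page."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # Assumed field names; they must exist in your Solr schema.
    for name, value in (("id", doc_id), ("url", url), ("content", content)):
        field = ET.SubElement(doc, "field", name=name)
        field.text = value  # ElementTree escapes &, < and > on serialization
    return ET.tostring(add, encoding="utf-8")


def post_to_solr(xml_bytes, update_url="http://localhost:8983/solr/update"):
    """POST an update message to Solr.

    update_url is a hypothetical local endpoint; adjust it to your install.
    """
    request = urllib.request.Request(
        update_url,
        data=xml_bytes,
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

<p>A custom processor would do the equivalent of <code>post_to_solr(build_solr_add_xml(...))</code> for each fetched page. Added documents only become searchable after a commit, which can be sent the same way by posting <code>&lt;commit/&gt;</code> to the update URL.</p>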
 
