<p>Generally, what you've described is almost exactly how Nutch works. Nutch is a crawling, indexing, index merging and query answering toolkit that's based on Hadoop core.</p>
<p>You shouldn't confuse Cloudera, Hadoop, Nutch and Lucene with one another. You'll most likely end up using all of them:</p>
<ul>
<li><strong>Nutch</strong> is the name of the indexing / query answering machinery (comparable to Solr).</li>
<li>Nutch itself runs on a <strong>Hadoop</strong> cluster (which relies heavily on its own distributed file system, HDFS).</li>
<li>Nutch stores its indexes in the <strong>Lucene</strong> format.</li>
<li>Nutch includes a query answering frontend, which you can use, or you can attach a <strong>Solr</strong> frontend and use the Lucene indexes from there.</li>
<li>Finally, <strong>Cloudera Hadoop Distribution</strong> (or CDH) is just a Hadoop distribution with several dozen patches applied to it, to make it more stable and to backport some useful features from development branches. You'd most likely want to use it, unless you have a reason not to (for example, if you want a bleeding edge Hadoop 0.22 trunk).</li>
</ul>
<p>Generally, if you're just looking for a ready-made crawling / search engine solution, then Nutch is the way to go. Nutch already includes a lot of plugins to parse and index various types of documents, including MS Word documents, PDFs, etc.</p>
<p>I personally don't see much point in using .NET technologies here, but if you feel comfortable with them, you can build the front-ends in .NET. However, working with Unix technologies might feel fairly awkward for a Windows-centric team, so if I were managing such a project, I'd consider alternatives, especially if your crawling &amp; indexing task is limited (i.e. you don't need to crawl the whole internet).</p>
 
