Hadoop to create an Index and Add() it to distributed SOLR... is this possible? Should I use Nutch? ..Cloudera?
<p>Can I use a MapReduce framework to create an index and somehow add it to a distributed Solr?</p> <p>I have a burst of information (logfiles and documents) that will be transported over the internet and stored in my datacenter (or Amazon). It needs to be parsed, indexed, and finally searchable by our replicated Solr installation. </p> <p>Here is my proposed architecture:</p> <ul> <li>Use a MapReduce framework (Cloudera, Hadoop, Nutch, even <a href="http://research.microsoft.com/en-us/projects/dryadlinq/default.aspx" rel="nofollow">DryadLinq</a>) to prepare those documents for indexing</li> <li>Index those documents into a Lucene.NET / Lucene (Java) compatible file format</li> <li>Deploy that file to all my Solr instances </li> <li>Activate that replicated index</li> </ul> <p>If the above is possible, I need to choose a MapReduce framework. Since Cloudera is vendor-supported and has a ton of patches not included in the Hadoop install, I think it may be worth looking at.</p> <p>Once I choose the MapReduce framework, I need to tokenize the documents (PDF, DOCX, DOC, OLE, etc.), index them, copy the index to my Solr instances, and somehow "activate" them so they are searchable in the running instance. I believe this methodology is better than submitting documents via the REST interface to Solr. </p> <p>The reason I bring .NET into the picture is that we are mostly a .NET shop. The only Unix / Java we will have is Solr, and we have a front end that leverages the REST interface via SolrNet. </p> <blockquote> <p>Based on your experience, how does this architecture look? Do you see any issues/problems? What advice can you give?</p> </blockquote> <p>What should I <em>not</em> do if I want to keep faceted search? After reading the Nutch documentation, I believe it said that it does not do faceting, but I may not have enough background in this software to understand what it's saying.</p>
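<p>To make the first two steps of the proposed architecture concrete, here is a rough, in-memory sketch of the map and reduce phases that would build one index shard per Solr instance. It is only a toy stand-in under stated assumptions: it uses neither Hadoop nor Lucene, the "index" is a plain inverted index held in dictionaries rather than a Lucene-compatible file, and every name in it (<code>shard_for</code>, <code>map_phase</code>, <code>reduce_phase</code>, <code>build_shards</code>) is hypothetical.</p>

```python
import hashlib
from collections import defaultdict


def shard_for(doc_id: str, num_shards: int) -> int:
    """Pick a shard deterministically from the document id (hypothetical scheme)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def map_phase(docs, num_shards):
    """Map step: emit ((shard, term), doc_id) for each distinct term in each document."""
    for doc_id, text in docs.items():
        shard = shard_for(doc_id, num_shards)
        # Naive whitespace tokenizer; a real pipeline would first extract
        # text from PDF/DOC files and use a proper analyzer.
        for term in set(text.lower().split()):
            yield (shard, term), doc_id


def reduce_phase(pairs):
    """Reduce step: group the emitted pairs into one inverted index per shard."""
    shards = defaultdict(lambda: defaultdict(list))
    for (shard, term), doc_id in pairs:
        shards[shard][term].append(doc_id)
    # Sort postings so the output does not depend on emission order.
    for index in shards.values():
        for postings in index.values():
            postings.sort()
    return shards


def build_shards(docs, num_shards):
    """Run both phases in-process; a MapReduce framework would distribute them."""
    return reduce_phase(map_phase(docs, num_shards))
```

<p>In this sketch each shard's dictionary plays the role of the file that would be copied to one Solr instance. Because every document lands in exactly one shard, a search has to query all shards and merge the results.</p>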
 
