<p>There are two ways to index binary documents in Solr, both of which rely on Tika:</p>
<ol>
<li>Use Tika on the client side to extract the text from the binary files, then index the extracted text in Solr yourself.</li>
<li>Use the <a href="http://wiki.apache.org/solr/ExtractingRequestHandler" rel="nofollow">ExtractingRequestHandler</a>, which lets you upload the binary file to the Solr server so that Solr does the extraction for you. With this approach Tika is not required on the client side.</li>
</ol>
<p>In both cases you need to have the binary documents on the client side. While crawling, Nutch should be able to download binary files, use Tika to extract their text content, and then index that text in Solr as it normally does with text documents. Nutch already uses <a href="https://issues.apache.org/jira/browse/NUTCH-766" rel="nofollow">Tika</a>, so I guess it's just a matter of choosing which document types to index: edit the regex-urlfilter.txt Nutch config file and remove the file extensions you want to index from the following rule.</p>
<pre><code># skip some suffixes
-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
</code></pre>
<p>This way you would be using the first option I mentioned. You then need to enable the Tika plugin in your nutch-site.xml; have a look at <a href="http://lucene.472066.n3.nabble.com/Using-Tika-to-crawl-doc-pdf-etc-tt603220.html#a603223" rel="nofollow">this discussion</a> from the Nutch mailing list.</p>
<p>This should work in theory; let me know if it doesn't.</p>
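<p>To see what removing extensions from the skip rule does, here is a minimal Python sketch. It is not Nutch code: it only mimics an exclusion rule by assuming the regex is searched for anywhere in the URL, and the helper names <code>build_skip_rule</code> and <code>is_skipped</code> and the example.com URLs are made up for illustration.</p>

```python
import re

# The suffix list from the skip rule quoted above, without the leading "-"
# (in regex-urlfilter.txt the "-" marks the line as an exclusion rule).
SKIP_SUFFIXES = (
    "swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|"
    "WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|"
    "png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|"
    "jpeg|JPEG|bmp|BMP"
)


def build_skip_rule(suffixes, keep):
    """Hypothetical helper: rebuild the skip regex without the extensions in `keep`.

    Matching against `keep` is case-insensitive, so keeping "pdf" drops
    both "pdf" and "PDF" from the rule.
    """
    keep_lower = {ext.lower() for ext in keep}
    remaining = [s for s in suffixes.split("|") if s.lower() not in keep_lower]
    return r"\.(" + "|".join(remaining) + r")$"


def is_skipped(url, rule):
    """Hypothetical helper: True if the exclusion rule matches the URL.

    Assumption: the rule is applied as a search anywhere in the URL.
    """
    return re.search(rule, url) is not None


original_rule = build_skip_rule(SKIP_SUFFIXES, keep=[])
edited_rule = build_skip_rule(SKIP_SUFFIXES, keep=["pdf", "doc"])

for url in ("http://example.com/report.pdf", "http://example.com/logo.png"):
    print(url, "skipped before:", is_skipped(url, original_rule),
          "skipped after:", is_skipped(url, edited_rule))

# The line you would put back into regex-urlfilter.txt:
print("-" + edited_rule)
```

<p>With <code>pdf</code> and <code>doc</code> removed, PDF and Word URLs are no longer rejected by this rule, while images and the other listed types still are. Whether those documents are then parsed and indexed still depends on the Tika plugin being enabled in nutch-site.xml.</p>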
 
