StackOverflow2013


How does one instruct the ExtractingRequestHandler to parse only the body of a document?
<p>How can I instruct the extracting request handler to ignore metadata/headers etc. when it constructs the "content" of the document I send to it?</p>
<p>For example, I created an MS Word document containing just the word "SEARCHWORD" and nothing else. However, when I ship this doc to my solr index, its contents are mapped to my "body" field as follows:</p>
<pre><code><str name="body"> Last-Printed 2009-02-05T15:02:00Z Revision-Number 22 Comments stream_source_info myfile Last-Author Inigo Montoya Template Normal.dotm Page-Count 1 subject Application-Name Microsoft Macintosh Word Author Jesus Baggins Word-Count 2 xmpTPg:NPages 1 Edit-Time 108600000000 Creation-Date 2008-11-05T20:19:00Z stream_content_type application/octet-stream Character Count 14 stream_size 31232 stream_name /Applications/MAMP/tmp/php/phpHCIg7y Some Company Content-Type application/msword Keywords Last-Save-Date 2012-05-01T18:55:00Z SEARCHWORD </str>
</code></pre>
<p>All I want is the body of the document, in this case the word "SEARCHWORD."</p>
<p>For further reference, here's my extraction handler:</p>
<pre><code><requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.content">body</str>
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>
</code></pre>
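<p>To see which parts of the output come from the document text and which come from metadata, one option is to call the same handler with Solr Cell's <code>extractOnly=true</code> parameter, which returns what Tika extracted without indexing anything. The sketch below only builds such a request URL. The base URL and the helper name <code>build_extract_url</code> are assumptions for illustration and are not part of my setup; the other parameters mirror the handler defaults shown above.</p>

```python
from urllib.parse import urlencode


def build_extract_url(base_url, extract_only=True):
    """Build a URL for Solr's /update/extract handler.

    base_url is a hypothetical Solr location (assumption); the fmap.content,
    lowernames, and uprefix values copy the handler defaults in the question.
    """
    params = {
        "fmap.content": "body",
        "lowernames": "true",
        "uprefix": "ignored_",
        # extractOnly=true asks Solr Cell to return the extracted content
        # and metadata instead of indexing the document.
        "extractOnly": "true" if extract_only else "false",
        "wt": "json",
    }
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


if __name__ == "__main__":
    # Assumed local Solr instance; adjust host, port, and core as needed.
    print(build_extract_url("http://localhost:8983/solr"))
```

<p>Posting the Word file to the resulting URL (for example with <code>curl -F</code>) should show the extracted content and the metadata as separate parts of the response, which helps narrow down where the extra fields in "body" come from.</p>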
