How to index a PDF's content with SolrJ?
<p>I'm trying to index a few PDF documents using SolrJ as described at <a href="http://wiki.apache.org/solr/ContentStreamUpdateRequestExample" rel="nofollow">http://wiki.apache.org/solr/ContentStreamUpdateRequestExample</a>. The code is below:</p>
<pre><code>import static org.apache.solr.handler.extraction.ExtractingParams.LITERALS_PREFIX;
import static org.apache.solr.handler.extraction.ExtractingParams.MAP_PREFIX;
import static org.apache.solr.handler.extraction.ExtractingParams.UNKNOWN_FIELD_PREFIX;

import java.io.File;
import java.io.IOException;
import java.util.Map.Entry;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.NamedList;

...

public static void indexFilesSolrCell(String fileName) throws IOException, SolrServerException {
    String urlString = "http://localhost:8080/solr";
    SolrServer server = new CommonsHttpSolrServer(urlString);

    // Send the file to the extracting request handler
    ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
    up.addFile(new File(fileName));

    // Use the file name (without its directory) as the document id
    String id = fileName.substring(fileName.lastIndexOf('/') + 1);
    System.out.println(id);

    up.setParam(LITERALS_PREFIX + "id", id);
    // this field doesn't exist in schema.xml, it'll be created as attr_location
    up.setParam(LITERALS_PREFIX + "location", fileName);
    up.setParam(UNKNOWN_FIELD_PREFIX, "attr_");
    up.setParam(MAP_PREFIX + "content", "attr_content");
    up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

    // Print the response returned by the server
    NamedList&lt;Object&gt; request = server.request(up);
    for (Entry&lt;String, Object&gt; entry : request) {
        System.out.println(entry.getKey());
        System.out.println(entry.getValue());
    }
}
</code></pre>
<p>Unfortunately, when I query for *:* I get the list of indexed documents, but the content field is empty. How can I change the code above to also extract the document's content?</p>
<p>Below is the XML fragment that describes <a href="http://www.objectmentor.com/resources/articles/lsp.pdf" rel="nofollow">this document</a>:</p>
<pre><code>&lt;doc&gt;
  &lt;arr name="attr_content"&gt;
    &lt;str&gt; &lt;/str&gt;
  &lt;/arr&gt;
  &lt;arr name="attr_location"&gt;
    &lt;str&gt;/home/alex/Documents/lsp.pdf&lt;/str&gt;
  &lt;/arr&gt;
  &lt;arr name="attr_meta"&gt;
    &lt;str&gt;stream_size&lt;/str&gt;
    &lt;str&gt;31203&lt;/str&gt;
    &lt;str&gt;Content-Type&lt;/str&gt;
    &lt;str&gt;application/pdf&lt;/str&gt;
  &lt;/arr&gt;
  &lt;arr name="attr_stream_size"&gt;
    &lt;str&gt;31203&lt;/str&gt;
  &lt;/arr&gt;
  &lt;arr name="content_type"&gt;
    &lt;str&gt;application/pdf&lt;/str&gt;
  &lt;/arr&gt;
  &lt;str name="id"&gt;lsp.pdf&lt;/str&gt;
&lt;/doc&gt;
</code></pre>
<p>I don't think this problem is caused by an incorrect installation of Apache Tika: I previously got a few ServerException errors, but I have since installed the required jars in the correct path. Moreover, I've tried to index a txt file using the same class, and the <strong>attr_content</strong> field is still empty.</p>