
Strange Behavior with ConcurrentUpdateSolrServer Class
<p>I'm using SolrJ to index some files, but I've noticed strange behavior with the <b>ConcurrentUpdateSolrServer</b> class. My goal is to index documents very quickly (15000 documents per second).</p> <p>I've set up one Solr instance on a remote virtual machine (VM) running Linux with 8 CPUs, and I've implemented a Java program with SolrJ on my computer using Eclipse. I'll describe both scenarios I've tried in order to explain my problem:</p> <p>Scenario 1:</p> <p>I ran my Java program from Eclipse to index my documents, pointing the server at the address of my VM like this:</p> <pre><code>String url = "http://10.35.1.72:8080/solr/"; ConcurrentUpdateSolrServer server = new ConcurrentUpdateSolrServer(url, 4000, 20); </code></pre> <p>And I added my documents from a Java class that extends <b>Thread</b>, like this:</p> <pre><code>@Override public void run() { SolrInputDocument doc = new SolrInputDocument(); /* * Processing on document to add fields ... */ try { UpdateResponse response = server.add(doc); /* * Response's analysis */ } catch (SolrServerException | IOException e) { e.printStackTrace(); } } </code></pre> <p>To avoid adding documents sequentially, I used an <b>Executor</b> to add them in parallel, like this:</p> <pre><code>Executor executor = Executors.newFixedThreadPool(nbThreads); for (int j = 0; j &lt; myfileList.size(); j++) { executor.execute(new myclassThread(server, myfileList.get(j))); } </code></pre> <p>When I run this program, the result is fine. All my documents are correctly indexed, as I can see in the Solr admin:</p> <pre><code>Results : numDocs: 3588 maxDoc: 3588 deletedDocs: 0 </code></pre> <p>The problem is that indexing performance is very low (slow indexing speed) compared to indexing without SolrJ, directly on the VM. That's why I created a JAR file of my program to run it on the VM.</p> <p>Scenario 2:</p> <p>So I generated a JAR file with Eclipse and ran it on my VM. 
I changed the server's URL like this:</p> <pre><code>String url = "http://localhost:8080/solr/"; ConcurrentUpdateSolrServer server = new ConcurrentUpdateSolrServer(url, 4000, 20); </code></pre> <p>I ran my JAR file like this, with the <b>same document collection</b> (3588 documents with unique ids):</p> <pre><code>java -jar myJavaProgram.jar </code></pre> <p>And the result in the Solr admin is:</p> <pre><code>Results : numDocs: 2554 maxDoc: 3475 deletedDocs: 921 </code></pre> <p>This result depends on my thread settings (for the Executor and the SolrServer). In short, not all the documents are indexed, but the indexing speed is better. I guess my documents are added too fast for Solr and some are lost.</p> <p>I haven't managed to find the right thread settings. Whether I use many threads or few, I always have losses.</p> <p><b>Questions:</b></p> <ul> <li>Has anyone encountered a problem with the ConcurrentUpdateSolrServer class?</li> <li>Is there an explanation for these losses? Why aren't all my documents indexed in the second scenario? And why are some documents deleted even though they have a unique key?</li> <li>Is there a proper way to add documents with SolrJ in parallel (not sequentially)?</li> <li><del>I've seen another SolrJ class for indexing data: EmbeddedSolrServer. Does this class improve indexing speed, or is it safer than ConcurrentUpdateSolrServer for indexing data?</del></li> <li>When I analyse the response of the add() method, I notice that the result is always OK (response.getStatus() = 0), but that can't be true, because my documents are not correctly indexed. So, is this normal behavior of the add() method or not?</li> <li>To finish, if someone can advise me on how to index data very quickly, I would appreciate it a lot! 
:-)</li> </ul> <p><b>Edit:</b></p> <p>I've tried to slow down my indexing speed using <i>Thread.sleep(time)</i> between each call to the add() method of the ConcurrentUpdateSolrServer.</p> <p>I've tried to commit() after each call to the add() method (I know committing on every add is not a good solution, but it was just to test).</p> <p>I've tried not using an Executor to manage my threads, and instead created one or several static threads.</p> <p>After testing these strategies to index my document collection, I decided to use the EmbeddedSolrServer class to see if the results are better.</p> <p>So I implemented this code to use the EmbeddedSolrServer:</p> <pre><code>final File solrConfigXml = new File( "/home/usersolr/solr-4.2.1/indexation_test1/solr/solr.xml" ); final String solrHome = "/home/usersolr/solr-4.2.1/indexation_test1/solr"; CoreContainer coreContainer; try { coreContainer = new CoreContainer( solrHome, solrConfigXml ); } catch( Exception e ){ e.printStackTrace( System.err ); throw new RuntimeException( e ); } EmbeddedSolrServer server = new EmbeddedSolrServer( coreContainer, "collection1" ); </code></pre> <p>I added the right JARs to make it work, and I managed to index my collection.</p> <p>But after these attempts, I'm still struggling with Solr's behavior... I still have the same losses.</p> <pre><code>Result : Number of documents indexed : 2554 </code></pre> <p>2554 docs / 3588 docs (myCollection) ...</p> <p>I guess my problem is more technical, but my computing knowledge stops there! :( Why do I get losses when I index my documents on my VM, while I don't have these losses when I run my Java program from my computer?</p> <p>Is there a link with Jetty (maybe it cannot absorb the input stream?)? Are there some components (buffers, RAM overflow?) that impose limits in Solr? 
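On the point about add() always returning status 0: as I understand it, ConcurrentUpdateSolrServer queues documents and sends them from background threads, so add() returns an "OK" response immediately, before Solr has actually processed the batch. Failures are reported asynchronously through its handleError() callback, which by default only logs. A hedged sketch (the class name and counter are my own, not from my program) of overriding it to make losses visible:

```java
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;

// Hypothetical subclass: count batches rejected by the background threads,
// since the synchronous add() response cannot reflect them.
public class CountingUpdateServer extends ConcurrentUpdateSolrServer {

    public final AtomicInteger failedBatches = new AtomicInteger();

    public CountingUpdateServer(String url, int queueSize, int threadCount) {
        super(url, queueSize, threadCount);
    }

    @Override
    public void handleError(Throwable ex) {
        failedBatches.incrementAndGet();   // a whole batch was rejected
        ex.printStackTrace(System.err);    // keep the real cause (timeout, 503, ...)
    }
}
```

After all the add() calls, one would call server.blockUntilFinished() and then server.commit(), and only trust numDocs if failedBatches.get() is 0. This cannot be run standalone (it needs the SolrJ jars and a live Solr instance), so treat it as a sketch.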
</p> <p>If I'm not clear enough about my problem, please tell me and I'll try to clarify.</p> <p>Thanks</p> <p>Corentin</p>
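One failure mode consistent with silent losses in the scenarios above is committing, or letting the program finish, while the Executor still has queued add() tasks and while the ConcurrentUpdateSolrServer's internal queue is still draining. A minimal, hedged sketch of the shutdown order (a counter stands in for the real server.add(doc) call, so this runs without Solr):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class DrainBeforeCommit {
    public static void main(String[] args) throws InterruptedException {
        int nbThreads = 20;
        int nbDocs = 3588;                        // size of the test collection
        AtomicInteger added = new AtomicInteger();

        ExecutorService executor = Executors.newFixedThreadPool(nbThreads);
        for (int j = 0; j < nbDocs; j++) {
            // In the real program this task builds a SolrInputDocument and
            // calls server.add(doc); the counter stands in for that call.
            executor.execute(added::incrementAndGet);
        }

        // 1) Stop accepting new tasks and wait until every queued task has run.
        executor.shutdown();
        executor.awaitTermination(10, TimeUnit.MINUTES);

        // 2) With a real ConcurrentUpdateSolrServer, also drain its internal
        //    queue before committing:
        //        server.blockUntilFinished();
        //        server.commit();

        System.out.println(added.get());
    }
}
```

Only after both waits does the admin page's numDocs become meaningful; comparing the counter with numDocs would show whether documents were lost before or after reaching Solr.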