
Strange Behavior with ConcurrentUpdateSolrServer Class
<p>I'm using SolrJ to index some files, but I've noticed strange behavior with the <b>ConcurrentUpdateSolrServer</b> class. My goal is to index documents very quickly (15000 documents per second).</p> <p>I've set up one Solr instance on a remote virtual machine (VM) running Linux with 8 CPUs, and I've implemented a Java program with SolrJ on my computer using Eclipse. I'll describe both scenarios I've tried in order to explain my problem:</p> <p>Scenario 1:</p> <p>I ran my Java program from Eclipse to index my documents, pointing the server at the address of my VM like this:</p> <pre><code>String url = "http://10.35.1.72:8080/solr/"; ConcurrentUpdateSolrServer server = new ConcurrentUpdateSolrServer(url, 4000, 20); </code></pre> <p>And I added my documents from a Java class that extends <b>Thread</b>, like this:</p> <pre><code>@Override public void run() { SolrInputDocument doc = new SolrInputDocument(); /* * Processing on document to add fields ... */ try { UpdateResponse response = server.add(doc); /* * Response's analysis */ } catch (SolrServerException | IOException e) { e.printStackTrace(); } } </code></pre> <p>To avoid adding documents sequentially, I used an <b>Executor</b> to add them in parallel, like this:</p> <pre><code>Executor executor = Executors.newFixedThreadPool(nbThreads); for (int j = 0; j &lt; myfileList.size(); j++) { executor.execute(new myclassThread(server, myfileList.get(j))); } </code></pre> <p>When I run this program, the result is fine. All my documents are correctly indexed, as I can see in the Solr admin:</p> <pre><code>Results : numDocs: 3588 maxDoc: 3588 deletedDocs: 0 </code></pre> <p>The problem is that indexing performance is very low (slow indexing speed) compared to indexing without SolrJ, directly on the VM. That's why I created a JAR file of my program to run it on the VM.</p> <p>Scenario 2:</p> <p>So I generated a JAR file with Eclipse and ran it on my VM. 
I changed the server's URL like this:</p> <pre><code>String url = "http://localhost:8080/solr/"; ConcurrentUpdateSolrServer server = new ConcurrentUpdateSolrServer(url, 4000, 20); </code></pre> <p>I ran my JAR file like this, with the <b>same document collection</b> (3588 documents with unique ids):</p> <pre><code>java -jar myJavaProgram.jar </code></pre> <p>And the result in the Solr admin is:</p> <pre><code>Results : numDocs: 2554 maxDoc: 3475 deletedDocs: 921 </code></pre> <p>This result depends on my thread settings (for the Executor and the SolrServer). In short, not all the documents are indexed, but the indexing speed is better. I guess my documents are added too fast for Solr and some are lost.</p> <p>I haven't managed to find the right thread settings. Whether I use many threads or few, I always have losses.</p> <p><b>Questions:</b></p> <ul> <li>Has anyone encountered a problem with the ConcurrentUpdateSolrServer class?</li> <li>Is there an explanation for these losses? Why aren't all my documents indexed in the second scenario? And why are some documents deleted even though they have a unique key?</li> <li>Is there a proper way to add documents with SolrJ in parallel (not sequentially)?</li> <li><del>I've seen another SolrJ class for indexing data: EmbeddedSolrServer. Does this class improve indexing speed, or is it safer than ConcurrentUpdateSolrServer for indexing data?</del></li> <li>When I analyse the response of the add() method, I notice that the result is always OK (response.getStatus() = 0), but that can't be true, because my documents are not correctly indexed. So, is this normal behavior of the add() method or not?</li> <li>To finish, if someone can advise me on how to index data very quickly, I would appreciate it a lot! 
:-)</li> </ul> <p><b>Edit:</b></p> <p>I've tried to slow down my indexing speed using <i>Thread.sleep(time)</i> between each call to the add() method of the ConcurrentUpdateSolrServer.</p> <p>I've tried to commit() after each call to the add() method (I know committing on every add is not a good solution, but it was just to test).</p> <p>I've tried not using an Executor to manage my threads, and instead created one or several static threads.</p> <p>After testing these strategies to index my document collection, I decided to use the EmbeddedSolrServer class to see if the results are better.</p> <p>So I implemented this code to use the EmbeddedSolrServer:</p> <pre><code>final File solrConfigXml = new File( "/home/usersolr/solr-4.2.1/indexation_test1/solr/solr.xml" ); final String solrHome = "/home/usersolr/solr-4.2.1/indexation_test1/solr"; CoreContainer coreContainer; try { coreContainer = new CoreContainer( solrHome, solrConfigXml ); } catch( Exception e ){ e.printStackTrace( System.err ); throw new RuntimeException( e ); } EmbeddedSolrServer server = new EmbeddedSolrServer( coreContainer, "collection1" ); </code></pre> <p>I added the right JARs to make it work, and I managed to index my collection.</p> <p>But after these attempts, I'm still struggling with Solr's behavior... I still have the same losses.</p> <pre><code>Result : Number of documents indexed : 2554 </code></pre> <p>2554 docs / 3588 docs (myCollection) ...</p> <p>I guess my problem is more technical, but my computing knowledge stops there! :( Why do I get losses when I index my documents on my VM, while I don't have these losses when I run my Java program from my computer?</p> <p>Is there a link with Jetty (maybe it cannot absorb the input stream?)? Are there some components (buffers, RAM overflow?) that impose limits in Solr? 
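On the point about add() always returning status 0: as I understand it, ConcurrentUpdateSolrServer queues documents and sends them from background threads, so add() returns an "OK" response immediately, before Solr has actually processed the batch. Failures are reported asynchronously through its handleError() callback, which by default only logs. A hedged sketch (the class name and counter are my own, not from my program) of overriding it to make losses visible:

```java
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;

// Hypothetical subclass: count batches rejected by the background threads,
// since the synchronous add() response cannot reflect them.
public class CountingUpdateServer extends ConcurrentUpdateSolrServer {

    public final AtomicInteger failedBatches = new AtomicInteger();

    public CountingUpdateServer(String url, int queueSize, int threadCount) {
        super(url, queueSize, threadCount);
    }

    @Override
    public void handleError(Throwable ex) {
        failedBatches.incrementAndGet();   // a whole batch was rejected
        ex.printStackTrace(System.err);    // keep the real cause (timeout, 503, ...)
    }
}
```

After all the add() calls, one would call server.blockUntilFinished() and then server.commit(), and only trust numDocs if failedBatches.get() is 0. This cannot be run standalone (it needs the SolrJ jars and a live Solr instance), so treat it as a sketch.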
</p> <p>If I'm not clear enough about my problem, please tell me and I'll try to clarify.</p> <p>Thanks</p> <p>Corentin</p>
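One failure mode consistent with silent losses in the scenarios above is committing, or letting the program finish, while the Executor still has queued add() tasks and while the ConcurrentUpdateSolrServer's internal queue is still draining. A minimal, hedged sketch of the shutdown order (a counter stands in for the real server.add(doc) call, so this runs without Solr):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class DrainBeforeCommit {
    public static void main(String[] args) throws InterruptedException {
        int nbThreads = 20;
        int nbDocs = 3588;                        // size of the test collection
        AtomicInteger added = new AtomicInteger();

        ExecutorService executor = Executors.newFixedThreadPool(nbThreads);
        for (int j = 0; j < nbDocs; j++) {
            // In the real program this task builds a SolrInputDocument and
            // calls server.add(doc); the counter stands in for that call.
            executor.execute(added::incrementAndGet);
        }

        // 1) Stop accepting new tasks and wait until every queued task has run.
        executor.shutdown();
        executor.awaitTermination(10, TimeUnit.MINUTES);

        // 2) With a real ConcurrentUpdateSolrServer, also drain its internal
        //    queue before committing:
        //        server.blockUntilFinished();
        //        server.commit();

        System.out.println(added.get());
    }
}
```

Only after both waits does the admin page's numDocs become meaningful; comparing the counter with numDocs would show whether documents were lost before or after reaching Solr.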