
Parallel crawling through a large graph with circular references
In short: I want to process a large graph with circular references in parallel. I don't have access to the full graph up front, so I have to crawl through it, and I want to organize an efficient queue for doing that. I'm interested in whether there are any best practices for this.

I'm trying to organize a continuous processing flow with the following strategy: each thread takes a node from the queue and processes it; processing may discover new nodes, which the thread then has to put back into the queue (a sketch of such a worker loop follows the question). No node may be processed more than once. Nodes are immutable entities.

As I understand it, I need thread-safe implementations of a queue and of a set (for the already visited nodes).

**I'm trying to avoid synchronized methods.** So, my implementation of this flow:

1. When a thread adds nodes to the queue, it checks each node: if the visited-nodes set already contains the node, the thread does not add it to the queue. But that's not all.

2. When a thread takes a node from the queue, it checks whether the visited-nodes set contains that node. If it does, the thread takes another node from the queue, until it gets a node that hasn't been processed yet. After finding an unprocessed node, the thread also adds it to the visited-nodes set.

I've tried to use LinkedBlockingQueue and ConcurrentHashMap (as a set). I chose ConcurrentHashMap because it has the method putIfAbsent(key, value), which, as I understand it, atomically checks whether the map contains the key and adds it if it doesn't (a small standalone illustration of this also follows the question).

Here is the implementation of the described algorithm:

```java
import java.util.Collection;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

public class ParallelDataQueue {

    private LinkedBlockingQueue<String> dataToProcess = new LinkedBlockingQueue<String>();

    // Using the map as a set: only the keys matter, every entry maps to the same dummy value.
    private ConcurrentHashMap<String, Object> processedData = new ConcurrentHashMap<String, Object>( 1000000 );

    private final Object value = new Object();

    public String getNextDataInstance() {
        while ( true ) {
            try {
                String data = this.dataToProcess.take();
                // putIfAbsent returns null only for the thread that actually inserted the key,
                // so only one thread ever receives a given node from this method.
                Boolean dataIsAlreadyProcessed = ( this.processedData.putIfAbsent( data, this.value ) != null );
                if ( dataIsAlreadyProcessed ) {
                    continue;
                } else {
                    return data;
                }
            } catch ( InterruptedException e ) {
                e.printStackTrace();
            }
        }
    }

    public void addData( Collection<String> data ) {
        for ( String d : data ) {
            // Best-effort filter: nodes already marked as visited are not re-queued.
            if ( !this.processedData.containsKey( d ) ) {
                try {
                    this.dataToProcess.put( d );
                } catch ( InterruptedException e ) {
                    e.printStackTrace();
                }
            }
        }
    }

}
```

So my question is: does the current implementation avoid processing any node more than once? And maybe there is a more elegant solution?

Thanks.

**P.S.**

I understand that this implementation doesn't prevent duplicate nodes from appearing in the queue. But for me that is not critical: all I need is to avoid processing any node more than once.
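
For clarity, here is a minimal, standalone sketch of the putIfAbsent semantics the question relies on. It is not part of the original code; the key "node-42" is just a placeholder, and the newKeySet() variant (available since Java 8) is mentioned only as a possible alternative to the map-as-set approach above.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class PutIfAbsentDemo {
    public static void main(String[] args) {
        // Map used as a set: putIfAbsent returns null only for the caller
        // that actually inserted the key, so exactly one thread "wins" per key.
        ConcurrentHashMap<String, Object> visited = new ConcurrentHashMap<>();
        Object marker = new Object();

        System.out.println(visited.putIfAbsent("node-42", marker) == null); // true  - first insert
        System.out.println(visited.putIfAbsent("node-42", marker) == null); // false - key already present

        // Since Java 8, a ConcurrentHashMap-backed set can be used directly;
        // Set.add has the same "exactly one winner" semantics.
        Set<String> visitedSet = ConcurrentHashMap.newKeySet();
        System.out.println(visitedSet.add("node-42")); // true
        System.out.println(visitedSet.add("node-42")); // false
    }
}
```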
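
As a rough illustration of the intended crawl flow (worker threads taking nodes, processing them, and feeding discovered neighbours back into the queue), here is a hedged sketch of how the class above might be driven. The CrawlerWorker class, its process method, and the seed value "root" are hypothetical placeholders, not part of the question.

```java
import java.util.Arrays;
import java.util.Collection;

public class CrawlerWorker implements Runnable {

    private final ParallelDataQueue queue;

    public CrawlerWorker(ParallelDataQueue queue) {
        this.queue = queue;
    }

    @Override
    public void run() {
        // Continuous processing flow, as described in the question.
        while (!Thread.currentThread().isInterrupted()) {
            // Blocks until an unprocessed node is available.
            String node = queue.getNextDataInstance();
            Collection<String> discovered = process(node);
            // Newly discovered neighbours go back into the queue;
            // addData filters out nodes that were already visited.
            queue.addData(discovered);
        }
    }

    // Hypothetical processing step: visit the node and return its neighbours.
    private Collection<String> process(String node) {
        return Arrays.asList(node + "/left", node + "/right");
    }

    public static void main(String[] args) {
        ParallelDataQueue queue = new ParallelDataQueue();
        queue.addData(Arrays.asList("root"));
        for (int i = 0; i < 4; i++) {
            new Thread(new CrawlerWorker(queue), "crawler-" + i).start();
        }
    }
}
```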
 
