Note that there are some explanatory texts on larger screens.

plurals
  1. POPairWise matching millions of records
    primarykey
    data
    text
    <p>I have an algorithmic problem at hand. To easily explain the problem, I will be using a simple analogy. I have an input file </p> <pre><code>Country,Exports Austrailia,Sheep US, Apple Austrialia,Beef </code></pre> <p>End Goal: I have to find the common products between the pairs of countries so</p> <pre><code>{"Austrailia,New Zealand"}:{"apple","sheep} {"Austrialia,US"}:{"apple"} {"New Zealand","US"}:{"apple","milk"} </code></pre> <p>Process :</p> <p>I read in the input and store it in a TreeMap > Where the List, the strings are interned due to many duplicates. Essentially, I am aggregating by country. where Key is country, Values are its Exports.</p> <pre><code>{"austrailia":{"apple","sheep","koalas"}} {"new zealand":{"apple","sheep","milk"}} {"US":{"apple","beef","milk"}} </code></pre> <p>I have about 1200 keys (countries) and total number of values(exports) is 80 million altogether. I sort all the values of each key:</p> <pre><code>{"austrailia":{"apple","sheep","koalas"}} -- &gt; {"austrailia":{"apple","koalas","sheep"}} </code></pre> <p>This is fast as there are only 1200 Lists to sort.</p> <pre><code>for(k1:keys) for(k2:keys) if(k1.compareTo(k2) &lt;0){ //Dont want to double compare List&lt;String&gt; intersectList = intersectList_func(k1's exports,k2's exports); countriespair.put({k1,k2},intersectList) } </code></pre> <p>This code block takes so long.I realise it O(n2) and around 1200*1200 comparisions.Thus,Running for almost 3 hours till now.. Is there any way, I can speed it up or optimise it. Algorithm wise is best option, or are there other technologies to consider.</p> <p><strong>Edit</strong>: <em>Since both List are sorted beforehand, the intersectList is O(n) where n is length of floor(listOne.length,listTwo.length) and NOT O(n2) as discussed below</em></p> <pre><code>private static List&lt;String&gt; intersectList(List&lt;String&gt; listOne,List&lt;String&gt; listTwo){ int i=0,j=0; List&lt;String&gt; listResult = new LinkedList&lt;String&gt;(); while(i!=listOne.size() &amp;&amp; j!=listTwo.size()){ int compareVal = listOne.get(i).compareTo(listTwo.get(j)); if(compareVal==0){ listResult.add(listOne.get(i)); i++;j++;} } else if(compareVal &lt; 0) i++; else if (compareVal &gt;0) j++; } return listResult; } </code></pre> <p><strong>Update 22 Nov</strong> My current implementation is still running for almost 18 hours. :|</p> <p><strong>Update 25 Nov</strong> I had run the new implementation as suggested by Vikram and a few others. It's been running this Friday. My question, is that how does grouping by exports rather than country save computational complexity. I find that the complexity is the same. As Groo mentioned, I find that the complexity for the second part is O(E*C^2) where is E is exports and C is country.</p>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload