StackOverflow2013


plurals

POPairWise matching millions of records
primarykey
Id
20116667
data
AcceptedAnswerId
20123956
AnswerCount
6
ClosedDate
CommentCount
2
CommunityOwnedDate
CreationDate
2013-11-21T09:14:01.247
FavoriteCount
0
LastActivityDate
2013-11-26T09:56:38.103
LastEditDate
2013-11-25T08:38:55.417
LastEditorUserId
1353323
OwnerUserId
1353323
ParentId
0
PostTypeId
1
Score
3
ViewCount
900
LastEditorDisplayName
text
Body
I have an algorithmic problem at hand. To explain it easily, I will use a simple analogy. I have an input file:
<pre><code>Country,Exports
Australia,Sheep
US,Apple
Australia,Beef
</code></pre>
End goal: I have to find the common products between each pair of countries, so:
<pre><code>{"Australia","New Zealand"}:{"apple","sheep"}
{"Australia","US"}:{"apple"}
{"New Zealand","US"}:{"apple","milk"}
</code></pre>
Process: I read in the input and store it in a TreeMap (country to List of exports), where the strings in each List are interned because there are many duplicates. Essentially, I am aggregating by country: the key is the country and the values are its exports.
<pre><code>{"australia":{"apple","sheep","koalas"}}
{"new zealand":{"apple","sheep","milk"}}
{"US":{"apple","beef","milk"}}
</code></pre>
I have about 1200 keys (countries), and the total number of values (exports) is 80 million altogether. I sort all the values of each key:
<pre><code>{"australia":{"apple","sheep","koalas"}} --> {"australia":{"apple","koalas","sheep"}}
</code></pre>
This is fast, as there are only 1200 Lists to sort.
<pre><code>for (k1 : keys)
    for (k2 : keys)
        if (k1.compareTo(k2) < 0) { // don't compare the same pair twice
            List<String> intersectList = intersectList_func(k1's exports, k2's exports);
            countriespair.put({k1, k2}, intersectList);
        }
</code></pre>
This code block takes very long. I realise it is O(n^2) in the number of countries, around 1200*1200 comparisons, and it has been running for almost 3 hours so far. Is there any way I can speed it up or optimise it? Is an algorithmic change the best option, or are there other technologies to consider?
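For illustration, the aggregation step described above might look roughly like the sketch below. It is not the actual code from this question: the class name `ExportAggregator`, the method `aggregate`, and the choice of `ArrayList` are assumptions made for the example, and the input is assumed to be one `Country,Export` pair per line after a header row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not taken from the question.
class ExportAggregator {

    // Groups "Country,Export" lines by country and sorts each export list.
    static Map<String, List<String>> aggregate(List<String> lines) {
        Map<String, List<String>> byCountry = new TreeMap<>();
        // Start at index 1 to skip the "Country,Exports" header.
        for (int i = 1; i < lines.size(); i++) {
            String[] parts = lines.get(i).split(",", 2);
            if (parts.length < 2) {
                continue; // ignore malformed rows
            }
            String country = parts[0].trim();
            // intern() lets duplicate export names share one String instance
            String export = parts[1].trim().intern();
            byCountry.computeIfAbsent(country, k -> new ArrayList<>()).add(export);
        }
        // Sorted lists are what the merge-style intersection relies on.
        for (List<String> exports : byCountry.values()) {
            Collections.sort(exports);
        }
        return byCountry;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Country,Exports", "Australia,Sheep", "US,Apple", "Australia,Beef");
        // prints {Australia=[Beef, Sheep], US=[Apple]}
        System.out.println(aggregate(lines));
    }
}
```

An `ArrayList` is used here because a merge-style intersection that calls `get(i)` needs constant-time indexed access; with a `LinkedList` each `get(i)` walks the list.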
Edit: Since both Lists are sorted beforehand, intersectList is linear in the list lengths (at most listOne.size() + listTwo.size() comparisons) and NOT O(n^2) as discussed below.
<pre><code>private static List<String> intersectList(List<String> listOne, List<String> listTwo) {
    int i = 0, j = 0;
    List<String> listResult = new LinkedList<String>();
    // Walk both sorted lists in step, advancing whichever side is smaller.
    while (i != listOne.size() && j != listTwo.size()) {
        int compareVal = listOne.get(i).compareTo(listTwo.get(j));
        if (compareVal == 0) {
            listResult.add(listOne.get(i));
            i++;
            j++;
        } else if (compareVal < 0) {
            i++;
        } else {
            j++;
        }
    }
    return listResult;
}
</code></pre>
Update 22 Nov: My current implementation has now been running for almost 18 hours. :|
Update 25 Nov: I have run the new implementation as suggested by Vikram and a few others. It has been running since Friday. My question is: how does grouping by exports rather than by country reduce the computational complexity? I find that the complexity is the same. As Groo mentioned, the complexity of the second part is O(E*C^2), where E is the number of exports and C is the number of countries.
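For reference, one possible reading of the "group by exports" suggestion is sketched below. This is an assumption about what was proposed, not Vikram's actual code: the class `ExportPairs`, the method `commonExports`, and the `"c1,c2"` string key are hypothetical, and each country's list is assumed to contain no duplicate exports and no commas in country names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the inverted-index idea, not code from the thread.
class ExportPairs {

    // Input: country -> exports. Output: "c1,c2" -> exports common to both.
    static Map<String, List<String>> commonExports(Map<String, List<String>> byCountry) {
        // Step 1: invert the map so each export lists the countries that have it.
        Map<String, List<String>> byExport = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : byCountry.entrySet()) {
            for (String export : e.getValue()) {
                byExport.computeIfAbsent(export, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        // Step 2: for each export, emit it for every pair of countries sharing it.
        Map<String, List<String>> pairs = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : byExport.entrySet()) {
            List<String> countries = e.getValue();
            for (int i = 0; i < countries.size(); i++) {
                for (int j = i + 1; j < countries.size(); j++) {
                    String a = countries.get(i);
                    String b = countries.get(j);
                    // Order the two names so each pair has a single key.
                    String key = a.compareTo(b) < 0 ? a + "," + b : b + "," + a;
                    pairs.computeIfAbsent(key, k -> new ArrayList<>()).add(e.getKey());
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        Map<String, List<String>> byCountry = new TreeMap<>();
        byCountry.put("Australia", Arrays.asList("apple", "koalas", "sheep"));
        byCountry.put("New Zealand", Arrays.asList("apple", "milk", "sheep"));
        byCountry.put("US", Arrays.asList("apple", "beef", "milk"));
        System.out.println(commonExports(byCountry));
    }
}
```

In this form the inner loops only visit countries that actually share an export, so the work is the sum, over all exports, of the square of the number of countries listing that export. That reaches O(E*C^2) only when nearly every export is shared by nearly every country; exports held by a single country cost nothing in step 2.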
Tags
<java><algorithm><bigdata>
Title
PairWise matching millions of records
singulars
PostAcceptedAnswerId
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. USprog_guy
UserOwnerUserId
1. USprog_guy
plurals
PostLinksPostIdRelatedPostId
1. PL
 singulars
 LinkTypeLinkTypeId
 LTLinked
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
2. PO
 singulars
 PostTypePostTypeId
 PTAnswer
3. PO
 singulars
 PostTypePostTypeId
 PTAnswer
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 POPairWise matching millions of records
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 POPairWise matching millions of records
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 POPairWise matching millions of records
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId
1. COUsing a SQL DB and querying it would be a possible solution.
 singulars
 PostPostId
 POPairWise matching millions of records
 UserUserId
 USOrtwin Angermeier
2. CO@prog_guy Give your input file to test my code on
 singulars
 PostPostId
 POPairWise matching millions of records
 UserUserId
 USVikram Bhat
