StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

PO
primarykey
Id
15728728
data
AcceptedAnswerId
0
AnswerCount
0
ClosedDate
CommentCount
2
CommunityOwnedDate
CreationDate
2013-03-31T11:03:23.107
FavoriteCount
0
LastActivityDate
2013-03-31T11:03:23.107
LastEditDate
LastEditorUserId
0
OwnerUserId
649665
ParentId
15728266
PostTypeId
2
Score
3
ViewCount
0
LastEditorDisplayName
text
Body
What you're proposing is sometimes called a look-aside table; a secondary table used for various lookup purposes. In your case, you have a number of different possible ways of organizing this table. The most obvious is to not organize it, and use linear search to see if the next element is already known. Since the table will end up containing some 30000 elements, that's probably not a good idea. From the standard library (at least in C++11), there are two possibilities: <code>std::set</code> and <code>std::unordered_set</code>. <code>std::set</code> uses some form of balanced tree, so makes at most lg n comparisions for each lookup (around 15 for 30000 elements); <code>std::unordered_set</code> is a hash table, and with a good hash function, will require as small constant number of comparisons: you should be able to get it down to under 2 on the average (but possibly at a cost of more memory—the lower the load factor, the less the probability of a collision). As you mention, you do have the extra cost of calculating the hash function, and as you point out, this does involve visiting each element in the vector; in the binary tree, all that it required in each comparison is that enough elements are compared to determine order—in many cases, that may be just one or two. (But if you say that there are a lot of duplicates... you cannot detect a duplicate until you've visited all 30 entries, since any one may vary.) The only way to know which solution will actually be faster is to measure both, using typical data; for a data set such as you describe (many duplicates), I suspect the hash table will win, but it's far from certain. Finally, you can use some sort of non-binary tree. If you can really limit the values to a specific range (e.g. -100..100), you can use an ordinary vector or array with pointers to the subnodes, indexing directly with the element value, transposed as necessary. You then just walk the tree until either you find a null pointer, or you reach the end. The maximum depth of the tree will be 30, and in fact, every element will be 30 deep, but typically, you'll find that the element is unique before getting that deep. I suspect (but again, you'ld need to measure) that in your case, with many duplicates, this will in fact be significantly slower than the previous two suggestions. (And it would be a lot more work on your part, because I'm not aware of any existing implementations.) As for hashing, just about any form of linear congruent hashing should be sufficient: FNV, for example. Most of the documentation for such hashes concerns strings (arrays of <code>char</code>), but they tend to work just as well with any integral type. I've generally used something like: <pre><code>template <typename ForwardIterator> size_t hash( ForwardIterator begin, ForwardIterator end ) { size_t results = 2166136261U for ( ForwardIterator current = begin; current != end; ++ current ) { results = 127 * results + static_cast<size_t>( *current ); } return results; } </code></pre> My choice of <code>127</code> as a multiplier is largely based on speed in older systems: multiplying by 127 is a lot faster than most of the other values which give good results. (I have no idea whether this is still true. But multiplication is still a relatively slow operation on many machines, and the compiler will convert <code>127 * x</code> into something like <code>x << 7 - x</code> if that is faster.) The distribution with the above algorithm is about as good as that for FNV, at least with the data sets I've tested.
Tags
Title
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. POFast ways to remove duplicates from a list of integer vectors
 singulars
 PostTypePostTypeId
 PTQuestion
PostTypePostTypeId
1. PTAnswer
UserLastEditorUserId
1. This table or related slice is empty.
UserOwnerUserId
1. USJames Kanze
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. POFast ways to remove duplicates from a list of integer vectors
 singulars
 PostTypePostTypeId
 PTQuestion
PostsParentIdCreationDate
1. This table or related slice is empty.
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId
1. COThanks for your answer. I played around with this a bit, and the hash table is a lot slower than I expected. The computation of the hash function is about 3 times more expansive than 30 comparisons which are the worst case look-up in a trie. Also the std::unordered_map does several 30-loop comparisons when searching in a bucket. I guess the second problem can be rectified by optimizing bucket number etc. So far the trie turned out to be a lot faster.
 singulars
 PostPostId
 PO
 UserUserId
 USThomas
2. COOK, I got it working fast with stl::unordered_set now. I made some implementation errors. Thanks again.
 singulars
 PostPostId
 PO
 UserUserId
 USThomas

Querying!

Guidance

A row detail

Detail views are divided into sections. All the information in the data section comes from columns in the selected row. The other sections display data from other, related rows.

Related data can be related in a to-one or a to-many fashion. Captions of data related in a to-many fashion link to a list view showing a filtered view of the table.

Try moving around until you find a non-empty to-many entry and click on the label to get to one. You can move back to the root by clicking on the database name in the header.