StackOverflow2013


plurals

PO
primarykey
Id
10400197
data
AcceptedAnswerId
0
AnswerCount
0
ClosedDate
CommentCount
0
CommunityOwnedDate
CreationDate
2012-05-01T15:43:59.523
FavoriteCount
0
LastActivityDate
2012-11-21T11:06:54.757
LastEditDate
2012-11-21T11:06:54.757
LastEditorUserId
100190
OwnerUserId
97141
ParentId
10348929
PostTypeId
2
Score
6
ViewCount
0
LastEditorDisplayName
text
Body
<ol> <li>First, normalize the text: convert it to all lowercase (or uppercase) characters, replace every non-letter with a space, collapse runs of whitespace into a single space, and remove leading and trailing whitespace. For speed I would perform all of these operations in one pass over the text. Next, take the <code>MD5</code> hash (or something faster) of the resulting string and look the hash up (as two 64-bit integers) in a database table. If it exists, the text is an exact duplicate; if not, add it to the table and proceed to the next step. You will want to age off old hashes based on either time or memory usage.</li> <li>To find near duplicates, the normalized string needs to be converted into potential signatures (hashes of substrings); see the <code>SpotSigs</code> paper and the <a href="http://glinden.blogspot.com/2008/08/clever-method-of-near-duplicate.html" rel="noreferrer">blog post</a> by Greg Linden. Suppose the routine <code>Sigs()</code> does that for a given string; that is, given the normalized string <code>x</code>, <code>Sigs(x)</code> returns a small set (1-5) of 64-bit integers. You could use something like the <code>SpotSigs</code> algorithm to select the substrings of the text used for the signatures, but writing your own selection method could perform better if you know something about your data. You may also want to look at the <code>simhash</code> algorithm (the code is <a href="http://code.google.com/p/simhash/" rel="noreferrer">here</a>).</li> <li>Given <code>Sigs()</code>, the problem of efficiently finding the near duplicates is commonly called the <a href="http://bit.ly/JPTZ2I" rel="noreferrer">set similarity joins</a> problem. The <code>SpotSigs</code> paper outlines some heuristics for trimming the number of sets a new set needs to be compared to, as does the <code>simhash</code> method.</li> </ol>
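The steps in the answer body above could be sketched roughly as follows. This is a minimal illustration under several assumptions, not the answerer's implementation: the text is treated as ASCII (any character outside a-z counts as a non-letter), an in-memory OrderedDict stands in for the database table and ages entries off by count, and sigs() is a hypothetical stand-in that keeps the smallest few hashes of word shingles. It is neither the SpotSigs selection method nor simhash. All class and function names are made up for the example.

```python
from __future__ import annotations

import hashlib
import re
from collections import OrderedDict, defaultdict

# Assumption: ASCII text only; any run of characters outside a-z becomes one space.
_NON_LETTERS = re.compile(r"[^a-z]+")


def normalize(text: str) -> str:
    """Lowercase, replace each run of non-letters with one space, trim the ends."""
    return _NON_LETTERS.sub(" ", text.lower()).strip()


def md5_key(normalized: str) -> tuple[int, int]:
    """Return the 128-bit MD5 digest as two 64-bit integers."""
    digest = hashlib.md5(normalized.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big"), int.from_bytes(digest[8:], "big")


class SeenHashes:
    """Hypothetical in-memory stand-in for the hash table in the database."""

    def __init__(self, max_entries: int = 1_000_000) -> None:
        self._seen: OrderedDict[tuple[int, int], None] = OrderedDict()
        self._max = max_entries

    def check_and_add(self, text: str) -> bool:
        """Return True if the text is an exact duplicate of a remembered one."""
        key = md5_key(normalize(text))
        if key in self._seen:
            self._seen.move_to_end(key)  # a hit counts as recent use
            return True
        self._seen[key] = None
        if len(self._seen) > self._max:
            self._seen.popitem(last=False)  # age off the least recently used hash
        return False


def sigs(normalized: str, k: int = 3, width: int = 4) -> frozenset[int]:
    """Hypothetical Sigs(): the k smallest 64-bit hashes of word shingles."""
    words = normalized.split()
    count = max(1, len(words) - width + 1)
    shingles = {" ".join(words[i:i + width]) for i in range(count)}
    hashes = sorted(
        int.from_bytes(hashlib.md5(s.encode("utf-8")).digest()[:8], "big")
        for s in shingles
    )
    return frozenset(hashes[:k])


class SignatureIndex:
    """Inverted index from signature to document ids, used to find candidates."""

    def __init__(self) -> None:
        self._by_sig: defaultdict[int, set[int]] = defaultdict(set)

    def candidates_then_add(self, doc_id: int, text: str) -> set[int]:
        """Return ids of earlier documents sharing a signature, then index this one."""
        signatures = sigs(normalize(text))
        found: set[int] = set()
        for sig in signatures:
            found |= self._by_sig[sig]
        for sig in signatures:
            self._by_sig[sig].add(doc_id)
        return found
```

Documents returned by candidates_then_add() are only candidates; a real system would still compare their signature sets (or the texts themselves) against a similarity threshold before calling them near duplicates.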
Tags
Title
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. PONear Duplicate Detection in Data Streams
 singulars
 PostTypePostTypeId
 PTQuestion
PostTypePostTypeId
1. PTAnswer
UserLastEditorUserId
1. USMatti Lyra
UserOwnerUserId
1. USJeff Kubina
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. PONear Duplicate Detection in Data Streams
 singulars
 PostTypePostTypeId
 PTQuestion
PostsParentIdCreationDate
1. This table or related slice is empty.
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTAcceptedByOriginator
2. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId
1. This table or related slice is empty.

Guidance

A row detail

Detail views are divided into sections. All the information in the data section comes from columns in the selected row. The other sections display data from other, related rows.

Rows can be related to the selected row in a to-one or a to-many fashion. The caption of a to-many relation links to a list view showing the related table filtered to the matching rows.

Try moving around until you find a non-empty to-many entry and click on the label to get to one. You can move back to the root by clicking on the database name in the header.