StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

PO
primarykey
Id
6401068
data
AcceptedAnswerId
0
AnswerCount
0
ClosedDate
CommentCount
1
CommunityOwnedDate
CreationDate
2011-06-19T06:58:05.380
FavoriteCount
0
LastActivityDate
2011-06-19T07:54:02.723
LastEditDate
2011-06-19T07:54:02.723
LastEditorUserId
584347
OwnerUserId
584347
ParentId
6400416
PostTypeId
2
Score
37
ViewCount
0
LastEditorDisplayName
text
Body
I've recently done a similar task, although I was matching new data to existing names in a database, rather than looking for duplicates within one set. Name matching is actually a well-studied task, with a number of factors beyond what you'd consider for matching generic strings. First, I'd recommend taking a look at a paper, How to play the “Names Game”: Patent retrieval comparing different heuristics by Raffo and Lhuillery. The published version is <a href="http://www.sciencedirect.com/science/article/pii/S0048733309001528" rel="noreferrer">here</a>, and a PDF is freely available <a href="http://infoscience.epfl.ch/record/161961/files/Lhuillery%20RP2009b.pdf" rel="noreferrer">here</a>. The authors provide a nice summary, comparing a number of different matching strategies. They consider three stages, which they call parsing, matching, and filtering. Parsing consists of applying various cleaning techniques. Some examples: <ul> <li>Standardizing lettercase (e.g., all lowercase)</li> <li>Standardizing punctuation (e.g., commas must be followed by spaces)</li> <li>Standardizing whitespace (e.g., converting all runs of whitespace to single spaces)</li> <li>Standardizing accented and special characters (e.g., converting accented letters to ASCII equivalents)</li> <li>Standardizing legal control terms (e.g., converting "Co." to "Company")</li> </ul> In my case, I folded all letters to lowercase, replaced all punctuation with whitespace, replaced accented characters by unaccented counterparts, removed all other special characters, and removed legal control terms from the beginning and ends of the names following a list. Matching is the comparison of the parsed names. This could be simple string matching, edit distance, Soundex or Metaphone, comparison of the sets of words making up the names, or comparison of sets of letters or n-grams (letter sequences of length n). The n-gram approach is actually quite nice for names, as it ignores word order, helping a lot with things like "department of examples" vs. "examples department". In fact, comparing bigrams (2-grams, character pairs) using something simple like the <a href="http://en.wikipedia.org/wiki/Jaccard_index" rel="noreferrer">Jaccard index</a> is very effective. In contrast to several other suggestions, Levenshtein distance is one of the poorer approaches when it comes to name matching. In my case, I did the matching in two steps, first with comparing the parsed names for equality and then using the Jaccard index for the sets of bigrams on the remaining. Rather than actually calculating all the Jaccard index values for all pairs of names, I first put a bound on the maximum possible value for the Jaccard index for two sets of given size, and only computed the Jaccard index if that upper bound was high enough to potentially be useful. Most of the name pairs were still dissimilar enough that they weren't matches, but it dramatically reduced the number of comparisons made. Filtering is the use of auxiliary data to reject false positives from the parsing and matching stages. A simple version would be to see if matching names correspond to businesses in different cities, and thus different businesses. That example could be applied before matching, as a kind of pre-filtering. More complicated or time-consuming checks might be applied afterwards. I didn't do much filtering. I checked the countries for the firms to see if they were the same, and that was it. There weren't really that many possibilities in the data, some time constraints ruled out any extensive search for additional data to augment the filtering, and there was a manual checking planned, anyway. 
Tags
Title
singulars
PostAcceptedAnswerId
1. This table or related slice is empty.
PostParentId
1. POFigure out if a business name is very similar to another one - Python
 singulars
 PostTypePostTypeId
 PTQuestion
PostTypePostTypeId
1. PTAnswer
UserLastEditorUserId
1. USMichael J. Barber
UserOwnerUserId
1. USMichael J. Barber
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. POFigure out if a business name is very similar to another one - Python
 singulars
 PostTypePostTypeId
 PTQuestion
PostsParentIdCreationDate
1. This table or related slice is empty.
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
2. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 PO
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId
1. This table or related slice is empty.

Querying!

Guidance

A row detail

Detail views are divided into sections. All the information in the data section comes from columns in the selected row. The other sections display data from other, related rows.

Related data can be related in a to-one or a to-many fashion. Captions of data related in a to-many fashion link to a list view showing a filtered view of the table.

Try moving around until you find a non-empty to-many entry and click on the label to get to one. You can move back to the root by clicking on the database name in the header.